Wednesday, June 13, 2012

hadoop/hive with Tableau


It was in 2010 that I had my first taste of hadoop/hive.  Back then I was still using hadoop 0.20 and was doing a proof of concept for a customer, who wanted to see if hadoop could be a solution for their problems.

Since then, I have been reading up and following the changes in the hadoop world, and tweaking things here and there in my home installation.  Today, I tried to mount hive on hadoop (without hbase; with hbase will be the next experiment) and see how I could get them playing nicely together.

The experience is awesome as usual, and it reinforces my belief that the hadoop ecosystem has a huge role to play in the computing industry of tomorrow.

The analytical potential of the volumes of data managed by hadoop-style systems keeps growing, hence the interest from many instant-BI players in providing access to the data behind hadoop.

One such player is the instant dashboard tool Tableau.  They have announced that Tableau 7 can read data directly from hive environments.

In real life it was a bit of a challenge, but what's the fun if there is no challenge?  In a nutshell, it does work, no doubt.  However, the configuration and administration required can be tricky.

1. You have to install the hive driver (available from their website - http://www.tableausoftware.com/support/drivers)

2. You have to launch hive in a particular way, as a service (hive --service hiveserver; see the sketch after this list).  Also, hive on a pseudo cluster only allows one connected user (since the metadata store on Derby is single user).  As a result, if you are using the Tableau connectivity, nothing else can access hive, not even a command line interface.

3. Remember that each addition/change to the data set on the Tableau interface triggers a map-reduce job on the hive cluster/server.  And hadoop/hive are not really meant to be fast, responsive systems.  Therefore, expect long delays in fulfilling your drag and drop requests.

4. There might be additional trouble aggregating certain types of data, since the data types on hive might not be additive in the same way the front end expects them to be.
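For reference, here is a minimal sketch of how I launch hive as a service (point 2 above).  The backgrounding and log redirection are just my habit, not anything Tableau requires -

# start the Thrift hiveserver so that external tools (Tableau, JDBC/ODBC clients) can connect
# it listens on port 10000 by default
nohup hive --service hiveserver > hiveserver.log 2>&1 &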

All in all, it wins me over with the ease of access it provides to the data behind the hadoop environment; however, faster ways to achieve the same result already exist.

Saturday, May 5, 2012

Mounting an aws instance into Nautilus in Ubuntu

For a particular project I needed to transfer files from my local desktop system (running Ubuntu) to my aws instance, also running Ubuntu.  Shell based connections work just fine for me; however, I needed a GUI solution.  A GUI was favoured since I wanted a quick drag and drop way to move files easily across the two instances.

Although Nautilus provides a "Connect to Server" option, and you can bookmark it as well, the trick/issue is that with aws instances you are dealing with private/public key authentication, not password based authentication.

For some reason the "Connect to Server" option doesn't allow you to supply a key file, and therefore that authentication method doesn't work in that case.

After quite a bit of research and googling, I put the question up on stackoverflow.com, where I got a prompt and sweet response.

The trick was to use the .ssh/config file and create an alias there, specifying the identity file; from then on, an ssh or sftp call is no challenge, since ssh picks the settings up from the config file automatically.

Here are the sample contents of the .ssh/config file that worked for me -

Host {alias}
HostName {hostname or IP address}
User {username to log in with}
IdentityFile {path to abcd.pem}
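To make it concrete, suppose the alias is called myaws (all the names and paths below are made up); then both the command line and Nautilus pick up the identity file without any prompting -

# assuming a Host entry called "myaws" with HostName, User and IdentityFile filled in
ssh myaws          # no -i flag needed any more
sftp myaws         # file transfer works the same way
# in Nautilus, "Connect to Server" (or the Ctrl+L location bar) with a location like sftp://myaws/home/ubuntu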
With this in place, it's possible for me to have a bookmark, and clicking on it just opens the target location (like a mount point).  Sweet little trick, and it makes me like Ubuntu a bit more...

Sunday, April 29, 2012

Starting off with Rails 3.2.2 on Ubuntu 11

After a long time, I picked up building a web application for a friend. I haven't worked on web apps for about 6-8 months now, and was quite rusty.

Ruby on Rails was a natural choice. However, when I issued a "gem update", it turned out that rails is on 3.2.2 these days, which means there have been at least 4 releases since I last worked with rails.

I typically work in a way that learning happens while working.  Therefore, I just started building a new application and generated a default scaffold (I know, it's an old habit of testing an installation using scaffold)... the scaffold generated, which means that backward compatibility is still alive. Server started... and bang, I launched chrome to see how localhost:3000 looks...

It bombed... there was this "ExecJS::RuntimeError" staring me in the face...

It turned out that Ubuntu was the culprit: rails needs a JS runtime engine, Ubuntu doesn't provide one by default, and installing nodejs fills that gap.

after "sudo apt-get install nodejs", its all smooth and shiney...

Will share further...

Sunday, April 22, 2012

Life is once.. just this once...

Watching a movie and then reminiscing about my own life's memories made me realize this.. rather abruptly... that whatever is going to be, is going to be in this single life only.. there is not going to be another chance to fix situations.. relations.. or anything else...

If you have not had something in life, there is not going to be another chance to get it.. something like a second attempt at childhood/growing up... it's probably better to get over it and move on in life. Or if possible, do something about it.

Saturday, April 14, 2012

Infrastructure in US

I was in US (west coast) for a few weeks.

During the stay there, I realized that the state of infrastructure in the US is really not that great.  The picture that is presented to the outside world is really skewed.

I'll just take one example: the railway network.  Try to position yourself in Los Angeles, and strip yourself of a car.  Now imagine your transportation options... it's hard to find any, and the ones that exist are horribly skewed against efficiency.  This is simply because of the lack of a decent railway network.

I didn't have a car there, and therefore was in exactly that situation.  I tried to visit a friend in San Francisco, and there was no decent option that I could use.  The only options were to drive (which I couldn't, since I didn't have a car), or fly, which is relatively costly.

The point is, if you don't have a car in the California region, it's really hard to imagine getting around.  When I put this question to some colleagues there, the response was even harder to digest than the transport situation.

I got to know that the car companies had bought out the railway networks (which were private companies) and shut them down so that the car business could thrive.  As a result, the railway network slowly died down, and all the "infrastructure" development happened for the benefit of car users.  Which, I'll admit, is a great job done: absolutely fantastic roads, other services etc.

Now compare this with a typical locality in Europe.  You'd find excellent roads as well as a train network.  You can pick any country, and it has better-than-average roads as well as a railway network.  That tells me that by having both transportation models, citizens gain and less fuel and money is spent.


I wonder whether the US govt doesn't know this (hard to imagine), and if it does know about it, why doesn't it do anything to fix the situation?

Thursday, April 12, 2012

App vs web browser based access to websites from devices

Every other website these days launches its own app as soon as it finds a decent following among customers.  However, there are some things we need to watch out for when using apps as against browsing the same content in a web browser.

A browser is a relatively safe sandbox when it comes to executing website content and rendering it.  There really has to be a loophole in the browser engine for a website to exploit it and do weird things to your device, be it a phone, tablet or laptop/desktop.

On the other hand, when we install "apps" for websites, we grant them "permissions" to do things on our devices. This works nicely on a foundation of trust: I trust the website, and therefore I trust their app not to do anything untoward to my device.  This trust can be unfounded in some cases, and lead to unknown actions/behaviours from apps.

Among the benefits of using apps: they do provide a better user experience (in most cases), since the rendering is specific to the device.  Also, apps provide a lot more customised user-interaction information to their base websites, thereby providing more accurate and contextual intelligence about their usage. There are also reports about apps stealing private information from devices.

Most of the time, the reason an app gains access to information is the granularity of the access control used on the device. If it is too fine-grained, the permissions to be granted become a huge list; if it is too coarse, you can grant too much access without intending to. This is where the maturity of the device operating system comes in.

With web browsers, the information sent back for analytics purposes is rather generic, since it comes from the browser sandbox.

I believe it's safe to say that using apps is a bit of a trade-off between the user experience and the safety/privacy of the user.  Let's be a bit more careful about which apps we download and use, and what permissions each app needs. Let's just be a bit more skeptical, and hopefully end up safer for it.

PS - A lot of other comparisons already exist; however, it's hard to say how many of them talk about the security aspects. User experience is one major discussion point, for sure.  Try googling it.

Monday, April 2, 2012

time zone conversion in Oracle

Oftentimes we need to see a given timestamp column in a particular time zone.  Even without casting, Oracle allows a very simple way -


SELECT <column-name> AT TIME ZONE '<time zone>' FROM <table-name>;

This method saves the expensive cast operations.
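A quick illustration from the shell, using the sqlplus client; the table, column, credentials and time zone below are made up for the example -

# show a timestamp column converted to US Eastern time
echo "SELECT order_ts AT TIME ZONE 'US/Eastern' FROM orders;" | sqlplus -s scott/tiger@ORCL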

Informatica Questions from a friend - Part 2 - Schedulers

Need for Scheduling and Commonly Used Schedulers
 
Any and all data warehousing environments need some kind of scheduler set up, to enable jobs to run at periodic intervals without human intervention. Another important feature is the repeatability of the jobs set up this way. Without the help of a scheduler, things would become very ad-hoc and thus prone to errors and mess-ups.

Oracle provides a built-in scheduling facility, accessible through its dbms_scheduler package. Unix provides basic scheduling through the cron command. Similarly, Informatica also provides basic scheduling facilities in the Workflow Manager client. The features provided by these tools are fairly limited, often restricted to launching a job at a given time, basic dependency management etc.

However, in real-life data warehousing solutions the required functionality is a lot more sophisticated than what these basic features offer. Hence the need for full-fledged scheduling tools, e.g. Tivoli Workload Scheduler, Redwood Cronacle, Control-M, Cisco Tidal etc.

Most of these tools provide sophisticated launch control and dependency management features, and therefore allow the data warehouse to be instrumented at finer levels. Some of them, e.g. Tidal for Informatica and Redwood for Oracle, also support the tools' APIs, integrating even better with the corresponding tool.
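As a tiny illustration of the "launch a job at a given time" part, a cron entry can kick off an Informatica workflow through pmcmd. The paths, service, folder and workflow names below are purely made up -

# hypothetical crontab entry: start a nightly load workflow at 02:00 every day
0 2 * * * /opt/informatica/server/bin/pmcmd startworkflow -sv INT_SVC -d DOM_DEV -u loader -p '****' -f NIGHTLY wf_nightly_load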

Thursday, March 29, 2012

Informatica Questions from a friend - Part 1

HOW TO USE A COBOL FILE FOR TRANSFORMATION
 
Informatica allows reading data from cobol copybook formatted data files. These files mostly come from mainframe based source systems. Given that many of the world's leading business systems still run on IBM mainframes, e.g. airlines, banks, insurance companies etc., these systems act as a major source of information for data warehouses, and thus for our Informatica mappings.

To use a cobol copybook structure as a source, you have to put that copybook into an empty skeleton cobol program:
IDENTIFICATION DIVISION.
PROGRAM-ID. RAGHAV.

ENVIRONMENT DIVISION.
SELECT FILE-ONE ASSIGN TO "MYFILE".

DATA DIVISION.
FILE SECTION.
FD FILE-ONE.

COPY "RAGHAV_COPYBOOK.CPY".

WORKING-STORAGE SECTION.

PROCEDURE DIVISION.

STOP RUN.

The copybook file can be a plain record structure.
Read more about defining copybooks around here.
 
 
 

Wednesday, March 28, 2012

Counting columns in a tab delimited file


It sounds so simple; however, when you sit down to write it, especially as a single line expression, it can take a while.

In my experiments I found it rather easier to count columns with other delimiters than with the TAB character.
Here is the command for counting columns -

cat <FILENAME> | awk -F'\t' '{print NF}'

The extra cat process is unnecessary and can slow things down with larger files, so here is an alternative without it...

awk -F'\t' '{print NF}' <FILENAME>
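Note that both forms print a count for every line of the file. If all you want is the column count from the first line (say, a header row), a small variant stops right there -

# print the number of tab separated fields in the first line only, then stop reading
awk -F'\t' 'NR==1 {print NF; exit}' <FILENAME>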


 

Tuesday, March 13, 2012

how to find sql id of a long running sql in oracle


SQL for finding the sql_id and other details of a long running query. Often useful for sending kill/monitoring instructions to DBA friends..

select distinct t.sql_id, s.inst_id,s.sid, s.serial#,s.osuser, s.program, s.status, t.sql_text
from gv$session s, gv$sqlarea t
where s.username = '<USERNAME>'
and s.sql_id = t.sql_id
and t.sql_text like '%<provide a segment of the sql to identify it>%';

Thursday, March 8, 2012

Partitioning in Informatica

Sourced from the Internet.

All transformations have some basic counters that indicate the number of input rows, output rows, and error rows.
Source Qualifier, Normalizer, and target transformations have additional counters that indicate the efficiency of data moving into and out of buffers. Use these counters to locate performance bottlenecks.
Some transformations have counters specific to their functionality. For example, each Lookup transformation has a counter that indicates the number of rows stored in the lookup cache.
When you view the performance details file, the first column displays the transformation name as it appears in the mapping, the second column contains the counter name, and the third column holds the resulting number or efficiency percentage. If you use a Joiner transformation, the first column shows two instances of the Joiner transformation:

<Joiner transformation> [M]. Displays performance details about the master pipeline of the Joiner transformation.
<Joiner transformation> [D]. Displays performance details about the detail pipeline of the Joiner transformation.

When you create multiple partitions, the Integration Service generates one set of counters for each partition. The following performance counters illustrate two partitions for an Expression transformation:
Transformation Name    Counter Name              Counter Value
EXPTRANS [1]           Expression_input rows     8
EXPTRANS [1]           Expression_output rows    8
EXPTRANS [2]           Expression_input rows     16
EXPTRANS [2]           Expression_output rows    16
Note: When you increase the number of partitions, the number of aggregate or rank input rows may be different from the number of output rows from the previous transformation.
The following table describes the counters that may appear in the Session Performance Details area or in the performance details file:
Aggregator and Rank Transformations:
  Aggregator/Rank_inputrows - Number of rows passed into the transformation.
  Aggregator/Rank_outputrows - Number of rows sent out of the transformation.
  Aggregator/Rank_errorrows - Number of rows in which the Integration Service encountered an error.
  Aggregator/Rank_readfromcache - Number of times the Integration Service read from the index or data cache.
  Aggregator/Rank_writetocache - Number of times the Integration Service wrote to the index or data cache.
  Aggregator/Rank_readfromdisk - Number of times the Integration Service read from the index or data file on the local disk, instead of using cached data.
  Aggregator/Rank_writetodisk - Number of times the Integration Service wrote to the index or data file on the local disk, instead of using cached data.
  Aggregator/Rank_newgroupkey - Number of new groups the Integration Service created.
  Aggregator/Rank_oldgroupkey - Number of times the Integration Service used existing groups.

Lookup Transformation:
  Lookup_inputrows - Number of rows passed into the transformation.
  Lookup_outputrows - Number of rows sent out of the transformation.
  Lookup_errorrows - Number of rows in which the Integration Service encountered an error.
  Lookup_rowsinlookupcache - Number of rows stored in the lookup cache.

Joiner Transformation (Master and Detail):
  Joiner_inputMasterRows - Number of rows the master source passed into the transformation.
  Joiner_inputDetailRows - Number of rows the detail source passed into the transformation.
  Joiner_outputrows - Number of rows sent out of the transformation.
  Joiner_errorrows - Number of rows in which the Integration Service encountered an error.
  Joiner_readfromcache - Number of times the Integration Service read from the index or data cache.
  Joiner_writetocache - Number of times the Integration Service wrote to the index or data cache.
  Joiner_readfromdisk* - Number of times the Integration Service read from the index or data files on the local disk, instead of using cached data.
  Joiner_writetodisk* - Number of times the Integration Service wrote to the index or data files on the local disk, instead of using cached data.
  Joiner_readBlockFromDisk** - Number of times the Integration Service read from the index or data files on the local disk, instead of using cached data.
  Joiner_writeBlockToDisk** - Number of times the Integration Service wrote to the index or data cache.
  Joiner_seekToBlockInDisk** - Number of times the Integration Service accessed the index or data files on the local disk.
  Joiner_insertInDetailCache* - Number of times the Integration Service wrote to the detail cache. The Integration Service generates this counter if you join data from a single source.
  Joiner_duplicaterows - Number of duplicate rows the Integration Service found in the master relation.
  Joiner_duplicaterowsused - Number of times the Integration Service used the duplicate rows in the master relation.

All Other Transformations:
  Transformation_inputrows - Number of rows passed into the transformation.
  Transformation_outputrows - Number of rows sent out of the transformation.
  Transformation_errorrows - Number of rows in which the Integration Service encountered an error.

* The Integration Service generates this counter when you use sorted input for the Joiner transformation.
** The Integration Service generates this counter when you do not use sorted input for the Joiner transformation.
If you have multiple source qualifiers and targets, evaluate them as a whole. For source qualifiers and targets, a high value is considered 80-100 percent. Low is considered 0-20 percent.