Wednesday, June 13, 2012

hadoop/hive with Tableau


It was in 2010 that I had my first taste of hadoop/hive. Back then I was still using hadoop 0.20 and was doing a proof of concept for a customer who wanted to see whether hadoop could be a solution for their problems.

Since then, I have been reading up on and following the changes in the hadoop world, and tweaking things here and there on my home installation. Today, I tried to set up hive on hadoop (without hbase; with hbase will be the next experiment) to see how I could get it playing nicely.

The experience was awesome as usual, and it reinforces my belief that the hadoop ecosystem has a huge role to play in the computing industry of tomorrow.

The analytical potential of the volumes of data managed by hadoop-style systems is ever increasing, hence the interest from many instant BI players in providing access to the data behind hadoop.

One such player is the instant dashboard tool Tableau. They have announced that Tableau 7 will be able to read data directly from hive environments.

In real life it was a bit of a challenge, but what's the fun if there is no challenge? In a nutshell, it does work, no doubt. However, the configuration and administration it requires can be tricky.

1. You have to install the hive driver (available from Tableau's website - http://www.tableausoftware.com/support/drivers).

2. You have to launch hive in a particular way, as a service (hive --service hiveserver). Also, hive on a pseudo cluster only allows one connection at a time (since the metadata store on Derby is single user). As a result, if you are using Tableau connectivity, nothing else can access hive, not even the command line interface.

3. Remember that each addition/change to the data set in the Tableau interface triggers a map-reduce job on the hive cluster/server, and that hadoop/hive are not really meant to be fast, responsive systems. Therefore, expect long delays in fulfilling your drag and drop requests.

4. There might/will be additional trouble aggregating certain types of data, since the data types on hive might not be additive in the same way the front end expects them to be.
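
For point 2, a quick way to confirm that the hive service is actually up before pointing Tableau at it is to check whether its port accepts connections. The sketch below is only illustrative: it assumes the service runs on localhost and listens on port 10000 (the usual default for the hive server; adjust if your setup differs), and the helper name is_port_open is my own, not part of hive or Tableau.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: the hive service runs locally on port 10000.
    host, port = "localhost", 10000
    state = "reachable" if is_port_open(host, port) else "not reachable"
    print(f"hive service at {host}:{port} is {state}")
```

Note that this only tells you something is listening on the port. It does not open a hive session, so it should not take up the single connection mentioned in point 2.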

All in all, it wins me over with the ease of use it provides for accessing the data behind the hadoop environment. However, faster ways already exist to achieve the same result.