Its a catchy title, but yes.. thats what I am going to talk about...
I came across hadoop, when I was looking for a new solution for one of our in-house projects. The need was quite clear, however, the solution had to be dramatically different.
The one statement we received from business was, "We need an exceptionally fast search interface". And for that fast interface to search upon they had more than a hundred million rows worth of data in a popular RDBMS.
So, when I sat about thinking, how to make a fast search application, the first thing that came to my mind was, Google. Actually, whenever we talk about speed or performance of web sites, Google is invariably the first name that comes across.
Further, Google has a plus point that there is always some activity at the back end to generate the page or results that we see, its never static content. And, then, another point, Google has a few trillion pieces of information to store/index/search whereas our system was going to have significantly lower volume of data to manage. So, going with that, Google looked like a very good benchmark for this fast search application.
Then I started to look for "How Google generates that kind of performance". There are quite a few pages on the web talking about just that. But, probably none of them has the definitive/authoritative view on Google's technology or for that matter the insider's view on how it actually does what it does so fast.
Some pages pointed towards their storage technology, some talked about their indexing technology, some about their access to huge volumes of high performance hardware and what not...
For me, some of them turned out to be genuinely interesting, one of them was the indexing technology. There has to be a decent indexing mechanism to which the crawler's would feed and the search algorithms hit. The storage efficiency is probably the next thing to come in the play. How fast can they access the corresponding item ?
Another of my observation is that, the search results (the page mentioning page titles and stuff) comes real fast, mostly less than 0.25 seconds, but the click on the links does take some time. So, I think it has to be their indexing methodology that plays the bigger role.
With that in mind, I sat about finding what can do similar things and how much of Google's behaviour they can simulate/implement.
Then I found Hadoop project on apache (http://hadoop.apache.org/) which to a large extent reflects the way Google kind of system would work. It provides distributed computing(hadoop core), it provides a bigTable kind of database (hbase), provides map/reduce layer, and more. Reading into it more, I figured out that this system is nice for a batch processing kind for mechanism, but not for our need of real time search.
Then I found solr(http://lucene.apache.org/solr/), a full text search engine under Apache Lucene. It is a java written, xml indexing based genuinely fast search engine. It provides many features that we normally wish for in more commercial applications, an being from apache, I would like to think of it as much more reliable and stable than compared to many others.
When we sat about doing a Proof of Concept with it, I figured out a few things –
• It supports only one schema, as in, rdbms tables – only one. So, basically you would have to denormalize all your content to fit into this one flat structure.
• It supports interactions with the server interface only through http methods be it the standard methods get/put etc or be it REST like interfaces.
• It allows you loading data in varying formats, through xml documents, through delimited formats and through db interactions as well.
• It has support for clustering as well. Either you can host it on top of something like hadoop or you can just configure it to do it within solr as well.
• It supports things like expression and function based searches
• It supports faceting
• Extensive caching and “partitioning” features.
Besides other features, the kind of performance without any specific tuning efforts made me think of it as a viable solution.
In a nutshell, I loaded around 50 million rows on a “old” Pentium-D powered desktop box with 3 GB RAM running ubutnu 10.04 server edition (64 bit) with two local hard disks configured over a logical volume manager.
The loading performance was not quite great. Though its not that bad either. I was able to load a few million rows (in a file that was sized about 6 GB) in about 45 minutes when the file was on the same file system.
In return, it gave me query performances in the range of 2-4 seconds for the first query. For subsequent re-runs of the same query (within a span of an hour or so), it came back in approx 1-2 milliseconds. I would like to think that its pretty great performance given the kind of hardware I was running upon, and the kind of tuning effort I put in (basically none – zero, I just ran the default configuration).
Given that, I wont say that I have found the equivalent or replacement of Google’s search for our system, but yeah, we should be doing pretty good with this.
Although there is more testing and experimentation that is required to be able to judge solr better, the initial tests look pretty good.. pretty much in line with the experiences of others who are using it.