Super fast full text search on Rails with Sphinx and Ultrasphinx | January 30th, 2008
Full text search is difficult. I mean really hard to do right: good relevance, filtering, fast querying and fast indexing. Searching is just not something any business should try to solve as a core objective unless they are planning to take on Google… or they have reams of unstructured data sitting in weird silos with no other way of getting at it. Search simply should not replace solid Information Architecture. Content should be inherently findable without an ambiguous search box. The use cases for a powerful full text search are those very rare instances where there is either just a tonne of content that is too difficult to classify, or you are dealing with content that has little or no structure, metadata or taxonomy.
And, of course, my next big project required full text search. Massive amounts of data. Very little structure. Disparate content sources.
In past projects I’ve had some limited success, and spectacular failure, with Ferret, a Ruby port of the Java-based Lucene search library. It was slow, leaked memory and generally caused more pain than it should have.
On the next full text search project I ran into, our client decided that the Google Mini Search Appliance would be the cheapest and quickest way to go. He was right. It provides an out-of-the-box search experience with fantastic relevance. This is mighty Google in a box, after all. The downside? The Google Mini allows only limited customization without cycling through reams of skanky XSL/T, and deep integration is nearly impossible. That’s an OK trade-off depending on your goals; for our project it just wouldn’t work.
My next thought was a project called Solr, which is essentially an abstraction layer on top of (again) Lucene. Solr has an accompanying Rails plugin called, appropriately, acts_as_solr. The only issue I could immediately see with Solr is having to maintain a second server instance, though that is something you’ll nearly always have to do for full text search anyway.
Then Sphinx happened to appear on my radar via the excellent Deploying Rails Google Group. To make a long story short, here are a couple of quotes from Ezra Zygmuntowicz, of Engine Yard fame, who literally wrote the book on deploying Rails applications:
Ferret is unstable in production. Segfaults, corrupted indexes galore. We’ve switched around 40 clients from ferret to sphinx and solved their problems this way. I will never use ferret again after all the problems I have seen it cause people’s production apps. Plus sphinx can reindex many, many times faster than ferret and uses less CPU and memory as well.
Right about here you can hear me say, in a western Canadian accent, “Sphinx, eh…”. Ezra continues in a later post:
We have a bunch of clients using Solr as well. In general it is more powerful than sphinx but a lot slower to reindex and query. Also it uses 50 times the memory of sphinx. If you have a box or VM to put Solr on by itself then it is a good option as well, but if sphinx can do everything you need from a search indexer then it is a way better option cost-wise.
OK, I was fully intrigued by this crazy Sphinx. A little searching around and I came across an excellent article from Ben Smith about using Rails with Sphinx. Following his instructions, compiling Sphinx with support for MySQL 5 (installed via Darwin Ports) was pretty easy. I chose the Ultrasphinx plugin, which makes working with Sphinx trivial. He also mentions acts_as_sphinx and the predictably named Sphincter, which I plan to revisit eventually.
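To give a sense of how little code is involved, here’s a rough sketch of the Ultrasphinx basics. The model and column names are invented for illustration, so treat it as a sketch of the plugin’s documented interface rather than anything lifted from my actual project:

    class Article < ActiveRecord::Base
      # Tell Ultrasphinx which columns Sphinx should index
      is_indexed :fields => ['title', 'body']
    end

    # Querying is about as short as it gets
    search = Ultrasphinx::Search.new(:query => 'full text search')
    search.run
    search.results   # => array of matching Article records

Building the index and running the searchd daemon are handled by the plugin’s rake tasks (ultrasphinx:configure, ultrasphinx:index and ultrasphinx:daemon:start, if memory serves).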
So, how is Sphinx with Ultrasphinx? Awesome. I can’t quite reveal what exactly I’m working on, but I can say that I had recently downloaded all the page titles from Wikipedia and imported roughly 2.4 million of them into a three-column table in MySQL 5. The first column is an integer index. The second is the raw page title. The final column is the page title with formatting removed and underscores replaced by spaces. You know, human-readable and all that.
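The table itself is nothing fancy. A migration along these lines captures the shape, though the names here are invented for the sake of the example:

    class CreatePageTitles < ActiveRecord::Migration
      def self.up
        create_table :page_titles do |t|   # id doubles as the integer index
          t.string :raw_title              # the title exactly as exported from Wikipedia
          t.string :clean_title            # formatting stripped, underscores swapped for spaces
        end
      end

      def self.down
        drop_table :page_titles
      end
    end

The cleanup is essentially raw_title.tr('_', ' ') plus whatever formatting stripping the titles need.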
Running the indexing rake task by hand outputs:
collected 2409850 docs, 67.8 MB
sorted 8.9 Mhits, 100.0% done
total 2409850 docs, 67771103 bytes
total 34.220 sec, 1980424.75 bytes/sec, 70421.26 docs/sec
Roughly 34 seconds to index 2.4 million records. Not bad. I’ve had faster times, so it might be time to reboot the old MacBook. But what about actual searching? Consistently under a second. Relevance? Very good.
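If you want to sanity-check query times yourself, something along these lines in script/console will do it. The query string is just a placeholder, and your numbers will obviously vary with hardware and data:

    require 'benchmark'

    search  = Ultrasphinx::Search.new(:query => 'winston churchill')
    elapsed = Benchmark.realtime { search.run }

    puts "#{search.results.size} results on the first page in #{'%.3f' % elapsed} seconds"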
The verdict: Sphinx rocks.