There is a new crawler on the block: VisBot has been making its rounds. The seemingly legitimate crawler led me to its company's site, Visvo, and I like what I've found. Why does Visvo matter among this new wave of search startups? Three reasons: 1) each search result has an "explain" link that shows the technical details of how that result was scored; 2) they are using open source search technology, namely Nutch and Hadoop, both from the Lucene project; 3) they are building their own index instead of using a feed, as for example Quintura does.
The Dallas-based startup explains its name as the Sanskrit word for "universe", and its tag line is "its our universe". After the customary grandiose and unqualified statements about how advanced they are, Visvo claims to have developed the "automatic categorization search engine for web content". While the implementation of this wasn't immediately obvious to me after trying their search engine, I decided I like them as a company for other reasons.
I immediately did my favorite search: a vanity search on my full name (which should be fairly unique). The Visvo engine returned two (and once three) results that were highly relevant: the "Debian as a Desktop System" article I wrote for Free Software Magazine, and a pingback for a blog entry I wrote about the upcoming FreeBSD installer "finstall". Here is a screenshot of the results page for that vanity search (click on the image for full resolution):
Each search result has an "explain" link (see screenshots) where so-inclined users can view the technical details of how the search result was ranked by the Visvo engine. It shows factors such as the tf-idf score of the document, probably the most widely used statistical measure of how important a term is to a document: it weighs the term's frequency in that document against how common the term is across the rest of the corpus. (Click on the screenshot for full resolution.)
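To make the tf-idf idea concrete, here is a minimal sketch of the textbook formula in Python. It is not Visvo's or Lucene's actual scoring code (real engines typically apply their own variants and normalization), and the function name `tf_idf` and the three-document toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf of `term` for one tokenized document in a corpus."""
    if not doc_tokens:
        return 0.0
    # Term frequency: share of the document's tokens that are `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of documents that contain `term`.
    df = sum(1 for tokens in corpus if term in tokens)
    # Inverse document frequency; a term found in every document scores 0.
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Hypothetical toy corpus, already split into lowercase tokens.
corpus = [
    "debian as a desktop system".split(),
    "the freebsd installer finstall".split(),
    "a new search engine crawler".split(),
]
doc = corpus[0]
for term in ("debian", "a"):
    print(term, tf_idf(term, doc, corpus))
```

Both terms occur once in the first document, but "debian" appears in only one document of the corpus while "a" appears in two, so "debian" gets the higher score. That is the effect the explain page is exposing: rare terms count for more than common ones.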
The fact that they are building their own index is relevant. Today's search giants have amassed too much power, in essence a monopoly on who finds what, and my hope for the future is that startups like Visvo can challenge the entrenched incumbents. The fact that they also open up their ranking is important, and this level of transparency is something Google could learn from.
The technical presentation by Dennis Kubes, founder and CTO of Visvo (linked at the bottom of the page), gives us some juicy details about their architecture. Visvo is indeed using Hadoop, the open source MapReduce implementation, and occasionally supplements its dedicated hardware with Amazon's EC2 service.
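For readers who haven't met the MapReduce model, the sketch below simulates its three steps (map, shuffle, reduce) in a single Python process. It does not use Hadoop's API and says nothing about what Visvo's jobs actually do; building a small inverted index is simply an assumed, search-flavored example, and all function names and documents are hypothetical.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, doc_ids):
    """Collapse the grouped values into a sorted posting list for the term."""
    return term, sorted(doc_ids)

# Hypothetical documents keyed by id.
docs = {
    1: "Debian as a desktop system",
    2: "A FreeBSD installer",
    3: "a desktop search engine",
}
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())
print(index["desktop"])
```

In a real cluster the map and reduce calls run on many machines and the framework handles the grouping step, which is what makes it practical to rent extra capacity for a large crawl.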
I'm impressed so far. I wish them success, since their success is also a success for the open source community, more so than Google's is.
I encourage everyone to give it a go, although the product is still in alpha, so it probably won't be ready for your day-to-day search needs for a while.