Monday, September 22, 2008

Cuil.com, a new search engine, is banned by more than 10,000 sites

Be careful while you debug your crawler...

Cuil.com
Cuil (pronounced [kuːl], "cool", according to the creators) is a search engine that organizes web pages by content and displays relatively long entries along with thumbnail pictures for many results. It claims to have a larger index than any other search engine, with about 120 billion web pages.[1] It went live on July 28, 2008.

I doubt it will take over the big 4 search engines. On launch day, the servers hung and were down for more than 3 hours... a big joke, right? Hahaha.

Cuil's privacy policy, unlike that of other search engines, says it does not store users' search activity or IP addresses.

Cuil is managed and developed largely by former employees of Google (or we could call them Google betrayers, hahaha): Anna Patterson, Russell Power and Louis Monier, who has since quit the company. The CEO and co-founder, Tom Costello, has worked for IBM and others. The company raised $33 million from venture capital firms including Greylock, and I think it can last for 3 years at most.


The search startup seems to have run a rather high-rate crawl when it was getting started, and that generated a large number of robots.txt bans.

A well-behaved crawler needs to follow a set of loosely defined behaviors to be 'polite': don't crawl a site too fast or too heavily, don't crawl any single IP address too fast, don't pull too much bandwidth from the server by downloading tons of full-res media that will never be indexed, meticulously obey robots.txt, identify itself with a user-agent string that points to a detailed web page explaining the purpose of the bot, etc.
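To make a couple of those politeness rules concrete, here is a minimal sketch in Python. It has nothing to do with Cuil's actual crawler. The class name PolitenessGate, the bot name ExampleBot, the example.com URL and the 2-second default delay are all made up for illustration. It only covers the robots.txt check and the per-host delay, and leaves out real fetching, per-IP limits and bandwidth caps.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent: a bot name plus a URL explaining what the bot does.
USER_AGENT = "ExampleBot/0.1 (+http://example.com/bot.html)"


class PolitenessGate:
    """Tracks robots.txt rules and a minimum delay between hits per host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic):
        self.min_delay = min_delay   # assumed default, in seconds
        self.clock = clock           # injectable so the gate can be tested
        self.robots = {}             # host -> RobotFileParser
        self.next_allowed = {}       # host -> earliest time of next fetch

    def load_robots(self, host, robots_txt):
        """Register the robots.txt body already downloaded for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True only if robots.txt for the host is known and permits the URL."""
        parser = self.robots.get(urlparse(url).netloc)
        # Conservative choice: no robots.txt loaded yet means "do not fetch".
        return parser is not None and parser.can_fetch(USER_AGENT, url)

    def wait_time(self, url):
        """Seconds to wait before this host may be hit again."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url):
        """Call after each fetch; honors Crawl-delay if it exceeds min_delay."""
        host = urlparse(url).netloc
        delay = self.min_delay
        parser = self.robots.get(host)
        if parser is not None:
            crawl_delay = parser.crawl_delay(USER_AGENT)
            if crawl_delay:
                delay = max(delay, float(crawl_delay))
        self.next_allowed[host] = self.clock() + delay
```

A crawl loop would call allowed() before queuing a URL, sleep for wait_time() before fetching, and call record_fetch() afterwards. Treating a host with no loaded robots.txt as off limits is a deliberately cautious choice here, not a rule from any standard.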

Apart from the widely recognized challenges of building a new search engine, sites like del.icio.us and compete.com that ban all new robots aside from the big 4 (Google, Yahoo, MSN and Ask) make it that much harder for a new entrant to gain a footing. However, the web is so bloody vast that even tens of thousands of site bans are unlikely to make a significant impact on the aggregate perceived quality of a major new engine.

My initial take was that this had to be annoying for Cuil. As a crawler author, I can attest that each new site rejection hurts personally. But now I'm not so sure. Looking over the list, aside from a few major sites like Yelp, you could argue that getting all the forum SEOs to robots-exclude your new engine might actually help improve your index quality. Perhaps a Cuil robots ban is a quality signal?
