ccBot

ccBot is CommonCrawl's crawler. ccBot is under active development and has an extensive feature roadmap for the next 12 months. The following is a list of answers to Frequently Asked Questions.

How does the bot identify itself?
Our bot will identify itself with the following User-Agent string: CCBot/1.0 (+http://www.commoncrawl.org/bot.html)

How often does the bot access pages?
We aim to build a system that can maintain a fresh crawl of the web, but for now our crawling aims are more modest, and we intend not to overtax anyone's servers.

How can I ask for a slower crawl if the bot is taking up too much bandwidth?
We obey the robots.txt crawl-delay convention, so increasing that number tells ccBot to slow its rate of crawling.

How can I block this bot?
Configure your robots.txt file, which uses the Robots Exclusion Protocol, to block the crawler. Our bot's Exclusion User-Agent string is: ccbot.

How can I ensure this bot can crawl my site effectively?
We are working hard to add features to the crawl system and hope to support the sitemap protocol in the future.

Does the bot support conditional GET/compression?
Although we currently do not support conditional GET requests, we will be adding this support in a subsequent release of the crawler. We do currently support the gzip encoding format.

Why is the bot crawling pages I don't have links to?
The bot may have found your pages by following links from other sites.

What is the IP range of the bot?
38.107.179.200 through 38.107.179.245

Does the bot support nofollow?
Currently, we do honor the nofollow attribute as it applies to links embedded on your site. It should be noted that the nofollow attribute value is not meant for blocking access to content or preventing content from being indexed by search engines.
Instead, the nofollow attribute is primarily used by site authors to prevent search engines such as Google from having the source page's PageRank impact the PageRank of linked targets. If we ever did ignore nofollow in the future, we would do so only for the purposes of link discovery and would never create any association between the discovered link and the source document.

What parts of robots.txt does the bot support?
We support Disallow as well as Disallow / Allow combinations. We also support the crawl-delay directive. We plan to support the sitemap directive in a future release.

What robots meta tags does the bot support?
We support the NOFOLLOW meta-tag.

What is the history of the ccBot crawler?
The ccBot crawler is a distributed crawling infrastructure that makes use of the Apache Hadoop project and some parts of the Apache Nutch project. We use MapReduce to process and extract crawl candidates from our crawl database. This candidate list is sorted by host (domain name) and then distributed to a set of spider (bot) servers. We do not use Nutch for crawling; instead, we use a custom crawl infrastructure that strictly limits the rate at which we crawl individual web hosts. The resulting crawl data is then post-processed (for link extraction and deduplication) and reintegrated into the crawl database.

What do you intend to do with the crawled content?
Our mission is to build, maintain and make widely available a comprehensive crawl of the Internet for the purpose of enabling a new wave of innovation, education and research. Access to such a crawl is a necessity for many companies, yet it is a resource that is available to only a few very large organizations. We are attempting to level the playing field a little, and as such are developing the systems and processes that will let us make our crawl accessible to as wide an audience as possible. We will make more information available on our website as we make progress towards this goal.
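
For site operators who want to try out the robots.txt and IP range answers above, the following is a minimal sketch in Python. It is not ccBot's own code. It uses the standard library's urllib.robotparser, which applies the first matching rule and may differ in details from how ccBot reads a robots.txt file. The robots.txt contents (the /private/ path and the 10-second delay), the example.com URLs, and the helper names load_rules and in_ccbot_range are assumptions made for illustration. The User-Agent string and the address range are the ones quoted in this FAQ.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# User-Agent string quoted in the FAQ.
CCBOT_UA = "CCBot/1.0 (+http://www.commoncrawl.org/bot.html)"

# Hypothetical robots.txt: the path and the 10-second delay are assumptions.
# Disallow comes before Allow because this parser uses the first matching rule.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ccbot
Crawl-delay: 10
Disallow: /private/
Allow: /
"""

# Address range quoted in the FAQ; it may not be current.
RANGE_FIRST = ipaddress.IPv4Address("38.107.179.200")
RANGE_LAST = ipaddress.IPv4Address("38.107.179.245")


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def in_ccbot_range(address):
    """Return True if address is an IPv4 address inside the quoted range."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and RANGE_FIRST <= ip <= RANGE_LAST


rules = load_rules(EXAMPLE_ROBOTS_TXT)
print("crawl delay:", rules.crawl_delay(CCBOT_UA))
print("private page allowed:",
      rules.can_fetch(CCBOT_UA, "http://example.com/private/page.html"))
print("home page allowed:",
      rules.can_fetch(CCBOT_UA, "http://example.com/index.html"))
print("address in range:", in_ccbot_range("38.107.179.210"))
```

To block the crawler entirely under the same assumptions, the ccbot group would contain only the line "Disallow: /". To slow it down, raise the Crawl-delay value.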