ccBot

ccBot is CommonCrawl's web crawler. It is under active development and has an extensive feature roadmap for the next 12 months.

The following is a list of answers to Frequently Asked Questions.

How does the bot identify itself?

Our bot will identify itself with the following User-Agent string: CCBot/1.0 (+http://www.commoncrawl.org/bot.html)
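For site operators who want to spot the crawler in their access logs, a minimal Python sketch follows. It assumes the User-Agent value has already been extracted from a log line. The function name is_ccbot is hypothetical, and matching only on the "CCBot/" product token rather than the full string is an assumption made so that a later version number would still match.

```python
# The User-Agent string quoted in this FAQ.
CCBOT_USER_AGENT = "CCBot/1.0 (+http://www.commoncrawl.org/bot.html)"


def is_ccbot(user_agent: str) -> bool:
    """Return True if a User-Agent header value looks like ccBot.

    Matches the "CCBot/" product token case-insensitively (an assumption),
    so a different version number would still be recognised.
    """
    return user_agent.strip().lower().startswith("ccbot/")
```

A User-Agent header can be set to anything by the client, so a match is a hint rather than proof of where a request came from.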

How often does the bot access pages?

We aim to build a system that can maintain a fresh crawl of the web, but, for now, our crawling aims are more modest, and we intend not to overtax anyone's servers.

How can I ask for a slower crawl if the bot is taking up too much bandwidth?

We obey the crawl-delay directive of the robots.txt convention, so by increasing that number you tell ccBot to slow down its rate of crawling.
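As an illustration, the robots.txt text embedded in the sketch below asks ccBot for a delay, and Python's standard urllib.robotparser module is used to confirm that the file parses as intended. The value 10 is an arbitrary example, and this FAQ does not state the unit (seconds is the common convention). The helper name crawl_delay_for is hypothetical. The check validates the file only; it says nothing about how ccBot itself schedules requests.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the value 10 is only an illustration.
DELAY_ROBOTS_TXT = """\
User-agent: ccbot
Crawl-delay: 10
"""


def crawl_delay_for(robots_txt, agent):
    """Return the Crawl-delay that robots_txt declares for agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


requested_delay = crawl_delay_for(DELAY_ROBOTS_TXT, "ccbot")
```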

How can I block this bot?

You can block the crawler by configuring your robots.txt file, which uses the Robots Exclusion Protocol. Our bot's Exclusion User-Agent string is: ccbot.
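A minimal sketch of such a rule, checked with Python's standard urllib.robotparser module, is shown here. The example.com URL and the agent name otherbot are placeholders, and is_allowed is a hypothetical helper. The result reflects how Python's parser reads the file, which is a reasonable sanity check but not a statement about ccBot's own parser.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content that excludes ccBot from the whole site.
BLOCK_CCBOT = """\
User-agent: ccbot
Disallow: /
"""


def is_allowed(robots_txt, agent, url):
    """Return True if robots_txt lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With this file, is_allowed(BLOCK_CCBOT, "ccbot", url) is False for every path, while agents without a matching section are unaffected.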

How can I ensure this bot can crawl my site effectively?

We are working hard to add features to the crawl system and hope to support the sitemap protocol in the future.

Does the bot support conditional GET/compression?

We do not currently support conditional GET requests, but we will add this support in a subsequent release of the crawler. We do currently support the gzip encoding format.
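On the server side, gzip support usually means compressing a response only when the client's Accept-Encoding request header lists gzip. The sketch below shows that decision in Python. It assumes ccBot advertises gzip through that standard header, which this FAQ does not spell out, and the function name encode_body is hypothetical.

```python
import gzip


def encode_body(body: bytes, accept_encoding: str):
    """Gzip body if the client's Accept-Encoding header lists gzip.

    Returns (payload, headers). Quality values such as "gzip;q=0" are
    ignored here for brevity, which is a simplification.
    """
    offered = {
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    }
    if "gzip" in offered:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}
```

In practice most web servers provide this as a configuration option, so hand-written code like this is rarely needed.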

Why is the bot crawling pages I don't have links to?

The bot may have found your pages by following links from other sites.

What is the IP range of the bot?

38.103.63.16 through 38.103.63.18
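To test whether a client address from a log falls inside this range, a small Python sketch using the standard ipaddress module could look like the following. The function name in_ccbot_range is hypothetical, and the range is treated as inclusive at both ends.

```python
import ipaddress

# Range quoted in this FAQ, treated as inclusive on both ends.
CCBOT_FIRST = ipaddress.ip_address("38.103.63.16")
CCBOT_LAST = ipaddress.ip_address("38.103.63.18")


def in_ccbot_range(ip_text: str) -> bool:
    """Return True if ip_text is an address inside the quoted range."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # not a valid IP address at all
    # IPv4 and IPv6 addresses cannot be ordered against each other.
    if ip.version != CCBOT_FIRST.version:
        return False
    return CCBOT_FIRST <= ip <= CCBOT_LAST
```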

Does the bot support nofollow?

According to Wikipedia, the nofollow attribute value is not meant for blocking access to content or for preventing content from being indexed by search engines. Instead, site authors primarily use it to prevent search engines such as Google from letting the source page's PageRank affect the PageRank of linked targets. We plan to ignore the nofollow attribute value for the purposes of restricting the crawler's ability to follow link targets. The proper method for blocking search engine spiders from accessing content on a website, or for preventing them from including the content of a page in their index, is the Robots Exclusion Protocol (robots.txt).

What parts of robots.txt does the bot support?

We support Disallow as well as Disallow / Allow combinations. We also support the crawl-delay directive. We plan to support the sitemap directive in a future release.
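A Disallow / Allow combination might look like the robots.txt text in the sketch below, which opens one directory and closes the rest of the site. It is checked with Python's standard urllib.robotparser module. Python's parser applies rules in file order, which may differ from how ccBot resolves overlapping rules, so the Allow line is placed first to give the same answer under either reading. The paths and the example.com host are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Allow one directory for ccBot and disallow everything else.
COMBINED_ROBOTS_TXT = """\
User-agent: ccbot
Allow: /public/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(COMBINED_ROBOTS_TXT.splitlines())

# Under /public/ the Allow rule matches; elsewhere only Disallow matches.
public_ok = parser.can_fetch("ccbot", "http://example.com/public/index.html")
private_ok = parser.can_fetch("ccbot", "http://example.com/private/index.html")
```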

What robots meta tags does the bot support?

We plan to support the NOINDEX and NOFOLLOW meta tags. NOINDEX will prevent our crawler from caching a copy of your web page in our archive, while NOFOLLOW will prevent our crawler from following link targets contained within the content of your web page.
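For reference, these directives are normally written as a meta element with name "robots" in a page's head. The Python sketch below, built on the standard html.parser module, reads them from a page so an author can confirm what a crawler would see. The class and function names are hypothetical, and this is not ccBot's own parsing code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(
                part.strip().lower()
                for part in content.split(",")
                if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of lower-cased robots meta directives in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE = '<html><head><meta name="robots" content="NOINDEX, NOFOLLOW"></head></html>'
```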

What is the history of the ccBot crawler?

The ccBot crawler is a distributed crawling infrastructure that makes use of the Apache Hadoop and Nutch projects. We use MapReduce (via the open source Hadoop project) to process and extract crawl candidates from our crawl database. This candidate list is sorted by host (domain name) and then distributed via RPC to a set of spider (bot) servers. The resulting crawl data is then post-processed (for the purposes of link extraction and deduplication) and reintegrated into the crawl database.
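The host-sorting and distribution step can be pictured with a small in-memory Python sketch. It is only an illustration: it uses plain lists in place of Hadoop and RPC, the function name partition_by_host is hypothetical, and the round-robin assignment of hosts to servers is an assumption, since the actual distribution policy is not described here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_host(urls, server_count):
    """Sort candidate URLs by host and spread the hosts across servers.

    Every URL for a given host lands in the same batch. Round-robin
    assignment of hosts to batches is an assumption made for this sketch.
    """
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname or ""].append(url)

    batches = [[] for _ in range(server_count)]
    for index, host in enumerate(sorted(by_host)):
        batches[index % server_count].extend(sorted(by_host[host]))
    return batches
```

Keeping all of a host's URLs on one server is one way to let that server pace its requests to the host, for example to honor a crawl-delay.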