September 17, 2025

Common Crawl Foundation Opt-Out Registry

Note: this post has been marked as obsolete.
Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received.
Common Crawl Foundation
Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Why We're Doing This

We have many downstream users who are not aware of the legal notices we have received. On behalf of the broader ecosystem, we want to alert our users about material that content owners have specifically requested to be excluded from our crawls.

About Common Crawl

Common Crawl is a free, non-commercial archive of the public web that has operated for 18 years. We have never charged for access to our data, and our datasets have been cited in over 10,000 research papers.

Each month, we crawl several billion new pages from our frontier of over 1 trillion pages, supporting research in web science, natural language processing, internet security research, the humanities, and other fields.

Our Commitment

We respect content owners' rights and are committed to honoring legitimate opt-out requests. We encourage publishers to use standard methods like robots.txt files to control crawling, and we will continue to process legal requests as they are received.

Moving Forward

We will be updating our "Opt-Out Registry" on an ongoing basis as we receive new requests. This initial publication represents our commitment to transparency with both content creators and the research community that relies on our data.

For questions about opt-out procedures or to submit requests, please contact us at info@commoncrawl.org.

To configure how our crawler (CCBot) interacts with your website, visit: https://commoncrawl.org/ccbot
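As a rough illustration of the robots.txt approach mentioned above, the sketch below uses Python's standard `urllib.robotparser` to check how a set of rules would apply to a crawler identifying itself as `CCBot`. The robots.txt content, the `example.com` URLs, the `SomeOtherBot` name, and the helper `is_allowed` are all hypothetical and exist only for this example. It shows how a standards-following parser reads the rules, not how CCBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks CCBot site-wide, and keeps every
# other crawler out of /private/ only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ccbot_allowed = is_allowed(EXAMPLE_ROBOTS_TXT, "CCBot", "https://example.com/article")
other_allowed = is_allowed(EXAMPLE_ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")
other_private = is_allowed(EXAMPLE_ROBOTS_TXT, "SomeOtherBot", "https://example.com/private/page")

print(ccbot_allowed, other_allowed, other_private)
```

Running a check like this against a draft robots.txt before deploying it can help confirm that the rules block only what was intended.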

The Common Crawl Opt-Out Registry can be viewed here.

Erratum: Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
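For readers processing several crawls, the thresholds above can be captured in a small lookup. This is a minimal sketch, not an official utility: the function name `truncation_limit_bytes` is hypothetical, it assumes crawl identifiers of the form `CC-MAIN-YYYY-WW`, and it assumes CC-MAIN-2025-13 is the first crawl using the 5 MiB limit.

```python
import re

MIB = 1024 * 1024

# Assumption: CC-MAIN-2025-13 (March 2025) is the first crawl with the larger limit.
LIMIT_CHANGE = (2025, 13)


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the assumed fetch size limit, in bytes, for a crawl ID."""
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        # Identifiers in any other format are not handled by this sketch.
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element: year first, then week number.
    if (year, week) >= LIMIT_CHANGE:
        return 5 * MIB
    return 1 * MIB


print(truncation_limit_bytes("CC-MAIN-2024-51"))  # 1 MiB in bytes
print(truncation_limit_bytes("CC-MAIN-2025-13"))  # 5 MiB in bytes
```

A payload whose length reaches the limit for its crawl is a candidate for having been cut off, although this lookup alone cannot confirm that a given record was truncated.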

For more details, see our truncation analysis notebook.