Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list, covering every legal request we have received.
Why We're Doing This
Many of our downstream users are not aware of the legal notices we have received. On behalf of the broader ecosystem, we want to alert our users to material that content owners have specifically asked us to exclude from our crawls.
About Common Crawl
Common Crawl is a free, non-commercial archive of the public web that has operated for 18 years. We have never charged for access to our data, and our datasets have been cited in over 10,000 research papers.
Each month, we crawl several billion new pages from our frontier of over 1 trillion pages, supporting research in web science, natural language processing, internet security, the humanities, and other fields.
Our Commitment
We respect content owners' rights and are committed to honoring legitimate opt-out requests. We encourage publishers to use standard methods like robots.txt files to control crawling, and we will continue to process legal requests as they are received.
Moving Forward
We will be updating our "Opt-Out Registry" on an ongoing basis as we receive new requests. This initial publication represents our commitment to transparency with both content creators and the research community that relies on our data.
For questions about opt-out procedures or to submit requests, please contact us at info@commoncrawl.org.
To configure how our crawler (CCBot) interacts with your website, visit: https://commoncrawl.org/ccbot
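As an illustration of the robots.txt approach, the following minimal sketch checks whether a given robots.txt would permit CCBot to fetch a URL. It uses Python's standard `urllib.robotparser` module. The sample rules, the `agent_allowed` helper, and the example.com URL are all hypothetical and chosen only for demonstration; CCBot's actual handling of robots.txt may differ in detail from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block CCBot everywhere, allow every other crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def agent_allowed(robots_txt: str, url: str, user_agent: str = "CCBot") -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Illustrative helper (not part of any Common Crawl tooling); it relies
    on the matching rules implemented by Python's standard library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print("CCBot allowed:", agent_allowed(SAMPLE_ROBOTS_TXT, page))
    print("OtherBot allowed:", agent_allowed(SAMPLE_ROBOTS_TXT, page, "OtherBot"))
```

Running a check like this against a draft robots.txt before deploying it can help confirm that the rules apply to the intended user agent.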
The Common Crawl Opt-Out Registry can be viewed here.
Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
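For downstream users who want to account for truncation, here is a rough sketch of how the two thresholds above might be applied. It assumes crawl identifiers follow the `CC-MAIN-YYYY-WW` pattern and that CC-MAIN-2025-13 is the first crawl with the 5 MiB limit. The helper names are hypothetical, and comparing a payload's length to the limit is only a heuristic. The WARC format also defines a `WARC-Truncated` header, which is a more direct signal where it is present.

```python
import re

MIB = 1024 * 1024

# Assumed identifier pattern, e.g. "CC-MAIN-2025-13" (year, then week number).
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def fetch_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit assumed for a crawl, in bytes.

    Hypothetical helper: 1 MiB before CC-MAIN-2025-13, 5 MiB from it onwards.
    """
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl id: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic: a payload at or above the limit was probably cut off."""
    return payload_length >= fetch_limit_bytes(crawl_id)


if __name__ == "__main__":
    print(fetch_limit_bytes("CC-MAIN-2024-51"))           # limit for an older crawl
    print(fetch_limit_bytes("CC-MAIN-2025-13"))           # limit for the March 2025 crawl
    print(may_be_truncated("CC-MAIN-2024-51", 1 * MIB))   # payload exactly at the old limit
```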