September 17, 2025

Common Crawl Foundation Opt-Out Registry

Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing a full list of the legal opt-out requests we have received.

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Why We're Doing This

Many of our downstream users are not aware of the legal notices we have received. On behalf of the broader ecosystem, we want to alert them to material that content owners have specifically asked us to exclude from our crawls.

About Common Crawl

Common Crawl is a free, non-commercial archive of the public web that has operated for 18 years. We have never charged for access to our data, and our datasets have been cited in over 10,000 research papers.

Each month, we crawl several billion new pages from our frontier of over 1 trillion pages, supporting research in web science, natural language processing, internet security, the humanities, and other fields.

Our Commitment

We respect content owners' rights and are committed to honoring legitimate opt-out requests. We encourage publishers to use standard methods like robots.txt files to control crawling, and we will continue to process legal requests as they are received.

Moving Forward

We will be updating our "Opt-Out Registry" on an ongoing basis as we receive new requests. This initial publication represents our commitment to transparency with both content creators and the research community that relies on our data.

For questions about opt-out procedures or to submit requests, please contact us at info@commoncrawl.org.

To configure how our crawler (CCBot) interacts with your website, visit: https://commoncrawl.org/ccbot
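As a rough illustration of the robots.txt approach mentioned above, the sketch below uses Python's standard-library `urllib.robotparser` to check whether a set of rules would block CCBot while leaving other crawlers unaffected. The rules, the `is_allowed` helper, the `ExampleBot` name, and the example.com URL are all hypothetical. It also assumes the crawler identifies itself with the `CCBot` user-agent token, so consult the CCBot page linked above for authoritative guidance.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block CCBot everywhere, allow every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", page))
    print("ExampleBot allowed:", is_allowed(ROBOTS_TXT, "ExampleBot", page))
```

Running a check like this against your own robots.txt before deploying it is a cheap way to confirm the rules do what you intend.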

View the Common Crawl Opt-Out Registry.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
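For users who need to account for this in their own tooling, here is a minimal sketch that maps a crawl identifier to the truncation threshold described above. It assumes identifiers follow the `CC-MAIN-YYYY-WW` pattern seen in `CC-MAIN-2025-13`. The function name `truncation_limit_bytes` is hypothetical, and identifiers in any other format are rejected rather than guessed at.

```python
import re

MIB = 1024 * 1024

# Assumed identifier pattern: CC-MAIN-<4-digit year>-<2-digit week>.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch-size truncation threshold, in bytes, for a crawl."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl identifier: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # CC-MAIN-2025-13 (March 2025) is the first crawl with the 5 MiB limit;
    # every earlier crawl used 1 MiB.
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB


if __name__ == "__main__":
    for crawl in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(crawl, truncation_limit_bytes(crawl))
```

A record whose payload length reaches the threshold for its crawl may have been cut off, so downstream code should treat such records as potentially incomplete.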