About
Overview
Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Founded in 2007 by Gil Elbaz, we maintain an open repository of web crawl data collected since 2008, totaling more than 10 petabytes. Crawls are published approximately once a month, each typically containing more than two billion web pages.
The dataset is hosted on Amazon Web Services through its Open Data Sponsorship Program and can be downloaded at no cost. It has been cited in over 12,000 research papers and has become one of the most widely used sources of training data for large language models.
Common Crawl is a member of the International Internet Preservation Consortium (IIPC) and a partner in the End of Term Web Archive, which preserves US federal government websites during presidential transitions.
History
Founding and early operations
Technical Infrastructure
CCBot
Our web crawler, CCBot, is built on Apache Nutch and identifies itself with the User-Agent string CCBot. It obeys robots.txt directives and rate-limits its requests to individual servers; each crawl cycle processes approximately three billion pages.
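Because CCBot honors the Robots Exclusion Protocol, site operators control its access with standard robots.txt directives. A minimal illustration (the paths are placeholders, and Crawl-delay support is a common convention rather than something guaranteed by the protocol itself):

```
# Exclude CCBot from part of a site and ask it to slow down.
User-agent: CCBot
Disallow: /private/
Crawl-delay: 5
```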
CCBot uses Harmonic Centrality, a graph-theoretic measure of a node's proximity to the structural core of the web, to prioritize URLs for crawling. This metric is computed from our Web Graph data and is used alongside PageRank.
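For intuition, the harmonic centrality of a page v is the sum of 1/d(u, v) over all other pages u, where d(u, v) is the shortest-path distance along hyperlinks from u to v; unreachable pairs contribute zero, so the measure is well defined on disconnected graphs. The sketch below computes it by brute-force BFS on a toy directed graph. Our production ranks are computed at web scale with HyperBall, so this is only an illustration of the metric, not of our pipeline:

```python
from collections import deque

def harmonic_centrality(graph):
    """Harmonic centrality of every node in a directed graph.

    graph: dict mapping node -> iterable of successor nodes.
    H(v) = sum over u != v of 1 / d(u, v), where d(u, v) is the
    shortest-path distance from u to v; unreachable pairs add 0.
    """
    nodes = set(graph)
    for succs in graph.values():
        nodes.update(succs)
    # Reverse the edges so one BFS from v follows incoming links,
    # yielding d(u, v) for every u in a single traversal.
    rev = {n: [] for n in nodes}
    for u, succs in graph.items():
        for v in succs:
            rev[v].append(u)
    scores = {}
    for v in nodes:
        dist = {v: 0}
        queue = deque([v])
        total = 0.0
        while queue:
            x = queue.popleft()
            for u in rev[x]:
                if u not in dist:
                    dist[u] = dist[x] + 1
                    total += 1.0 / dist[u]
                    queue.append(u)
        scores[v] = total
    return scores

# A 3-cycle a -> b -> c -> a: each node is reached at distances 1 and 2,
# so every node scores 1 + 1/2 = 1.5.
print(harmonic_centrality({"a": ["b"], "b": ["c"], "c": ["a"]}))
```

Nodes with many short incoming paths score highest, which is why the metric favors well-linked, central pages when ordering the crawl frontier.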
Data Formats
Web Graphs
Common Crawl publishes host-level and domain-level Web Graphs derived from the hyperlink structure of our crawls, with computed Harmonic Centrality and PageRank values for each node. A list of releases is available on our S3 bucket.
Harmonic Centrality ranks are calculated via HyperBall, part of the WebGraph framework developed at the University of Milan by Paolo Boldi and Sebastiano Vigna.
Statistics for our Web Graphs can be found on the cc-webgraph-statistics page.
Hosting
Common Crawl's data is stored in the AWS us-east-1 (N. Virginia) region under the AWS Open Data Sponsorship Program. Data can be accessed via Amazon S3 or through Common Crawl's index server. We also publish experimental data products on Hugging Face, including the Common Crawl Citations dataset and crawl statistics.
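Each index record identifies the WARC file holding a capture along with its byte offset and length, so a single page can be retrieved from the public bucket with an HTTP Range request rather than downloading a whole archive. A sketch of that step, assuming the index's JSON field names (`filename`, `offset`, `length`); the sample values are invented:

```python
import json

# One record as returned by the URL index, one JSON object per line
# (illustrative values only).
sample_line = (
    '{"urlkey": "com,example)/", "timestamp": "20240218000000", '
    '"url": "https://example.com/", "status": "200", '
    '"filename": "crawl-data/CC-MAIN-2024-10/segments/.../warc/....warc.gz", '
    '"offset": "1234567", "length": "4096"}'
)

def record_to_range_request(line, bucket_url="https://data.commoncrawl.org/"):
    """Turn an index record into the URL and HTTP Range header needed
    to fetch just that one gzipped WARC record."""
    rec = json.loads(line)
    start = int(rec["offset"])
    end = start + int(rec["length"]) - 1  # Range offsets are inclusive
    return bucket_url + rec["filename"], {"Range": f"bytes={start}-{end}"}

url, headers = record_to_range_request(sample_line)
```

Passing `headers` to any HTTP client then returns only the few kilobytes belonging to that capture.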
Crawl Statistics
The table below shows the six most recent crawls and their respective statistics. Further information can be found on the cc-crawl-statistics page.
Role in AI Development
Common Crawl data, typically after filtering and processing by third parties, has been a primary training data source for many large language models. A 2024 Mozilla Foundation study described the dataset as having "laid the infrastructural foundation" for the generative AI boom, and found that at least 64% of 47 large language models published between 2019 and 2023 were trained on filtered versions of Common Crawl data.
A 2025 paper at COLM found that compliance with robots.txt web crawling opt-outs does not degrade general knowledge acquisition in LLMs, reporting a close-to-zero "Data Compliance Gap."
Derived Datasets
Multilingual Initiatives
Common Crawl has undertaken several efforts to expand the linguistic diversity of its dataset. In December 2024, the organization launched the Web Languages Project to address the overrepresentation of English-language content, inviting speakers of Languages Other Than English (LOTE) to submit URLs via a public GitHub repository.
In early 2026, the organization released CommonLID, a language identification benchmark for web data covering 109 languages, developed with MLCommons, EleutherAI, and Johns Hopkins University through community annotation of over 350,000 lines of web text.
Organization
Governance and funding
Gil Elbaz serves as Chairman of the Board. Primary funding comes from the Elbaz Family Foundation. As of 2026, the organization has approximately 18 staff members.
Collaborators
Common Crawl collaborates with many organizations. If you're interested in becoming a collaborator, please get in touch.