CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
February 10, 2026
We are excited to announce the release of CommonLID, a language identification benchmark for the web, covering 109 languages. CommonLID was developed in collaboration with multiple open-source organizations and language community groups.
Host- and Domain-Level Web Graphs November/December 2025 and January 2026
January 30, 2026
The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level nodes with 6.1 billion edges.
January 2026 Crawl Archive Now Available
January 28, 2026
We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content.
Web Archives for Social Sciences Datathon, Bristol
January 26, 2026
Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work.
How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals
January 19, 2026
As SEOs grapple with the shift from traditional Search Engine Optimization to AI visibility, they're discovering a resource that's been powering AI training for years: Common Crawl's Web Graph.
GneissWeb Annotations Examples
January 13, 2026
A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.
Common Crawl at the Mozilla Festival 2025
January 5, 2026
From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events.
Host- and Domain-Level Web Graphs October, November, December 2025
December 23, 2025
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November, and December 2025.
December 2025 Crawl Archive Now Available
December 20, 2025
The crawl archive for December 2025 is now available, consisting of 2.16 billion web pages (or 364 TiB of uncompressed content).
A Sampling of 2025 Research Referencing Common Crawl
December 5, 2025
As another year here at Common Crawl comes to a close, we present a dozen papers from 2025 that demonstrate the range of topics and areas of study for which Common Crawl’s datasets are used and referenced.
Host- and Domain-Level Web Graphs September, October, and November 2025
November 24, 2025
We are pleased to announce the release of the web graphs based on the crawls of September, October, and November of 2025, consisting of 235.7 million nodes and 9.5 billion edges at the host level, and 100.7 million nodes and 6.6 billion edges at the domain level.
November 2025 Crawl Archive Now Available
November 23, 2025
We are pleased to announce that the crawl archive for November 2025 is now available, containing 2.29 billion web pages or 378 TiB of uncompressed content.
Common Crawl Celebrates World Digital Preservation Day
November 6, 2025
Common Crawl celebrates World Digital Preservation Day on November 6, an occasion that invites the community to unite in answering a powerful question: Why Preserve?
Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good
November 4, 2025
A recent article in The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.
October/November 2025 Newsletter
November 3, 2025
Check out our newsletter for October/November 2025, with updates on what we've been up to.
Common Crawl Foundation at Stanford HAI
October 27, 2025
The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.
Host- and Domain-Level Web Graphs August, September, and October 2025
October 25, 2025
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of August, September, and October 2025, consisting of 468.4 million nodes and 8.0 billion edges at the host level, and 97.7 million nodes and 6.0 billion edges at the domain level.
October 2025 Crawl Archive Now Available
October 23, 2025
We are pleased to announce the release of the October 2025 crawl, containing 2.61 billion web pages or 468 TiB of uncompressed content.
Common Crawl Foundation at COLM 2025
October 20, 2025
The Common Crawl team attended the 2nd Conference on Language Modeling in Montréal, organizing a workshop, giving invited talks, and strengthening links with the research community.
Announcing GneissWeb Annotations
October 6, 2025
Common Crawl has added IBM’s GneissWeb quality and category annotations to its web dataset, enabling users to filter high-quality content and explore topics like medical, education, and technology.
Web Languages Needing Review by Native Speakers
September 29, 2025
Common Crawl’s Web Languages initiative has had many contributions since its introduction. We’re calling for native speakers of certain languages to review language contributions, to ensure that links we’re adding to our seed crawl are of good quality.
Host- and Domain-Level Web Graphs July, August, and September 2025
September 25, 2025
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2025. The host-level graph consists of 628.7 million nodes and 6.9 billion edges, and the domain-level graph consists of 184.6 million nodes and 5.4 billion edges.
From SEO to AIO: Why Your Content Needs to Exist in AI Training Data
September 23, 2025
Traditional search engine optimization is rapidly evolving into "AIO" (AI optimization): as users increasingly turn to AI assistants for answers, businesses must ensure their content exists in AI training datasets to remain discoverable. This shift is already driving real business impact, making presence in AI training data as strategically vital as traditional search rankings once were.
September 2025 Crawl Archive Now Available
September 22, 2025
We are pleased to announce the release of our September 2025 crawl, containing 2.39 billion web pages, or 421 TiB of uncompressed content.
Common Crawl Foundation Opt-Out Registry
September 17, 2025
Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received.