Blog

The latest news, interviews, technologies, and resources.

Crawl Release

June 2026 Crawl Archive Now Available

We are happy to announce the release of the June 2026 crawl archive, consisting of 2.10 billion web pages, or 354.59 TiB of uncompressed content.

Luca Foppiano

Luca Foppiano is a Senior Engineer at the Common Crawl Foundation.

News

CommonLID Update: New Tools, Growing Impact

CommonLID, a community-built language ID benchmark, has a new website and interactive leaderboard. Its paper was accepted to ACL 2026, with a poster session on 7 July. Source code, a PyPI package, and the dataset are now available.

Laurie Burchell

Laurie is a Principal Research Engineer at the Common Crawl Foundation.

News

Common Crawl Foundation at IIPC-WAC 2026

Common Crawl was well represented with contributions at the 2026 IIPC Web Archiving Conference and General Assembly.

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

News

The Columnar Index Is Now the URL Index

We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format.

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Analysis

Introducing the AI Visibility Audit

A free guide for SEOs and GEOs on how to check whether AI systems can actually reach a site, and how to stay visible in the crawl that trains them.

Stephen Burns

Stephen Burns is Web Intelligence Lead at the Common Crawl Foundation.

Web Graphs

Host- and Domain-Level Web Graphs March, April, and May 2026

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, April, and May 2026. The graphs consist of 262.4 million nodes and 8.1 billion edges at the host level, and 118.8 million nodes and 4.3 billion edges at the domain level.

Michael Paris

Michael is a Senior Research Engineer at the Common Crawl Foundation.

Crawl Release

May 2026 Crawl Archive Now Available

We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content.

Michael Paris

Michael is a Senior Research Engineer at the Common Crawl Foundation.

News

April 2026 Crawl Archive Now Available in a Hugging Face Storage Bucket

As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.

Malte Ostendorff

Malte is a Senior Research Engineer at Common Crawl.

News

You can now build directly on Common Crawl from the browser

Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages.

June 2026 Crawl Archive Now Available

CommonLID Update: New Tools, Growing Impact

Common Crawl Foundation at IIPC-WAC 2026

The Columnar Index Is Now the URL Index

Introducing the AI Visibility Audit

Host- and Domain-Level Web Graphs March, April, and May 2026

May 2026 Crawl Archive Now Available

April 2026 Crawl Archive Now Available in a Hugging Face Storage Bucket

You can now build directly on Common Crawl from the browser

Host- and Domain-Level Web Graphs February, March, and April 2026

April 2026 Crawl Archive Now Available

April 2026 Common Crawl Newsletter

Announcing a Change to Common Crawl Dataset Size Reporting

Host- and Domain-Level Web Graphs January, February, and March 2026

March 2026 Crawl Archive Now Available

IPv6 Adoption Across the Top 100K Web Hosts

Web Graph Statistics Gets a Proper Upgrade

Measuring Web Accessibility from Crawl Archives

Announcing the Whirlwind Tour of Common Crawl's Datasets Using Java

Host- and Domain-Level Web Graphs December 2025 and January/February 2026

Introducing the New Examples & Resources Browser

February 2026 Crawl Archive Now Available

AI Plumbers at FOSDEM’26

CC-Citations: A Visualization of Research Papers Referencing Common Crawl

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Host- and Domain-Level Web Graphs November/December 2025 and January 2026

January 2026 Crawl Archive Now Available

Web Archives for Social Sciences Datathon, Bristol

How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals

GneissWeb Annotations Examples

Common Crawl at the Mozilla Festival 2025

Host- and Domain-Level Web Graphs October, November, December 2025

December 2025 Crawl Archive Now Available

A Sampling of 2025 Research Referencing Common Crawl

Host- and Domain-Level Web Graphs September, October, and November 2025

November 2025 Crawl Archive Now Available

Common Crawl Celebrates World Digital Preservation Day

Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good

October/November 2025 Newsletter

Common Crawl Foundation at Stanford HAI

Host- and Domain-Level Web Graphs August, September, and October 2025

October 2025 Crawl Archive Now Available

Common Crawl Foundation at COLM 2025

Announcing GneissWeb Annotations

Web Languages Needing Review by Native Speakers

Host- and Domain-Level Web Graphs July, August, and September 2025

From SEO to AIO: Why Your Content Needs to Exist in AI Training Data

September 2025 Crawl Archive Now Available

Common Crawl Foundation Opt-Out Registry

Trip Report: AI_dev (Linux Foundation) August 2025

Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation

July/August 2025 Newsletter

Host- and Domain-Level Web Graphs June, July, and August 2025

August 2025 Crawl Archive Now Available

Common Crawl Foundation at ACL 2025

AI Optimization Is Here: Are You Ready for Search 2.0?

IETF 123 Report

Host- and Domain-Level Web Graphs May, June, and July 2025

July 2025 Crawl Archive Now Available

WMDQS Shared Task on Language Identification

The First WMDQS-Masakhane LangID Hackathon

Host- and Domain-Level Web Graphs April, May, and June 2025

Common Crawl at the United Nations Open Source Week, June 2025

June 2025 Crawl Archive Now Available

May/June 2025 Newsletter

Announcing the Whirlwind Tour of Common Crawl's Datasets using Python

Host- and Domain-Level Web Graphs March, April, and May 2025

May 2025 Crawl Archive Now Available

Announcing the First Workshop on Multilingual Data Quality Signals

Host- and Domain-Level Web Graphs February, March, and April 2025

April 2025 Crawl Archive Now Available

Introducing the Host Index

IIPC General Assembly & Web Archiving Conference 2025

March/April 2025 Newsletter

Providing Authenticity & Data Provenance for Common Crawl Using Blockchain: Our Work with Constellation Network

Host- and Domain-Level Web Graphs January, February, and March 2025

March 2025 Crawl Archive Now Available

Introducing Common Crawl AI Agent by ReadyAI

Submission to the UK’s Copyright and AI Consultation