April 6, 2026

April 2026 Common Crawl Newsletter

Check out our newsletter for April 2026, with updates on what we've been up to.

Jen English is a seasoned professional with a core competency in web content curation, web crawling, taxonomies, and ontology creation.

Here at Common Crawl, we have been busy in the first quarter of 2026 improving our tools, building new resources, publishing studies, and, of course, releasing our monthly crawl archives and our host- and domain-level Web Graphs.

Web Graph Statistics Gets a Proper Upgrade

Our Web Graph Statistics site has been updated with interactive charts, a domain lookup tool for tracking harmonic centrality and PageRank over time, mobile improvements, unified rank tables with OR filtering, and merged degree plots. More details in our blog post.

Introducing the New Examples & Resources Browser

We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place. Search, filter by type or language, sort, and share links. We welcome community submissions.

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

In February, we announced the release of CommonLID, a language identification benchmark for the web, covering 109 languages. CommonLID was developed in collaboration with multiple open-source organizations and language community groups. See our announcement for more.

CC-Citations: A Visualization of Research Papers Referencing Common Crawl

We released an interactive visualization of thousands of research papers using or citing Common Crawl data. Learn more in our blog post, and browse the visualization on our Hugging Face space.

Measuring Web Accessibility from Crawl Archives

*"What can we learn about accessibility from crawl archives?"*

A WCAG colour contrast audit of 240 top domains using Common Crawl's February 2026 archive found that four in ten colour pairings fall short of accessibility thresholds. Only one in five sites is fully compliant. Read the paper on arXiv, and find more details and links in our blog post.
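For readers unfamiliar with how a colour pairing is scored, here is a minimal Python sketch of the WCAG 2.x contrast-ratio formula. The function names are illustrative, and the 4.5:1 (normal text) and 3:1 (large text) level AA thresholds are an assumption on our part: the audit's exact criteria are described in the paper and may differ.

```python
# Sketch of the WCAG 2.x contrast-ratio calculation.
# Function names are illustrative, not taken from the audit's code.

def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel (0-255) to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple) -> float:
    """Relative luminance of an (R, G, B) colour, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple, bg: tuple) -> float:
    """Contrast ratio between two colours, from 1.0 (identical) to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg: tuple, bg: tuple, large_text: bool = False) -> bool:
    """Assumed thresholds: WCAG 2.x level AA (4.5:1 normal text, 3:1 large text)."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)


if __name__ == "__main__":
    black, white = (0, 0, 0), (255, 255, 255)
    print(round(contrast_ratio(black, white), 2))  # 21.0
    print(passes_aa((0x77, 0x77, 0x77), white))    # mid-grey text on white
```

The ratio is symmetric, so it does not matter which colour is passed as foreground and which as background.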

IPv6 Adoption Across the Top 100K Web Hosts

We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the long tail. Read more in our blog post.
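As a rough illustration of what a first probing step might look like, the Python sketch below asks the resolver for a host's IPv6 addresses. This is an assumption-laden sketch, not the method used in the study: it only checks whether IPv6 addresses resolve, which is weaker than the "fully reachable" criterion above, and the function name and the `resolve` parameter are hypothetical.

```python
import socket
from typing import Callable, List


def ipv6_addresses(
    host: str,
    port: int = 443,
    resolve: Callable = socket.getaddrinfo,  # injectable so it can be stubbed in tests
) -> List[str]:
    """Return the sorted, de-duplicated IPv6 addresses a host resolves to.

    An empty list means the lookup failed or returned no IPv6 addresses.
    """
    try:
        infos = resolve(host, port, family=socket.AF_INET6, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET6 the sockaddr is (address, port, flowinfo, scope_id).
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Requires network access; the result depends on the local resolver.
    print(ipv6_addresses("commoncrawl.org"))
```

A fuller probe would go on to open a connection to each returned address, since a published IPv6 address does not guarantee that the host answers on it.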

Whirlwind Tour of Common Crawl's Datasets Using Java

We introduced whirlwind-java, the second installment in our Whirlwind Tour series. It covers crawl structure, index access, and content extraction, giving developers a practical foundation for building Java-based data workflows. This follows our Whirlwind Tour in Python. Keep an eye out for our Rust version, coming in the future.

This release was authored by:

Jen English

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
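The thresholds above can be turned into a small helper for flagging records that may have been cut off. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, crawl IDs are assumed to follow the `CC-MAIN-YYYY-WW` pattern, and CC-MAIN-2025-13 is taken to be the first crawl with the 5 MiB limit.

```python
import re

MIB = 1024 * 1024

# Assumption: crawl IDs look like "CC-MAIN-2025-13" (year, then week number).
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit, in bytes, that applied to the given crawl."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Assumption: CC-MAIN-2025-13 (March 2025) is the first crawl with the 5 MiB limit.
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic: a payload at or above the limit was probably cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)


if __name__ == "__main__":
    print(truncation_limit_bytes("CC-MAIN-2025-08"))  # 1048576
    print(truncation_limit_bytes("CC-MAIN-2025-13"))  # 5242880
```

A length check like this is only a heuristic. Where available, the `WARC-Truncated` header on a WARC record is a more direct signal that content was cut off.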
