August 13, 2025

Common Crawl Foundation at ACL 2025

The Common Crawl team attended the 63rd Annual Meeting of the Association for Computational Linguistics in Vienna, presenting recently published work and strengthening links with the research community.
Laurie Burchell
Laurie is a Senior Research Engineer with Common Crawl.

From 27 July to 1 August 2025, Laurie, Pedro, and Malte from Common Crawl’s engineering team attended the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) in Vienna, Austria. ACL is one of the biggest and most prestigious conferences in the field of natural language processing (NLP), with over 6,000 attendees and more than 3,000 accepted papers!

The programme featured keynote talks, oral presentations, poster sessions, and social events, plus tutorials on the Sunday before the conference and two days of workshops directly afterwards. It was a great opportunity to learn about the latest work in NLP and to build partnerships with the research community.

Left to right: Malte Ostendorff, Laurie Burchell, and Pedro Ortiz Suarez at ACL 2025 in Vienna

Research With and By Common Crawl

Many of the papers featured at ACL 2025 made use of Common Crawl’s data products, either indirectly through their use of large language models (LLMs) trained on our data, or directly as a key part of their research. To give just one example, in "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset" (Su et al., 2025), the authors use our data as the basis for a high-quality English LLM pretraining dataset, combining smart data filtering with synthetic rephrasing to improve downstream task performance. We also received a great deal of positive informal feedback from attendees, many of whom told us how Common Crawl’s open data had been key to making their research happen.

We were also very pleased to have three papers by Common Crawl team members presented at ACL!

Laurie Burchell with a poster on HPLT v2

"An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)" (Burchell et al., 2025): presenting HPLT v2, a large-scale collection of high-quality multilingual monolingual and parallel corpora derived from Common Crawl and Internet Archive data.

Pedro Ortiz Suarez with a poster on mOSCAR

"mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus" (Futeral et al., 2025): introducing the mOSCAR dataset: the first large-scale multilingual and multimodal document corpus crawled from the web.

Left to right: Malte Ostendorff, Pedro Ortiz Suarez, Ekaterina Borisova, Georg Rehm, Nils Feldhus, and Raia Abu Ahmad, with their award for “best paper runner up” at the 4th Table Representation Learning Workshop

"Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data" (Borisova et al., 2025): investigating the effectiveness of both text-based and multimodal LLMs on table understanding tasks through a cross-domain and cross-modality evaluation. This paper was the runner up for the best paper award at the Fourth Table Representation Learning Workshop! 🏆

Next Steps

We had a great time at ACL 2025, strengthening our connections within the NLP community and exploring the latest work in the field. We look forward to attending more conferences. Come and find us at the First Workshop on Data Quality Signals, which we are co-organising at COLM 2025!

