July 8, 2025

The First WMDQS-Masakhane LangID Hackathon

Note: this post has been marked as obsolete.
In June 2025 the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a virtual hackathon in partnership with Masakhane in order to collect language identification annotations for African languages.
Pedro Ortiz Suarez
Pedro is a French-Colombian mathematician, computer scientist, and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université.

Since the end of 2024, the Common Crawl Foundation has committed to expanding the language coverage of its crawls in order to facilitate the creation of web and language technologies for underrepresented languages. As part of this effort, we have already started two initiatives: the Web Languages Project, where the community can contribute URLs in underrepresented languages for our seed crawl, and the LangID Project, where users can add language identification (LangID or LID) annotations to Common Crawl data so that we can improve the models we use to annotate our crawls and discover new data.

In line with these two initiatives, we also announced the 1st Workshop on Multilingual Data Quality Signals (WMDQS), which we are organizing in collaboration with our colleagues at MLCommons, EleutherAI, and Johns Hopkins' HLTCOE. This workshop, which will be co-located with COLM 2025 in Montréal, Canada, will also host a shared task on language identification, where we expect to collect more annotations for our LangID Project and then develop, together with participants, new LangID solutions that are robust, lightweight, and open source, and that can later be maintained by us and our collaborators.

In the context of this shared task, Common Crawl and MLCommons hosted a hackathon on June 26th, to collect LangID annotations for African languages, in collaboration with our friends and colleagues at Masakhane and the Data Science for Social Impact research group at the University of Pretoria. We’re pleased to share that the hackathon was a great success, generating approximately 5,000 document annotations and bringing our total to over 17,000 across more than 70 languages. We attribute much of this momentum to the hackathon’s impact.

This hackathon allowed us to set a solid foundation for our dataset and our shared task, and to bring together a large community working towards web and language technologies for African languages. As such, we would like to express our deepest gratitude to Masakhane, the Data Science for Social Impact research group, and all of the community members who participated in the hackathon and those who have continued to contribute annotations and feedback. We would also like to thank Idris Abdulmumin and Vukosi Marivate in particular, who made this hackathon possible.

The WMDQS LangID shared task remains open to contributions, and for those who would like to participate directly, especially with new LangID models and solutions, registration is open until the 21st of July.


Erratum: Content is truncated


Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

For more details, see our truncation analysis notebook.
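
The thresholds described in this erratum can be sketched in a few lines of Python. This is only an illustration, not part of any Common Crawl tooling: the function names are hypothetical, the crawl ID pattern `CC-MAIN-<year>-<number>` is assumed from the ID quoted above, and the length check is a rough heuristic. Where a WARC record carries a `WARC-Truncated` header, that is likely the more reliable signal.

```python
import re

MIB = 1024 * 1024

# Thresholds as stated in the erratum above.
OLD_LIMIT_BYTES = 1 * MIB  # crawls before CC-MAIN-2025-13
NEW_LIMIT_BYTES = 5 * MIB  # CC-MAIN-2025-13 and later

# Assumption: crawl IDs look like "CC-MAIN-<4-digit year>-<2-digit number>".
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit that applied to the given crawl."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year, number = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders crawls by year first, then by number.
    if (year, number) >= (2025, 13):
        return NEW_LIMIT_BYTES
    return OLD_LIMIT_BYTES


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic (hypothetical): a payload at or above the limit was probably cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)
```

For example, `truncation_limit_bytes("CC-MAIN-2024-51")` gives the 1 MiB limit, while `truncation_limit_bytes("CC-MAIN-2025-13")` gives the 5 MiB limit.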