September 29, 2025

Web Languages Needing Review by Native Speakers

Note: this post has been marked as obsolete.
Common Crawl’s Web Languages initiative has had many contributions since its introduction. We’re calling for native speakers of certain languages to review language contributions, to ensure that links we’re adding to our seed crawl are of good quality.
Thom Vaughan
Thom is a Principal Engineer at the Common Crawl Foundation.

Since October 2024, we’ve been gathering URLs in languages other than English (or “LOTE” for short) and adding them to our “seed crawl”, with the aim of improving the coverage of languages, communities, and cultures in our crawls. We’re doing this via our Web Languages Project (introduced in a blog post in December 2024). So far we’ve had 266 contributions from 67 people, thanks to whom we’ve added over 4,700 LOTE URLs to our seed list.

Since August 2018 we have used the Compact Language Detector 2 (CLD2) to annotate the language(s) in which a page is written. It can identify 160 different languages (up to three per document), and the annotations use ISO 639-3 language codes.
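As a rough illustration of how a consumer might handle such an annotation, the sketch below parses a string of ISO 639-3 codes and enforces the three-language limit described above. The comma-separated input format and the name `parse_language_annotation` are assumptions made for this example, not a description of our pipeline or of CLD2's own API.

```python
import re

MAX_LANGUAGES = 3  # up to three languages are reported per document
ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")  # checks the shape only, not the registry


def parse_language_annotation(value: str) -> list[str]:
    """Split a hypothetical comma-separated annotation into ISO 639-3 codes."""
    codes = [part.strip() for part in value.split(",") if part.strip()]
    if len(codes) > MAX_LANGUAGES:
        raise ValueError(f"expected at most {MAX_LANGUAGES} codes, got {len(codes)}")
    for code in codes:
        if not ISO_639_3_SHAPE.fullmatch(code):
            raise ValueError(f"not shaped like an ISO 639-3 code: {code!r}")
    return codes
```

For example, `parse_language_annotation("cym,eng")` yields the codes for Welsh and English. The check only confirms that each code is three lowercase letters; it does not confirm that the code is actually assigned in ISO 639-3.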

There are currently 42 files in the Web Languages repository which need review by a native speaker (we’re counting Latin here, although lamentably there are no native speakers of Latin left). Of these, seven are for languages which CLD2 is not capable of recognising.

Language contributions which need review by a native speaker


| Language | ISO 639-3 Code | Recognised by CLD2? | Coverage in CC-MAIN-2025-38 (%) | Link to contribute |
|---|---|---|---|---|
| Achinese | ace | no | n/a | living/achinese.md |
| Albanian | sqi | yes | 0.0474 | living/albanian.md |
| Basque | eus | yes | 0.0306 | living/basque.md |
| Bosnian | bos | yes | 0.0557 | living/bosnian.md |
| Buginese | bug | no | n/a | living/buginese.md |
| Catalan | cat | yes | 0.1741 | living/catalan.md |
| Chokwe | cjk | no | n/a | living/chokwe.md |
| Cornish | cor | no | n/a | living/cornish.md |
| Croatian | hrv | yes | 0.2081 | living/croatian.md |
| Danish | dan | yes | 0.4432 | living/danish.md |
| Estonian | est | yes | 0.1170 | living/estonian.md |
| Faroese | fao | yes | 0.0045 | living/faroese.md |
| Galician | glg | yes | 0.0279 | living/galician.md |
| Icelandic | isl | yes | 0.0415 | living/icelandic.md |
| Irish | gle | yes | 0.0069 | living/irish.md |
| Japanese | jpn | yes | 5.2018 | living/japanese.md |
| Kalaallisut | kal | yes | 0.0009 | living/kalaallisut.md |
| Korean | kor | yes | 0.7754 | living/korean.md |
| Latin | lat | yes | 0.0983 | historical/latin.md |
| Lithuanian | lit | yes | 0.1601 | living/lithuanian.md |
| Luxembourgish | ltz | yes | 0.0040 | living/luxembourgish.md |
| Macedonian | mkd | yes | 0.0375 | living/macedonian.md |
| Maltese | mlt | yes | 0.0036 | living/maltese.md |
| Mandarin Chinese | cmn | no | n/a | living/mandarin_chinese.md |
| Maori | mri | yes | 0.0014 | living/maori.md |
| Norwegian | nor | yes | 0.3213 | living/norwegian.md |
| Panjabi | pan | yes | 0.0074 | living/panjabi.md |
| Polish | pol | yes | 1.6602 | living/polish.md |
| Portuguese | por | yes | 2.0696 | living/portuguese.md |
| Romansh | roh | yes | 0.0011 | living/romansh.md |
| Russian | rus | yes | 6.1083 | living/russian.md |
| Sardinian | srd | no | n/a | living/sardinian.md |
| Scottish Gaelic | gla | yes | 0.0014 | living/scottish_gaelic.md |
| Serbian | srp | yes | 0.2053 | living/serbian.md |
| Slovak | slk | yes | 0.3853 | living/slovak.md |
| Slovenian | slv | yes | 0.1264 | living/slovenian.md |
| Thai | tha | yes | 0.3842 | living/thai.md |
| Uighur | uig | yes | 0.0012 | living/uighur.md |
| Walloon | wln | no | n/a | living/walloon.md |
| Welsh | cym | yes | 0.0094 | living/welsh.md |
| Western Frisian | fry | yes | 0.0032 | living/western_frisian.md |
| Yiddish | yid | yes | 0.0019 | living/yiddish.md |

Of all the contributors, we would particularly like to thank Ethan Wenokur, Evan Pacini, Twan Goosen, and Swapnil Tripathi. We’re very grateful to them for their substantial contributions to the Web Languages project.


Erratum: Content is truncated


Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

For more details, see our truncation analysis notebook.
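The thresholds above can be expressed as a small lookup keyed on the crawl identifier. This is a minimal sketch, assuming identifiers of the form `CC-MAIN-YYYY-WW` as in CC-MAIN-2025-38; the function name `truncation_limit_bytes` is hypothetical, and identifiers in any other form are rejected rather than guessed at.

```python
import re

MIB = 1024 * 1024
# First crawl with the larger limit, per the erratum: CC-MAIN-2025-13.
NEW_LIMIT_FROM = (2025, 13)
CRAWL_ID = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit that applied to the given crawl."""
    match = CRAWL_ID.fullmatch(crawl_id)
    if match is None:
        raise ValueError(f"unsupported crawl identifier: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders by year first, then by week number.
    return 5 * MIB if (year, week) >= NEW_LIMIT_FROM else 1 * MIB


print(truncation_limit_bytes("CC-MAIN-2025-38"))  # 5242880
print(truncation_limit_bytes("CC-MAIN-2025-08"))  # 1048576
```

A record whose payload length equals the limit for its crawl is a reasonable candidate for having been truncated, although the limit alone does not prove it.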