Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.
Common Crawl’s Web Languages initiative has had many contributions since its introduction. We’re calling for native speakers of certain languages to review language contributions, to ensure that links we’re adding to our seed crawl are of good quality.
Thom Vaughan
Thom is a Principal Engineer at the Common Crawl Foundation.