Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation, and analysis of open web data accessible to researchers.
Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received.