June 25, 2024

May/June 2024 Newsletter

Note: this post has been marked as obsolete.
We’re pleased to share our newsletter for May/June 2024, featuring the latest updates, events, and highlights from our community.
Greg Lindahl
Greg is the Chief Technology Officer at the Common Crawl Foundation.
Table of Contents:
  • Common Crawl Celebrates Our 100th Crawl Since 2008!
  • AI and the Right to Learn on an Open Internet
  • Recent Research Using Common Crawl Data
  • Updates to Our Data Products – Help Wanted!
  • Volunteer for Common Crawl!

Common Crawl Celebrates Our 100th Crawl Since 2008!

Our latest crawl, May 2024, marks a milestone for Common Crawl – our 100th crawl since we began crawling in 2008! Many people have been involved in making this happen over the years, and we’d like to thank all of the emeritus members of our team: Ahad Rahna, Lisa Green, Allison Domicone, Jordan Mendelson, Stephen Merity, Julien Nioche, Sara Crouse, and Alex Xue. Thank you from all of the current members of our team!

AI and the Right to Learn on an Open Internet

Panel on the Risks to the Open Internet. Left to right: Michael Weinberg, Cara Gagliano, Richard Gingras, Michael Brawer, and moderator Mike Masnick. Photo credit: Quinn Kowitt

On April 30th, the Common Crawl Foundation hosted an event in New York for a select group of leaders in AI, technology, media, and content. The conference, co-hosted with Professor Jeff Jarvis, was intended to foster an open dialogue between stakeholders about how to achieve a common goal of supporting a right to learn on an open Internet. The one-day event, held at the Craig Newmark Graduate School of Journalism at CUNY, featured opening remarks, firestarter mini-sessions, panel discussions, demo time, and networking opportunities. Topics of discussion ranged from the risks to the open Internet, and fair use and large language model training, to smart uses of AI in journalism, and business models and solutions. The conference was sponsored by Kearney, Tola Capital, and CCIA.

Recent Research Using Common Crawl Data

Updates to Our Data Products – Help Wanted!

Our summer intern, Ford Heilizer, has been hard at work making a tool that transforms our usual WARC/WAT/WET data into a table. If the first thing that you do when you download our data is to stick everything in a table, please contact us at [email protected]. We'd love any advice you have to offer!

We are also thinking about a project to make a 1:1 round-trippable format that converts a WARC to files in a ZIP, with the WARC metadata saved in spreadsheets. We hope this new format will be useful for users who want to process just a couple of WARCs' worth of data on a laptop. If this interests you, please contact us!

We made a couple of small updates to our existing interfaces.

If you use the Web Graph, https://index.commoncrawl.org/graphinfo.json now contains the list of crawls in each Web Graph. If you use the CDX index, https://index.commoncrawl.org/collinfo.json now has two new fields, “from” and “to”, giving the exact dates when crawling started and ended.
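
As a rough illustration of how the new “from” and “to” fields might be used, here is a minimal Python sketch that finds which crawls were running on a given date. Only the two field names come from this newsletter. The other keys in the sample ("id", "name"), the sample values, the ISO 8601 timestamp format, and the helper names `crawl_windows` and `crawls_covering` are assumptions made for the example, so check them against the live file before relying on them.

```python
import json
from datetime import datetime

# Illustrative sample shaped like a list of collinfo.json entries.
# Only the "from" and "to" field names come from the newsletter; the other
# keys and every value below are assumptions made up for this sketch.
SAMPLE = """[
  {"id": "CC-MAIN-2024-22", "name": "May 2024 Index",
   "from": "2024-05-17T23:57:31", "to": "2024-05-30T20:12:09"},
  {"id": "CC-MAIN-2013-20", "name": "Older Index"}
]"""


def crawl_windows(collinfo_text):
    """Map crawl id -> (start, end) datetimes for entries that have both fields."""
    windows = {}
    for entry in json.loads(collinfo_text):
        start, end = entry.get("from"), entry.get("to")
        if start is None or end is None:
            continue  # skip entries that do not carry the new fields
        # Assumes ISO 8601 timestamps without a timezone suffix.
        windows[entry["id"]] = (
            datetime.fromisoformat(start),
            datetime.fromisoformat(end),
        )
    return windows


def crawls_covering(windows, when):
    """Return the ids of crawls whose start/end window contains `when`."""
    return [cid for cid, (start, end) in windows.items() if start <= when <= end]
```

To try this against real data, the text passed to `crawl_windows` could be fetched with `urllib.request.urlopen("https://index.commoncrawl.org/collinfo.json").read().decode("utf-8")` in place of `SAMPLE`.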

Volunteer for Common Crawl!

Volunteers have made significant contributions to Common Crawl over the years, whether they’ve been technologists who love the data, people who have used the data and want to contribute some code in return, or researchers who have written a paper and open-sourced their code.

We also have a list of relatively simple tasks to get you started. Please contact us at [email protected] if you are interested.
