Columnar Index

The Common Crawl Foundation provides two indexes for querying the Common Crawl Corpus: the CDXJ Index and the Columnar Index. This page introduces the Columnar Index and gives some examples of how to use it.

What is the Columnar Index?

The Columnar Index is one of the two indexes available for querying the Common Crawl corpus. As the name suggests, it is an index to the WARC files and URLs of the corpus, stored in a columnar format (Apache Parquet™). This format is well suited to efficient analytical and bulk queries of the data, saving time and computing resources.

The Parquet files of the Columnar Index are compatible with many analytical tools and libraries, such as AWS Athena, Apache Spark™, Pandas, Polars, Apache Arrow™, DuckDB, and more. That said, you can use any tool you like: the Columnar Index is free for anyone to access or download. You'll find all files for the Columnar Index at s3://commoncrawl/cc-index/table/cc-main/warc/.

This page provides examples of querying the data with AWS Athena, which is the simplest way to interact with the Columnar Index using SQL. We also have examples using other libraries for different kinds of analysis:

  • Apache Spark™ is suited to analysis involving web page content (e.g. word counts). The cc-pyspark repository contains examples of how to process the Common Crawl corpus using PySpark.
  • DuckDB works well for local SQL queries. The Whirlwind Tour contains examples of querying the Columnar Index in this way (see the sketch just below).
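
As a taste of the local route, here is a small DuckDB sketch (assuming a recent DuckDB with the httpfs extension and anonymous access to the public commoncrawl bucket; the crawl CC-MAIN-2025-33 and the .mt filter are only illustrations) that counts pages per registered domain directly from the Parquet files:

INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';

-- Count pages per registered domain within the .mt top-level domain,
-- reading the public Parquet files directly (this transfers data over
-- the network and may take a while for a full monthly crawl).
SELECT url_host_registered_domain,
       COUNT(*) AS n_pages
FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-33/subset=warc/*.parquet')
WHERE url_host_tld = 'mt'
GROUP BY url_host_registered_domain
ORDER BY n_pages DESC
LIMIT 20;

Reading a full monthly crawl over the network is slow; for bulk work, Athena or a local copy of the relevant Parquet files is usually the better choice.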

Using the Columnar Index with AWS Athena

AWS Athena is a serverless service for analyzing data on S3 using SQL. It uses Trino under the hood, which provides extra functionality, including lambda expressions. Using AWS Athena is not free, but costs for small queries are low, and you only pay for what you use. The index table of a single monthly crawl contains about 300 GB of data, which defines an upper bound on the cost of a query (about $1.50 USD as of September 2025). Most queries only need to scan part of the data, so they are usually even cheaper than that.

You will need an AWS account to use AWS Athena: see the Athena User Guide for instructions on how to register and set it up.

Registering the Columnar Index as a Database Table

To query the Columnar Index with AWS Athena, you first need to register it as a database table. The steps below demonstrate how to do this:

1. Open the Athena Query Editor. Make sure you're in the us-east-1 region where all the Common Crawl data is located.

2. If prompted, set up a query result location (see the AWS Athena Documentation).

3. Create a database by entering the command below, then pressing 'Run Query':

CREATE DATABASE ccindex

4. Make sure the new database ccindex is selected, then select 'New Query'. Create a table by entering the whole command below, then pressing 'Run Query'. This will create a table ccindex with a schema that matches the data on S3. Note that the table schema may evolve over time: the most recent schema version is available on GitHub.

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
      url_surtkey                   STRING,
      url                           STRING,
      url_host_name                 STRING,
      url_host_tld                  STRING,
      url_host_2nd_last_part        STRING,
      url_host_3rd_last_part        STRING,
      url_host_4th_last_part        STRING,
      url_host_5th_last_part        STRING,
      url_host_registry_suffix      STRING,
      url_host_registered_domain    STRING,
      url_host_private_suffix       STRING,
      url_host_private_domain       STRING,
      url_host_name_reversed        STRING,
      url_protocol                  STRING,
      url_port                      INT,
      url_path                      STRING,
      url_query                     STRING,
      fetch_time                    TIMESTAMP,
      fetch_status                  SMALLINT,
      fetch_redirect                STRING,
      content_digest                STRING,
      content_mime_type             STRING,
      content_mime_detected         STRING,
      content_charset               STRING,
      content_languages             STRING,
      content_truncated             STRING,
      warc_filename                 STRING,
      warc_record_offset            INT,
      warc_record_length            INT,
      warc_segment                  STRING)
    PARTITIONED BY (
      crawl                         STRING,
      subset                        STRING)
    STORED AS parquet
    LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

5. In the command above, there are two PARTITIONED BY columns: crawl and subset. These correspond to subdirectories in the structure of the Columnar Index: one for every monthly crawl and, within it, one for each subset. Partitions make it easier to update the index each month and help limit the cost of querying the data. To make Athena recognize the data partitions on S3, and to update the table as new crawls are added, run the command:

MSCK REPAIR TABLE ccindex

Each month, a new partition (a "directory") is added: e.g. crawl=CC-MAIN-2025-33/. The new partition is not visible or searchable until it has been discovered by the REPAIR TABLE command. Running the command shows you which partitions have been newly discovered, e.g.:

Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2025-33/subset=crawldiagnostics
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2025-33/subset=robotstxt
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2025-33/subset=warc
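
If you prefer not to scan the whole table location, you can instead register a single partition explicitly and then list the partitions Athena knows about. Here is a sketch using the CC-MAIN-2025-33 crawl as an example:

-- register one specific partition (instead of repairing the whole table)
ALTER TABLE ccindex ADD IF NOT EXISTS
  PARTITION (crawl = 'CC-MAIN-2025-33', subset = 'warc')
  LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-33/subset=warc/';

-- list the partitions Athena currently knows about
SHOW PARTITIONS ccindex;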

Querying the Columnar Index

Query 1: Counting Pages Per Domain

Time to run the first query! This query counts the number of pages per domain within a single top-level domain, .mt (Malta). Enter the following command and then press 'Run Query' as before:

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2025-33'
  AND subset = 'warc'
  AND url_host_tld = 'mt'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC;

The result should appear in a matter of seconds. This query scanned only about 240 KB, meaning it cost a fraction of a cent. Notice that the command filters the data by partition (crawl = 'CC-MAIN-2025-33') and selects a small top-level domain for testing (url_host_tld = 'mt'). It's good practice to develop more complex queries with such filters in place to keep the cost of trial runs low.
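
The index also tells you where each capture lives in the corpus. As a follow-up sketch (the domain below is only a placeholder; substitute one from the previous result), you can select the WARC file name, record offset, and record length needed to fetch individual records:

-- WARC record locations for one registered domain
SELECT url,
       warc_filename,
       warc_record_offset,
       warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2025-33'
  AND subset = 'warc'
  AND url_host_registered_domain = 'example.com.mt' -- placeholder: use a domain from the previous query
LIMIT 100;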

Query 2: Finding Domains with Multilingual Content

The next example demonstrates the power of Trino functions by looking for domains with multilingual content. It uses regular expressions to identify ISO 639-1 language codes in the URL: e.g. the "en" in https://example.com/about/en/page.html.

For demonstration purposes, the query is restricted to a single, small top-level domain (.va). Run the following query:

SELECT url_host_registered_domain AS domain,
       COUNT(DISTINCT(url_path_lang)) as n_lang,
       COUNT(*) as n_pages,
       histogram(url_path_lang) as lang_counts
FROM "ccindex"."ccindex",
  UNNEST(regexp_extract_all(url_path, '(?<=/)(?:[a-z][a-z])(?=/)')) AS t (url_path_lang)
WHERE crawl = 'CC-MAIN-2025-33'
  AND subset = 'warc'
  AND url_host_registry_suffix = 'va'
GROUP BY url_host_registered_domain
HAVING COUNT(*) >= 100
   AND COUNT(DISTINCT(url_path_lang)) >= 2
ORDER BY n_pages DESC;

The result should contain four columns:

  • domain: (e.g. vatican.va, vaticannews.va)
  • n_lang: (a count of distinct languages detected in each domain)
  • n_pages: (a count of pages in each domain)
  • lang_counts: (a hash map of language codes and their counts, e.g. {de=2698, hi=1, ru=13, ...})

The magic is done by regexp_extract_all, which extracts all two-letter path elements (e.g. the en in /en/), and by UNNEST, which unrolls the extracted elements into a new column url_path_lang. If two or more path elements are found in a URL, this creates multiple rows. Finally, the query counts pages and unique languages per domain and generates a histogram of language codes.
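
Trino's lambda expressions, mentioned earlier, fit the same kind of analysis. As a small sketch (assuming the content_languages column holds comma-separated ISO-639-3 codes, so 'mlt' would stand for Maltese), the query below counts .mt pages whose detected content languages include Maltese:

SELECT COUNT(*) AS pages_with_maltese
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2025-33'
  AND subset = 'warc'
  AND url_host_tld = 'mt'
  -- split the comma-separated codes, keep only 'mlt' via a lambda,
  -- and test whether anything remains
  AND cardinality(filter(split(content_languages, ','), x -> x = 'mlt')) > 0;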

Resources

  • Path to the index on AWS S3: s3://commoncrawl/cc-index/table/cc-main/warc/.
  • Information about the schema, including an explanation of the fields, is available on Common Crawl's Data Bucket.
  • The current Columnar Index schema is available on GitHub. Newer fields carry a since attribute indicating when they were first used. The most recent schema can also be used to access the index partitions of older crawls; it is usually even safe to access columns that were introduced later, but be aware that their values will then be empty (null).
  • The cc-pyspark repository contains examples on interacting with the Columnar Index using PySpark.
  • The cc-index-table repository contains the code to build the Columnar Index from the older CDXJ index. Further examples on querying the Columnar Index using SQL are available in the same repository.

Acknowledgements

This guide was based on the original blog post announcing the release of an Index to WARC Files and URLs in Columnar Format.

Background information about the Columnar Index was derived from the Whirlwind Tour of Common Crawl's Datasets using Python and the poster presentation from the 2019 IIPC Web Archiving Conference.

The second AWS Athena example is taken from the cc-index-table repository.
