Columnar Index
The Common Crawl Foundation provides two indexes for querying the Common Crawl Corpus: the CDXJ Index and the Columnar Index. This page introduces the Columnar Index and gives some examples of how to use it.
What is the Columnar Index?
The Columnar Index is one of the indexes available for querying the Common Crawl corpus. As the name suggests, it is an index to the WARC files and URLs in the Common Crawl corpus in a columnar format (Apache Parquet™). This format is well suited to efficient analytical and bulk queries of the data, saving time and computing resources.
The Parquet files of the columnar index are compatible with many analytical libraries and toolkits, such as AWS Athena, Apache Spark™, Pandas, Polars, Apache Arrow™, DuckDB, and more. That said, you can use any tool you like: the Columnar Index is free to access or download for anybody. You'll find all files for the Columnar Index at s3://commoncrawl/cc-index/table/cc-main/warc/.
This page provides examples of querying the data with AWS Athena, which is the simplest way to interact with the Columnar Index using SQL. We also have examples using other libraries for different kinds of analysis:
- Apache Spark™ is suited to analysis involving web page content (e.g. word counts). The cc-pyspark repository contains examples on how to process the Common Crawl corpus using PySpark.
- DuckDB works well for local SQL queries. The Whirlwind Tour contains examples querying the Columnar Index in this way.
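For example, a local DuckDB session along the following lines can query a single crawl partition directly from the Parquet files. This is a minimal sketch rather than part of the original guide: it assumes DuckDB's httpfs extension and S3 read access to the public commoncrawl bucket, and even though only the referenced columns are read, the amount of data transferred can still be substantial.
INSTALL httpfs;  -- one-time install of DuckDB's HTTP/S3 extension
LOAD httpfs;
SET s3_region = 'us-east-1';  -- the commoncrawl bucket is located in us-east-1

-- Count pages per registered domain for the .mt top-level domain in one crawl partition
SELECT url_host_registered_domain, COUNT(*) AS n_pages
FROM read_parquet(
    's3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-33/subset=warc/*.parquet',
    hive_partitioning = true)
WHERE url_host_tld = 'mt'
GROUP BY url_host_registered_domain
ORDER BY n_pages DESC
LIMIT 20;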
Using the Columnar Index with AWS Athena
AWS Athena is a serverless service for analyzing data on S3 using SQL. It uses Trino under the hood, which provides extra functionality, including lambda expressions. Using AWS Athena is not free, but costs for small queries are low, and you only pay for what you use: Athena bills by the amount of data scanned. The index table of a single monthly crawl contains about 300 GB of data, which puts an upper bound on the cost of a full scan (about $1.50 USD as of September 2025). Most queries only need to scan part of the data, so they are usually even cheaper than that.
You will need an AWS account to use AWS Athena: see the Athena User Guide for instructions on how to register and set it up.
Registering the Columnar Index as a Database Table
Before you can query the Columnar Index, you need to register it as a database table. The steps below show how to do this:
1. Open the Athena Query Editor. Make sure you're in the us-east-1 region where all the Common Crawl data is located.
2. If prompted, set up a query result location (see the AWS Athena Documentation).
3. Create a database by entering the command below, then pressing 'Run Query':
CREATE DATABASE ccindex
4. Make sure the new database ccindex is selected, then select 'New Query'. Create a table by entering the whole command below, then pressing 'Run Query'. This will create a table ccindex with a schema that matches the data on S3. Note that the table schema may evolve over time: the most recent schema version is available on GitHub.
CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
url_surtkey STRING,
url STRING,
url_host_name STRING,
url_host_tld STRING,
url_host_2nd_last_part STRING,
url_host_3rd_last_part STRING,
url_host_4th_last_part STRING,
url_host_5th_last_part STRING,
url_host_registry_suffix STRING,
url_host_registered_domain STRING,
url_host_private_suffix STRING,
url_host_private_domain STRING,
url_host_name_reversed STRING,
url_protocol STRING,
url_port INT,
url_path STRING,
url_query STRING,
fetch_time TIMESTAMP,
fetch_status SMALLINT,
fetch_redirect STRING,
content_digest STRING,
content_mime_type STRING,
content_mime_detected STRING,
content_charset STRING,
content_languages STRING,
content_truncated STRING,
warc_filename STRING,
warc_record_offset INT,
warc_record_length INT,
warc_segment STRING)
PARTITIONED BY (
crawl STRING,
subset STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';
5. In the command above, there are two PARTITIONED BY columns: crawl and subset. These correspond to subdirectories in the structure of the Columnar Index, one for every monthly crawl and one for each subset. Partitions make it easier to add each new monthly crawl and help limit the cost of querying the data. To make Athena recognise the data partitions on S3, and to update the table as new crawls are added, run the command:
MSCK REPAIR TABLE ccindex
Each month, a new partition (a "directory") is added, e.g. crawl=CC-MAIN-2025-33/. The new partition is not visible or searchable until it has been discovered by the REPAIR TABLE command. Running the command shows which partitions have been newly discovered, e.g.:
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2025-33/subset=crawldiagnostics
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2025-33/subset=robotstxt
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2025-33/subset=warc
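To check which partitions the table currently includes, you can list them; the statement below uses standard Athena syntax and is not part of the original guide.
SHOW PARTITIONS ccindex
Remember to re-run MSCK REPAIR TABLE ccindex whenever a new monthly crawl is released, so that its partitions become queryable.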
Querying the Columnar Index
Query 1: Counting Pages Per Domain
Time to run the first query! This query counts the number of pages per domain within a single top-level domain, .mt (Malta). Enter the following command and then press 'Run Query' as before:
SELECT COUNT(*) AS count,
url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2025-33'
AND subset = 'warc'
AND url_host_tld = 'mt'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC
The result should appear in a matter of seconds. The query scanned only about 240 KB, so it cost a fraction of a cent. Notice that it filters the data by partition (crawl = 'CC-MAIN-2025-33') and restricts itself to a small top-level domain for testing (url_host_tld = 'mt'). It's good practice to develop more complex queries with such filters in place to keep the cost of trial runs low.
Query 2: Finding Domains with Multilingual Content
The next example demonstrates the power of Trino functions by looking for domains with multilingual content. It uses regular expressions to identify ISO 639-1 language codes in the URL: e.g. the "en" in https://example.com/about/en/page.html.
For demonstration purposes, the query is restricted to a single, small top-level domain (.va). Run the following query:
SELECT url_host_registered_domain AS domain,
COUNT(DISTINCT(url_path_lang)) as n_lang,
COUNT(*) as n_pages,
histogram(url_path_lang) as lang_counts
FROM "ccindex"."ccindex",
UNNEST(regexp_extract_all(url_path, '(?<=/)(?:[a-z][a-z])(?=/)')) AS t (url_path_lang)
WHERE crawl = 'CC-MAIN-2025-33'
AND subset = 'warc'
AND url_host_registry_suffix = 'va'
GROUP BY url_host_registered_domain
HAVING COUNT(*) >= 100
AND COUNT(DISTINCT(url_path_lang)) >= 1
ORDER BY n_pages DESC;
The result should contain four columns:
- domain: the registered domain (e.g. vatican.va, vaticannews.va)
- n_lang: a count of distinct languages detected in each domain
- n_pages: a count of pages in each domain
- lang_counts: a map of language codes and their counts (e.g. {de=2698, hi=1, ru=13, ...})
The magic happens in the FROM clause: regexp_extract_all pulls out all two-letter path elements (e.g. /en/), and UNNEST unrolls them into a new column url_path_lang, producing one row per element, so a URL with two or more matching path elements yields multiple rows. The aggregation then counts pages and distinct languages per domain and builds a histogram of the language codes.
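To see the mechanics in isolation, the following small snippet (an illustration added here, not part of the original example) applies the same regular expression and UNNEST to a literal path; the two-letter elements fr and en each become a row of their own:
SELECT path, lang
FROM (VALUES '/fr/about/en/page.html') AS t (path),
     UNNEST(regexp_extract_all(path, '(?<=/)(?:[a-z][a-z])(?=/)')) AS u (lang);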
Resources
- Path to the index on AWS S3: s3://commoncrawl/cc-index/table/cc-main/warc/
- Information about the schema, including an explanation of the fields, is available on Common Crawl's Data Bucket.
- The current Columnar Index schema is available on GitHub. Newer fields contain a since field indicating when they were first used. The most recent schema can also be used to access the index partitions of older crawls, and it is usually even safe to access columns introduced later; just be aware that the values in such columns are empty, or null!
- The cc-pyspark repository contains examples on interacting with the Columnar Index using PySpark.
- The cc-index-table repository contains the code to build the Columnar Index from the older CDXJ index. Further examples on querying the Columnar Index using SQL are available in the same repository.
Acknowledgements
This guide was based on the original blog post announcing the release of an Index to WARC Files and URLs in Columnar Format.
Background information about the Columnar Index was derived from the Whirlwind Tour of Common Crawl's Datasets using Python and the poster presentation from the 2019 IIPC Web Archiving Conference.
The second AWS Athena example is taken from the cc-index-table repository.