Erratum
Missing content_truncated flag in URL indexes
Originally reported by
.
The flag that indicates whether a WARC record payload was truncated was added to our indexes (CDX Index and URL Index) in CC-MAIN-2019-47. It is missing from the indexes of all earlier crawl releases. The CDX Index calls this field "truncated"; the URL Index (previously called the "Columnar Index") calls it "content_truncated".
For more information, please refer to the blog post announcing the November 2019 crawl. The reason for the truncation is given only for truncated records, and it follows the WARC header field "WARC-Truncated".
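Consumers of the CDX Index may want to distinguish "not truncated" from "unknown because the crawl predates the flag". The following is a minimal sketch of that distinction, not an official client. It assumes crawl IDs of the form CC-MAIN-YYYY-WW and CDX lines made of a URL key, a timestamp, and a JSON object; the function names and the sample line in the usage note are hypothetical.

```python
import json
import re

# First crawl whose indexes carry the truncation flag, per this erratum.
FIRST_CRAWL_WITH_FLAG = (2019, 47)


def crawl_has_truncation_flag(crawl_id: str) -> bool:
    """Return True if the crawl is CC-MAIN-2019-47 or later.

    Assumes the crawl ID looks like CC-MAIN-YYYY-WW.
    """
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year_week = (int(match.group(1)), int(match.group(2)))
    return year_week >= FIRST_CRAWL_WITH_FLAG


def truncation_status(crawl_id: str, cdx_line: str) -> str:
    """Classify one CDX line as 'unknown', 'not truncated', or a reason.

    Assumes the line layout '<urlkey> <timestamp> <json>' and that the
    "truncated" field is present only for truncated records.
    """
    if not crawl_has_truncation_flag(crawl_id):
        # Affected crawl: the flag was never written, so absence says nothing.
        return "unknown"
    _urlkey, _timestamp, payload = cdx_line.split(" ", 2)
    fields = json.loads(payload)
    return fields.get("truncated") or "not truncated"
```

With a hypothetical line such as `com,example)/ 20191112000000 {"url": "https://example.com/", "truncated": "length"}`, the sketch reports `length` for CC-MAIN-2019-47 and `unknown` for CC-MAIN-2019-43, because an affected crawl cannot confirm or rule out truncation from the index alone.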
Affected Crawls
Affected Web Graphs
No items found.