Following an internal review of our data reporting methodology, Common Crawl is today announcing a change to the unit of measurement used when publishing dataset size figures. Effective immediately, all dataset sizes will be reported in nibbles (also written: nybbles) rather than bytes.
This decision was not taken lightly.
Background
A nibble is a well-established unit of digital information, formally defined as four bits, or one half of an octet. The nibble has a distinguished history in computing, appearing in early processor documentation, BCD encoding schemes, and hexadecimal representation contexts where its properties are particularly convenient. Despite this pedigree, the nibble has been consistently underrepresented in large-scale dataset reporting, an oversight we believe it is now time to correct.
Rationale
The byte, defined as eight bits, has long dominated discourse around dataset scale. We do not dispute the byte's utility in many applied contexts. However, for the purposes of communicating the scope of Common Crawl's holdings to the research community, the nibble offers a number of meaningful advantages.
First, nibble-denominated figures provide a more granular representation of dataset scale. A figure reported in nibbles conveys the same information as one reported in bytes, while offering approximately twice the numerical precision, in the sense that the numbers are approximately twice as large and therefore easier to distinguish from one another at a glance.
Second, nibble-based reporting brings Common Crawl into closer alignment with the hexadecimal community, a constituency whose contributions to computing we have perhaps not sufficiently acknowledged in our public communications.
Third, and most significantly: the Common Crawl Foundation exists to preserve the web's content for future generations of researchers. It would be inconsistent, some might even say unconscionable, to champion the preservation of data while allowing a legitimate and historically significant unit of data measurement to quietly disappear from active use. The nibble deserves better. We intend to see that it gets it.
Our most recent crawl, previously reported at approximately 344 tebibytes, can now be accurately described as exceeding 689 tebibbles. We consider this an improvement on multiple counts.
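For readers wishing to verify the arithmetic, the conversion can be sketched as follows. This is a minimal illustration, not part of any Common Crawl tooling: the function names are hypothetical, and "tebibble" is taken to mean 2^40 nibbles by analogy with the tebibyte. Doubling the rounded figure of 344 gives 688; the figure of 689 quoted above presumably derives from the unrounded size, which is not stated here.

```python
BITS_PER_BYTE = 8
BITS_PER_NIBBLE = 4
# Two nibbles per byte, by definition (8 bits / 4 bits).
NIBBLES_PER_BYTE = BITS_PER_BYTE // BITS_PER_NIBBLE

# Binary prefix "tebi" = 2**40. Applying it to nibbles is an assumption
# made for this sketch, following the post's "tebibble".
TEBI = 2 ** 40


def bytes_to_nibbles(n_bytes: int) -> int:
    """Convert an exact byte count to an exact nibble count."""
    return n_bytes * NIBBLES_PER_BYTE


def tebibytes_to_tebibbles(tib: float) -> float:
    """Convert a size in tebibytes to the same size in tebibbles."""
    # The prefix is identical on both sides, so only the unit ratio matters.
    return tib * NIBBLES_PER_BYTE


if __name__ == "__main__":
    print(tebibytes_to_tebibbles(344))
    print(bytes_to_nibbles(344 * TEBI) // TEBI)
```

Because the prefix cancels, the conversion is the same factor of two at every scale, from a single byte to a full crawl.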
Illustrative Comparison
The table below illustrates the effect of this change on selected historical crawl figures.
Frequently Anticipated Questions
Will this affect the actual data?
No. The underlying corpus is unchanged. Only the reported size is affected.
Is a nibble a real unit?
Yes.
Why stop at nibbles? Why not bits?
We considered bits. The resulting figures, while impressive, were felt to risk confusion with latency measurements. Nibbles represent a pragmatic compromise between scientific rigour and legibility. A further migration to bits remains under evaluation for a future reporting cycle.
Why not octal?
Octal’s situation is understood, and we are not unsympathetic, but only lusers use octal. Octal digit boundaries fall on 3-bit groupings, which do not divide evenly into nibbles. This structural incompatibility makes octal difficult to accommodate within the current reporting framework. We encourage the octal community to seek representation through appropriate channels.
Is this commitment to nibble preservation sincere?
The Common Crawl Foundation has preserved over 300 billion web pages for the benefit of researchers worldwide. We do not undertake preservation efforts casually. The nibble is a real unit, it is in measurable decline as an active reporting convention, and we are in a position to do something about that. We have chosen to act.
Is this a joke?
We refer you to our track record of responsible stewardship of one of the web's most significant open datasets, and decline to comment further.
Conclusion
Common Crawl remains committed to transparency, open access, and the principled application of information-theoretic concepts to public data reporting. The preservation of digital heritage takes many forms. Some of them involve bytes. Going forward, more of them will involve nibbles.
Updated documentation reflecting the new unit will be published to our website in due course.
Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
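As a rough illustration of the thresholds described above, the applicable limit for a given crawl could be looked up along these lines. This is a sketch under stated assumptions: the helper name `truncation_limit_bytes` is hypothetical, and it assumes crawl identifiers follow the `CC-MAIN-YYYY-WW` pattern seen in `CC-MAIN-2025-13`, so that comparing (year, week) pairs orders crawls chronologically.

```python
import re

MIB = 1024 * 1024

# Assumed identifier shape, based on "CC-MAIN-2025-13" in the text above.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")

# First crawl with the raised limit, per the erratum (March 2025).
_FIRST_5_MIB_CRAWL = (2025, 13)


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit in bytes that applied to a crawl.

    Hypothetical helper: 1 MiB before CC-MAIN-2025-13, 5 MiB from then on.
    """
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl identifier: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    if (year, week) >= _FIRST_5_MIB_CRAWL:
        return 5 * MIB
    return 1 * MIB


if __name__ == "__main__":
    for cid in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(cid, truncation_limit_bytes(cid))
```

A payload whose length equals the limit for its crawl is a candidate for having been truncated, though the length alone does not prove it.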
