Content-defined chunking (CDC) plays a central role in enabling deduplication inside a Xet-backed repository. The idea is simple: break each file's data into chunks, store only the unique ones, and reap the benefits.
In practice, it's more complex. If we focused solely on maximizing deduplication, the design would call for the smallest possible chunk size, but doing so would create significant overhead for both the infrastructure and the builders on the Hub.
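As a rough illustration of the chunking step, here is a toy content-defined chunker in Python. It is only a sketch of the idea, not the rolling hash or parameters xet-core (written in Rust) actually uses; the point is that boundaries depend on the bytes themselves, so an edit early in a file only shifts boundaries locally instead of invalidating every chunk that follows.

```python
# Toy content-defined chunking sketch; parameters and hash are illustrative only.
import hashlib

TARGET = 64 * 1024                       # aim for ~64KB chunks on average
MASK = TARGET - 1                        # boundary when the low 16 bits of the hash are zero
MIN_SIZE, MAX_SIZE = 8 * 1024, 128 * 1024

def chunk(data: bytes) -> list[bytes]:
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF        # simple rolling hash over recent bytes
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])      # content-defined boundary
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])               # trailing partial chunk
    return chunks

def dedup(files: list[bytes]) -> dict[str, bytes]:
    """Store only unique chunks, keyed by their content hash."""
    store = {}
    for data in files:
        for c in chunk(data):
            store.setdefault(hashlib.sha256(c).hexdigest(), c)
    return store
```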
On Hugging Face's Xet team, we're bringing CDC from theory to production to deliver faster uploads and downloads for AI builders (by a factor of 2-3x in some cases). Our guiding principle is simple: enable rapid experimentation and collaboration for teams building and iterating on models and datasets. That means focusing on more than just deduplication; we're optimizing how data moves across the network, how it's stored, and the entire development experience.
The Realities of Scaling Deduplication
Imagine uploading a 200GB repository to the Hub. Today, there are a number of ways to do that, but all of them use a file-centric approach. To bring faster file transfers to the Hub, we have open-sourced xet-core and hf_xet, an integration with huggingface_hub that takes a chunk-based approach, written in Rust.
Consider a 200GB repository made up of unique chunks: at ~64KB per chunk, that's 3 million entries in the content-addressed store (CAS) backing all repositories. If a new version of a model is uploaded, or a branch in the repository is created with different data, more unique chunks are added, driving up the number of entries in the CAS.
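As a quick sanity check, those chunk counts are just total bytes divided by the average chunk size (the Hub-wide figure appears in the next paragraph):

```python
# Back-of-the-envelope math for the chunk counts in this section (decimal units, ~64KB chunks).
CHUNK = 64 * 1024                        # ~64KB per chunk
print(200 * 10**9 // CHUNK)              # one 200GB repo  -> ~3 million CAS entries
print(45 * 10**15 // CHUNK)              # ~45PB Hub-wide  -> ~690 billion chunks (cited below)
```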
With nearly 45PB spread across 2 million model, dataset, and space repositories on the Hub, a purely chunk-based approach could mean 690 billion chunks. Managing this volume of content using chunks alone is simply not viable, due to:
- Network Overheads: If each chunk is downloaded or uploaded individually, tens of millions of requests are generated per upload and download, overwhelming both the client and the server. Even batching queries simply shifts the problem to the storage layer.
- Infrastructure Overheads: A naive CAS that tracks chunks individually would require billions of entries, resulting in steep monthly bills on services like DynamoDB or S3. At Hugging Face’s scale, this quickly adds up.
In short, network requests balloon, databases struggle to manage the metadata, and the cost of orchestrating each chunk skyrockets, all while you wait for your files to transfer.
Design Principles for Deduplication at Scale
These challenges result in a key realization:
Deduplication is a performance optimization, not the ultimate goal.
The real goal is to improve the experience of builders iterating and collaborating on models and datasets. The system components, from the client to the storage layer, don't need to guarantee deduplication. Instead, they leverage deduplication as one tool among many in service of that goal.
By loosening the deduplication constraint, we naturally arrive at a second design principle:
Avoid communication or storage strategies that scale 1:1 with the number of chunks.
What does this mean? We scale with aggregation.
Scaling Deduplication with Aggregation
Aggregation takes chunks and groups them, referencing them intelligently in ways that provide clever (and practical) benefits:
- Blocks: Instead of transferring and storing individual chunks, we bundle data together in blocks of up to 64MB after deduplication. Blocks are still content-addressed, but this reduces CAS entries by a factor of 1,000.
- Shards: Shards provide the mapping between files and chunks (referencing blocks as they do so). This allows us to determine which parts of a file have changed by consulting shards generated from past uploads. When chunks are already known to exist in the CAS, they are skipped, slashing unnecessary transfers and queries. (A rough sketch of both ideas follows this list.)
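To make blocks and shards more concrete, here is a hypothetical Python sketch of the idea. It is not xet-core's actual on-disk format; the names `Block` and `aggregate` and the tuple layout in the shard are purely illustrative.

```python
# Illustrative aggregation sketch: unique chunks are packed into content-addressed
# blocks of up to 64MB, and a shard records which chunk of which file lives where.
import hashlib
from dataclasses import dataclass, field

BLOCK_LIMIT = 64 * 1024 * 1024  # 64MB cap per block

@dataclass
class Block:
    data: bytearray = field(default_factory=bytearray)
    def add(self, chunk: bytes) -> int:
        offset = len(self.data)
        self.data.extend(chunk)
        return offset

def aggregate(file_chunks: dict[str, list[bytes]]):
    """Pack unique chunks into blocks; return the blocks plus a per-file shard mapping."""
    blocks, shard, seen = [Block()], {}, {}   # seen: chunk hash -> (block index, offset, length)
    for path, chunks in file_chunks.items():
        entries = []
        for chunk in chunks:
            h = hashlib.sha256(chunk).hexdigest()
            if h not in seen:                                   # store only unique chunks
                if len(blocks[-1].data) + len(chunk) > BLOCK_LIMIT:
                    blocks.append(Block())                      # start a new block
                seen[h] = (len(blocks) - 1, blocks[-1].add(chunk), len(chunk))
            entries.append((h, *seen[h]))                       # file -> chunk -> block location
        shard[path] = entries
    return blocks, shard
```

The key property is that the CAS only ever sees blocks and shards, so the number of stored and transferred objects scales with aggregated units rather than with raw chunks.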
Together, blocks and shards unlock significant benefits. However, when someone uploads a new file, how do we know whether a chunk has been uploaded before so we can eliminate an unnecessary request? Performing a network query for every chunk is not scalable and goes against the "no 1:1" principle mentioned above.
The answer is key chunks: a 0.1% subset of all chunks, selected with a simple modulo condition on the chunk hash. We provide a global index over these key chunks and the shards they appear in, so that when a key chunk is queried, the related shard is returned to provide local deduplication. This lets us exploit spatial locality: if a key chunk is referenced in a shard, other related chunk references are likely to be found in the same shard. This further improves deduplication while reducing network and database requests.
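Here is a minimal sketch of that selection rule. The modulus of 1024 (roughly 0.1% of chunks) and the shape of `global_index` are assumptions for illustration, not the exact scheme xet-core uses.

```python
# Key-chunk selection sketch: only ~0.1% of chunk hashes trigger a lookup in the
# global index, and each hit returns a whole shard of previously seen chunks.
KEY_MODULUS = 1024  # roughly 1 in 1024 chunks (~0.1%) is a key chunk

def is_key_chunk(chunk_hash: bytes) -> bool:
    return int.from_bytes(chunk_hash[:8], "big") % KEY_MODULUS == 0

def query_dedup_hints(chunk_hashes, global_index):
    """For each key chunk found locally, fetch the shard it appears in (if any);
    the returned shards list chunks from past uploads that can be skipped."""
    hints = {}
    for h in chunk_hashes:
        if is_key_chunk(h):
            shard = global_index.get(h)       # one query per key chunk, not per chunk
            if shard is not None:
                hints.update(shard)           # shard: chunk hash -> block reference
    return hints
```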
Aggregated Deduplication in Practice
The Hub currently stores over 3.5PB of .gguf files, most of which are quantized versions of other models on the Hub. Quantized models are an interesting opportunity for deduplication because of the nature of quantization, where values are restricted to a smaller integer range and scaled. This restricts the range of values in the weight matrices, naturally leading to more repetition. Moreover, many repositories of quantized models store multiple variants (e.g., Q4_K, Q3_K, Q5_K) with a great deal of overlap.
An example of this in practice is bartowski/gemma-2-9b-it-GGUF, which contains 29 quantizations of google/gemma-2-9b-it totaling 191GB. To upload it, we use hf_xet integrated with huggingface_hub to perform chunk-level deduplication locally, then aggregate and store the data at the block level.
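If you want to try the same flow, an upload with huggingface_hub looks roughly like the snippet below; with the hf_xet package installed alongside it, the chunking and deduplication happen transparently during the push. The repository id here is a placeholder.

```python
# Illustrative upload flow (pip install -U huggingface_hub hf_xet): files pushed to a
# Xet-backed repo are chunked and deduplicated locally, then aggregated into blocks.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./gemma-2-9b-it-GGUF",           # local directory with the .gguf files
    repo_id="your-username/gemma-2-9b-it-GGUF",   # hypothetical target repository
    repo_type="model",
)
```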
Once uploaded, we can start to see some cool patterns! We've included a visualization that shows the deduplication ratio for each block. The darker the block, the more frequently parts of it are referenced across model versions. If you go to the Space hosting this visualization, hovering over any heatmap cell highlights all references to that block in orange across all models, while clicking a cell selects all other files that share blocks:
A single deduplicated block might only represent a few MB of savings, but as you can see there are many overlapping blocks! With this many blocks, that quickly adds up. Instead of uploading 191GB, the Xet-backed version of the gemma-2-9b-it-GGUF repository stores 1,515 unique blocks totaling roughly 97GB in our test CAS environment (a savings of ~94GB).
While the storage improvements are significant, the real benefit is what this means for contributors to the Hub. At 50 Mbps, the deduplication optimizations amount to a roughly 4 hour difference in upload time; a speedup of nearly 2x:
| Repo | Stored Size | Upload Time @ 50 Mbps |
|---|---|---|
| Original | 191 GB | 509 minutes |
| Xet-backed | 97 GB | 258 minutes |
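The times in the table follow directly from the link speed; here is the quick check under the 50 Mbps assumption (decimal units, ignoring protocol overhead and per-request latency):

```python
# Where the table's upload times come from: repo size in bits over a 50 Mbps link.
LINK_MBPS = 50

def upload_minutes(size_gb: float) -> int:
    return int(size_gb * 8 * 1000 / LINK_MBPS / 60)   # GB -> megabits -> seconds -> minutes

print(upload_minutes(191))   # 509 minutes (original)
print(upload_minutes(97))    # 258 minutes (Xet-backed)
```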
Similarly, local chunk caching significantly accelerates downloads. If a file is modified, or a new quantization is added that overlaps heavily with the local chunk cache, you won't have to re-download any chunks that are unchanged. This contrasts with the file-based approach, where the entirety of the new or updated file must be downloaded.
Taken together, this demonstrates how local chunk-level deduplication paired with block-level aggregation dramatically streamlines not only storage, but development on the Hub. With this level of efficiency in file transfers, AI builders can move faster, iterate quickly, and worry less about hitting infrastructure bottlenecks. For anyone pushing large files to the Hub (whether a new model quantization or an updated version of a training set), this helps you shift your focus to building and sharing, rather than waiting and troubleshooting.
We're hard at work rolling out the first Xet-backed repositories in the coming weeks and months! As we do, we'll be releasing more updates to bring these speeds to every builder on the Hub and make file transfers feel invisible.
Follow us on the Hub to learn more about our progress!
