Hugging Face stores over 30 PB of models, datasets, and spaces in Git LFS repositories. Because Git stores and versions at the file level, any change to a file requires re-uploading the entire asset. These are expensive operations when the average Parquet and CSV files on the Hub range between 200-300 MB, the average Safetensors file is around 1 GB, and GGUF files can exceed 8 GB. Imagine modifying a single line of metadata in a GGUF file and waiting for the multi-gigabyte file to upload; on top of the user's time and transfer costs, Git LFS also has to store a full copy of every version of the file, bloating storage costs.
The plot below illustrates the growth of LFS storage across model, dataset, and space repositories on the Hub between March 2022 and September 2024:
Hugging Face’s Xet team is taking a different approach to storage: storing files as chunks. By transferring only the modified chunks, we can dramatically improve both storage efficiency and iteration speed while ensuring reliable access to evolving datasets and models. Here’s how it works.
Content-Defined Chunking Foundations
The approach we use to chunk files is known as content-defined chunking (CDC). Instead of treating a file as an indivisible unit, CDC breaks it down into variable-sized chunks, using the data itself to define the boundaries. To compute the chunks, we apply a rolling hash algorithm that scans the file’s byte sequence.
Consider a file with the contents:
transformerstransformerstransformers
We’re using text for illustration, but this could be any sequence of bytes.
A rolling hash algorithm computes a hash over a sliding window of data. In this case, with a window of length 4, the hash would be computed first over tran, then rans, then ansf, and so on until the end of the file.
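To make the sliding window concrete, here is a minimal sketch using a classic polynomial rolling hash (Rabin-Karp style) for illustration; the hash function used by our chunker is different, but the window mechanics are the same:

```python
# Illustrative polynomial rolling hash; not the hash used in production.
BASE, MOD, WINDOW = 257, (1 << 61) - 1, 4

def rolling_hashes(data: bytes):
    """Yield (window, hash) for every WINDOW-byte window, updating the hash
    in O(1) per byte instead of rehashing each window from scratch."""
    h = 0
    for b in data[:WINDOW]:               # hash of the first window
        h = (h * BASE + b) % MOD
    yield data[:WINDOW], h
    top = pow(BASE, WINDOW - 1, MOD)      # weight of the byte leaving the window
    for i in range(WINDOW, len(data)):
        h = ((h - data[i - WINDOW] * top) * BASE + data[i]) % MOD
        yield data[i - WINDOW + 1:i + 1], h

for window, h in rolling_hashes(b"transformerstransformerstransformers"):
    print(window, hex(h))                 # b'tran', b'rans', b'ansf', ...
```

The constant-time update is what makes it practical to evaluate a boundary condition at every byte offset of a multi-gigabyte file.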
Chunk boundaries are determined when a hash satisfies a predefined condition, such as:
hash(data) % 2^12 == 0
If the sequence mers produces a hash that meets this condition, the file will be split into three chunks:
transformers | transformers | transformers
The contents of these chunks are hashed to create a mapping between chunk hashes and bytes, and will eventually be stored in a content-addressed store (CAS). Since all three chunks are identical, we only store one chunk in the CAS for built-in deduplication. 🪄
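To make the mechanics concrete, here is a minimal, hypothetical sketch of that pipeline: a content-defined chunker feeding a content-addressed store. The window size, boundary mask, and hash functions are stand-ins chosen for brevity (a production chunker uses a true rolling hash plus minimum and maximum chunk sizes), and an input this small will usually come out as a single chunk; the point is the structure: boundaries come from the content, and the CAS stores each distinct chunk exactly once.

```python
import hashlib

# Toy content-defined chunker + content-addressed store (CAS).
# Parameters and hash functions are illustrative placeholders.
WINDOW = 4
MASK = (1 << 12) - 1                      # boundary: window hash % 2^12 == 0

def window_hash(window: bytes) -> int:
    # A real chunker updates a rolling hash in O(1) per byte; rehashing the
    # whole window keeps this sketch short.
    return int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")

def chunk(data: bytes) -> list[bytes]:
    """Split data into variable-sized chunks at content-defined boundaries."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if window_hash(data[i - WINDOW:i]) & MASK == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])       # trailing chunk, if any
    return chunks

cas: dict[str, bytes] = {}                # chunk hash -> chunk bytes

def store(data: bytes) -> list[str]:
    """Store a file as a list of chunk hashes; identical chunks are kept once."""
    refs = []
    for c in chunk(data):
        h = hashlib.sha256(c).hexdigest()
        cas.setdefault(h, c)              # no cost if the chunk is already known
        refs.append(h)
    return refs

refs = store(b"transformers" * 3)
print(f"{len(refs)} chunk reference(s), {len(cas)} unique chunk(s) in the CAS")
```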
Insertions and Deletions
When the contents of a file change, CDC allows for fine-grained updates that make it robust to insertions and deletions. Let’s modify the file by inserting super, making the new file contents:
transformerstransformerssupertransformers
After applying the rolling hash again with the same boundary condition, the new chunks look like this:
transformers | transformers | supertransformers
We don’t need to save chunks we’ve seen before; they’re already stored. However, supertransformers is a new chunk. Thus, the only cost of saving the updated version of this file is uploading and storing one new chunk.
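In code terms, deciding what to upload reduces to a set difference over chunk hashes. A minimal sketch is below; the helper name and the placeholder strings standing in for real chunk hashes are hypothetical, and in practice the new version is compared against every chunk already in the CAS, not only the file's previous version.

```python
def chunks_to_upload(known_hashes: set[str], new_version: list[str]) -> set[str]:
    """Return the chunk hashes of the new version that are not already stored."""
    return set(new_version) - known_hashes

# v1: transformers | transformers | transformers
# v2: transformers | transformers | supertransformers
known = {"hash(transformers)"}
v2 = ["hash(transformers)", "hash(transformers)", "hash(supertransformers)"]
print(chunks_to_upload(known, v2))        # {'hash(supertransformers)'}
```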
To validate this optimization in the real world, we benchmarked our previous implementation of CDC-backed storage at XetHub against Git LFS and found a consistent 50% improvement in storage and transfer performance across three iterative development use cases. One example was the CORD-19 dataset, a collection of COVID-19 research papers curated between 2020 and 2022 across 50 incremental updates. The comparison between the Xet-backed and Git LFS-backed repositories is summarized below:
| Metric | Git LFS-backed Repository | Xet-backed Repository |
|---|---|---|
| Average Download Time | 51 minutes | 19 minutes |
| Average Upload Time | 47 minutes | 24 minutes |
| Storage Used | 8.9 GB | 3.52 GB |
By transferring and saving only the modified chunks, the Xet-backed repository using CDC (alongside other techniques to improve compression and streamline network requests) showed significantly faster upload/download times and drastically cut the amount of storage required to capture all versions of the dataset. Curious to learn more? Read the full benchmark.
What CDC means for the Hub
How would CDC work on the kinds of files stored on the Hugging Face Hub? We put together a simple deduplication estimator to visualize the potential storage savings of applying CDC to a set of files. Running this tool on two versions of the model.safetensors file in openai-community/gpt2, uploaded over the course of the repository’s commit history, returned the following result:
The green shading reflects the significant overlap between the two versions, and thus an opportunity to deduplicate both within each file and across the versions.
| | Git LFS Storage Required | Xet-backed Storage Required |
|---|---|---|
| Version 1 | 664 MB | 509 MB |
| Version 2 | 548 MB | 136 MB |
| Total | 1.2 GB | 645 MB |
In this case, using our Xet-based storage backend would save considerable upload/download time for the second version, as well as reduce the total storage footprint by 53%. With compression, we estimate an additional 10% in savings.
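For intuition, here is a rough, hypothetical sketch of how such an estimator can be put together: chunk both versions of a file and compare the bytes that file-level storage would keep against the bytes that chunk-level deduplication would keep. The parameters are illustrative, it rehashes every window instead of using a constant-time rolling hash (so it is only practical for small files), and it is not the tool that produced the numbers above.

```python
import hashlib
import sys

WINDOW, MASK = 64, (1 << 16) - 1          # illustrative: ~64 KiB average chunks

def chunk_sizes(data: bytes) -> dict[str, int]:
    """Map chunk hash -> chunk size for the content-defined chunks of `data`."""
    sizes, start = {}, 0
    for i in range(WINDOW, len(data) + 1):
        window = data[i - WINDOW:i]
        h = int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")
        if h & MASK == 0:
            c = data[start:i]
            sizes[hashlib.sha256(c).hexdigest()] = len(c)
            start = i
    if start < len(data):
        c = data[start:]
        sizes[hashlib.sha256(c).hexdigest()] = len(c)
    return sizes

def estimate(path_v1: str, path_v2: str) -> None:
    d1, d2 = open(path_v1, "rb").read(), open(path_v2, "rb").read()
    file_level = len(d1) + len(d2)                           # file-level storage
    unique_chunks = {**chunk_sizes(d1), **chunk_sizes(d2)}   # each chunk counted once
    chunk_level = sum(unique_chunks.values())
    saved = 100 * (1 - chunk_level / file_level)
    print(f"file-level: {file_level / 1e6:.1f} MB, "
          f"chunk-level: {chunk_level / 1e6:.1f} MB ({saved:.0f}% saved)")

if __name__ == "__main__":
    estimate(sys.argv[1], sys.argv[2])
```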
Our initial research into repositories across the Hub shows promising results for some fine-tuned models and many model checkpoints. Fine-tuned models modify only a subset of parameters, so most of the model remains unchanged across versions, making them a great candidate for deduplication. Model checkpoints, which capture incremental training states, are also good targets, as the changes between checkpoints are often minimal. Both show deduplication ratios in the range of 30-85%. PyTorch model checkpoints make up around 200 TB of total storage on the Hub. At 50% deduplication, we could save up to 100 TB of storage immediately and roughly 7-8 TB every month going forward.
Beyond reducing storage costs, chunk-level deduplication also improves upload/download speeds, since only the modified chunks are transferred. This is a significant benefit for teams working with multiple versions of models or datasets, as it minimizes waiting time for both people and machines.
Our team is currently working through our POC of Xet-backed storage for the Hub and hopes to roll out some Xet-backed repositories in early 2025. Follow us to learn more as we share our learnings on future topics like scaling CDC across globally distributed repositories, balancing network performance and privacy boundaries, and parallelizing our chunking algorithm.
