Hugging Face stores over 30 PB of models, datasets, and spaces in Git LFS repositories. Because Git stores and versions at the file level, any change to a file requires re-uploading the entire asset. These are expensive operations when the average Parquet and CSV files on the Hub range between 200-300 MB, the average Safetensors file is around 1 GB, and GGUF files can exceed 8 GB. Imagine modifying a single line of metadata in a GGUF file and waiting for the multi-gigabyte file to upload; on top of the user's time and transfer costs, Git LFS also has to store a full copy of every version of the file, bloating storage costs.
The plot below illustrates the growth of LFS storage across model, dataset, and space repositories on the Hub between March 2022 and September 2024:
Hugging Face’s Xet team is taking a different approach to storage: storing files as chunks. By transferring only the modified chunks, we can dramatically improve both storage efficiency and iteration speed while ensuring reliable access to evolving datasets and models. Here’s how it works.
Content-Defined Chunking Foundations
The approach we use to chunk files is known as content-defined chunking (CDC). Instead of treating a file as an indivisible unit, CDC breaks it down into variable-sized chunks, using the data itself to define the boundaries. To compute the chunks, we apply a rolling hash algorithm that scans the file’s byte sequence.
Consider a file with the contents:
transformerstransformerstransformers
We’re using text for illustration, but this could be any sequence of bytes.
A rolling hash algorithm computes a hash over a sliding window of data. In this case, with a window of length 4, the hash would be computed first over tran, then rans, then ansf, and so on until the end of the file.
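To make the sliding window concrete, here is a minimal sketch using a classic polynomial rolling hash (Rabin-Karp style) for illustration; the hash function used by our chunker is different, but the window mechanics are the same:

```python
# Illustrative polynomial rolling hash; not the hash used in production.
BASE, MOD, WINDOW = 257, (1 << 61) - 1, 4

def rolling_hashes(data: bytes):
    """Yield (window, hash) for every WINDOW-byte window, updating the hash
    in O(1) per byte instead of rehashing each window from scratch."""
    h = 0
    for b in data[:WINDOW]:               # hash of the first window
        h = (h * BASE + b) % MOD
    yield data[:WINDOW], h
    top = pow(BASE, WINDOW - 1, MOD)      # weight of the byte leaving the window
    for i in range(WINDOW, len(data)):
        h = ((h - data[i - WINDOW] * top) * BASE + data[i]) % MOD
        yield data[i - WINDOW + 1:i + 1], h

for window, h in rolling_hashes(b"transformerstransformerstransformers"):
    print(window, hex(h))                 # b'tran', b'rans', b'ansf', ...
```

The constant-time update is what makes it practical to evaluate a boundary condition at every byte offset of a multi-gigabyte file.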
Chunk boundaries are determined when a hash satisfies a predefined condition, such as:
hash(data) % 2^12 == 0
If the sequence mers produces a hash that meets this condition, the file will be split into three chunks:
transformers | transformers | transformers
The contents of these chunks are hashed to create a mapping between chunk hashes and bytes, and will eventually be stored in a content-addressed store (CAS). Since all three chunks are identical, we only store one chunk in the CAS for built-in deduplication. 🪄
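To make the mechanics concrete, here is a minimal, hypothetical sketch of that pipeline: a content-defined chunker feeding a content-addressed store. The window size, boundary mask, and hash functions are stand-ins chosen for brevity (a production chunker uses a true rolling hash plus minimum and maximum chunk sizes), and an input this small will usually come out as a single chunk; the point is the structure: boundaries come from the content, and the CAS stores each distinct chunk exactly once.

```python
import hashlib

# Toy content-defined chunker + content-addressed store (CAS).
# Parameters and hash functions are illustrative placeholders.
WINDOW = 4
MASK = (1 << 12) - 1                      # boundary: window hash % 2^12 == 0

def window_hash(window: bytes) -> int:
    # A real chunker updates a rolling hash in O(1) per byte; rehashing the
    # whole window keeps this sketch short.
    return int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")

def chunk(data: bytes) -> list[bytes]:
    """Split data into variable-sized chunks at content-defined boundaries."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if window_hash(data[i - WINDOW:i]) & MASK == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])       # trailing chunk, if any
    return chunks

cas: dict[str, bytes] = {}                # chunk hash -> chunk bytes

def store(data: bytes) -> list[str]:
    """Store a file as a list of chunk hashes; identical chunks are kept once."""
    refs = []
    for c in chunk(data):
        h = hashlib.sha256(c).hexdigest()
        cas.setdefault(h, c)              # no cost if the chunk is already known
        refs.append(h)
    return refs

refs = store(b"transformers" * 3)
print(f"{len(refs)} chunk reference(s), {len(cas)} unique chunk(s) in the CAS")
```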
Insertions and Deletions
When the contents of a file change, CDC allows for fine-grained updates that make it robust to insertions and deletions. Let’s modify the file by inserting super, making the new file contents:
transformerstransformerssupertransformers
After applying the rolling hash again with the same boundary condition, the new chunks look like this:
transformers | transformers | supertransformers
We don’t need to save chunks we’ve seen before; they’re already stored. However, supertransformers is a new chunk. Thus, the only cost of saving the updated version of this file is uploading and storing one new chunk.
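In code terms, deciding what to upload reduces to a set difference over chunk hashes. A minimal sketch is below; the helper name and the placeholder strings standing in for real chunk hashes are hypothetical, and in practice the new version is compared against every chunk already in the CAS, not only the file's previous version.

```python
def chunks_to_upload(known_hashes: set[str], new_version: list[str]) -> set[str]:
    """Return the chunk hashes of the new version that are not already stored."""
    return set(new_version) - known_hashes

# v1: transformers | transformers | transformers
# v2: transformers | transformers | supertransformers
known = {"hash(transformers)"}
v2 = ["hash(transformers)", "hash(transformers)", "hash(supertransformers)"]
print(chunks_to_upload(known, v2))        # {'hash(supertransformers)'}
```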
To validate this optimization in the real world, we benchmarked our previous implementation of CDC-backed storage at XetHub against Git LFS and found a consistent 50% improvement in storage and transfer performance across three iterative development use cases. One example was the CORD-19 dataset, a collection of COVID-19 research papers curated between 2020 and 2022 across 50 incremental updates. The comparison between the Xet-backed and Git LFS-backed repositories is summarized below:
| Metric | Git LFS-backed Repository | Xet-backed Repository |
|---|---|---|
| Average Download Time | 51 minutes | 19 minutes |
| Average Upload Time | 47 minutes | 24 minutes |
| Storage Used | 8.9 GB | 3.52 GB |
By transferring and saving only the modified chunks, the Xet-backed repository using CDC (alongside other techniques to improve compression and streamline network requests) showed significantly faster upload/download times and drastically cut the amount of storage required to capture all versions of the dataset. Curious to learn more? Read the full benchmark.
What CDC means for the Hub
How would CDC work on the kinds of files stored on the Hugging Face Hub? We put together a simple deduplication estimator to visualize the potential storage savings of applying CDC to a set of files. Running this tool on two versions of the model.safetensors file in openai-community/gpt2, uploaded over the course of the repository’s commit history, returned the following result:
The green shading reflects the significant overlap between the two versions, and thus an opportunity to deduplicate both within each file and across the versions.
| | Git LFS Storage Required | Xet-backed Storage Required |
|---|---|---|
| Version 1 | 664 MB | 509 MB |
| Version 2 | 548 MB | 136 MB |
| Total | 1.2 GB | 645 MB |
In this case, using our Xet-based storage backend would save considerable upload/download time for the second version, as well as reduce the total storage footprint by 53%. With compression, we estimate an additional 10% in savings.
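For intuition, here is a rough, hypothetical sketch of how such an estimator can be put together: chunk both versions of a file and compare the bytes that file-level storage would keep against the bytes that chunk-level deduplication would keep. The parameters are illustrative, it rehashes every window instead of using a constant-time rolling hash (so it is only practical for small files), and it is not the tool that produced the numbers above.

```python
import hashlib
import sys

WINDOW, MASK = 64, (1 << 16) - 1          # illustrative: ~64 KiB average chunks

def chunk_sizes(data: bytes) -> dict[str, int]:
    """Map chunk hash -> chunk size for the content-defined chunks of `data`."""
    sizes, start = {}, 0
    for i in range(WINDOW, len(data) + 1):
        window = data[i - WINDOW:i]
        h = int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")
        if h & MASK == 0:
            c = data[start:i]
            sizes[hashlib.sha256(c).hexdigest()] = len(c)
            start = i
    if start < len(data):
        c = data[start:]
        sizes[hashlib.sha256(c).hexdigest()] = len(c)
    return sizes

def estimate(path_v1: str, path_v2: str) -> None:
    d1, d2 = open(path_v1, "rb").read(), open(path_v2, "rb").read()
    file_level = len(d1) + len(d2)                           # file-level storage
    unique_chunks = {**chunk_sizes(d1), **chunk_sizes(d2)}   # each chunk counted once
    chunk_level = sum(unique_chunks.values())
    saved = 100 * (1 - chunk_level / file_level)
    print(f"file-level: {file_level / 1e6:.1f} MB, "
          f"chunk-level: {chunk_level / 1e6:.1f} MB ({saved:.0f}% saved)")

if __name__ == "__main__":
    estimate(sys.argv[1], sys.argv[2])
```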
Our initial research into repositories across the Hub shows promising results for some fine-tuned models and many model checkpoints. Fine-tuned models modify only a subset of parameters, so most of the model remains unchanged across versions, making them a great candidate for deduplication. Model checkpoints, which capture incremental training states, are also good targets, as the changes between checkpoints are often minimal. Both show deduplication ratios in the range of 30-85%. PyTorch model checkpoints make up around 200 TB of total storage on the Hub. At 50% deduplication, we could save up to 100 TB of storage immediately and roughly 7-8 TB every month going forward.
Beyond reducing storage costs, chunk-level deduplication also improves upload/download speeds, since only the modified chunks are transferred. This is a significant benefit for teams working with multiple versions of models or datasets, as it minimizes waiting time for both people and machines.
Our team is currently working through our POC of Xet-backed storage for the Hub and hopes to roll out some Xet-backed repositories in early 2025. Follow us to learn more as we share our learnings on future topics like scaling CDC across globally distributed repositories, balancing network performance and privacy boundaries, and parallelizing our chunking algorithm.
