Migrating the Hub from Git LFS to Xet




In January of this year, Hugging Face’s Xet team deployed a new storage backend and shortly after shifted ~6% of Hub downloads through the infrastructure. That was a significant milestone, but it was just the start. In the six months since, 500,000 repositories holding 20 PB have joined the move to Xet as the Hub outgrows Git LFS and transitions to a storage system that scales with the workloads of AI builders.

Today, more than 1 million people on the Hub are using Xet. In May, it became the default on the Hub for new users and organizations. With only a few dozen GitHub issues, forum threads, and Discord messages, this may be the quietest migration of this magnitude.

How? For one, the team came prepared with years of experience building and supporting the content-addressed store (CAS) and Rust client that provide the system’s foundation. Without these pieces, Git LFS might still be the future on the Hub. Still, the unsung heroes of this migration are:

  1. An integral piece of infrastructure known internally as the Git LFS Bridge
  2. Background content migrations that run around the clock

Together, these components have allowed us to aggressively migrate PBs in the span of days without worrying about the impact on the Hub or the community. They’re giving us the confidence to move even faster in the coming weeks and months (skip to the end 👇 to see what’s coming).



Bridges and backward compatibility

In the early days of planning the migration to Xet, we made a few key design decisions:

  • There would be no “hard cut-over” from Git LFS to Xet
  • A Xet-enabled repository should be able to contain both Xet and LFS files
  • Repository migrations from LFS to Xet don’t require “locks”; that is, they can run in the background without disrupting downloads or uploads

Driven by our commitment to the community, these seemingly straightforward decisions had significant implications. Most importantly, we didn’t believe users and teams should have to immediately alter their workflow or download a new client to interact with Xet-enabled repositories.

If you have a Xet-aware client (e.g., hf-xet, the Xet integration with huggingface_hub), uploads and downloads go through the full Xet stack. The client either breaks files into chunks using content-defined chunking while uploading, or requests file reconstruction information when downloading. On upload, chunks are passed to the CAS and stored in S3. During downloads, the CAS provides the chunk ranges the client needs to request from S3 to reconstruct the file locally.
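To make content-defined chunking concrete, here is a toy sketch of the idea: cut points are chosen by a hash of the content rather than fixed offsets, so an edit early in a file only disturbs nearby chunks while the rest keep deduplicating. This is illustrative Python, not the gear-hash implementation in xet-core, and every constant here is made up:

```python
# Toy content-defined chunking: cut wherever a rolling hash of the bytes seen
# so far hits a boundary condition. Real implementations (e.g. xet-core) use a
# gear hash and much larger target chunk sizes; these constants are illustrative.
MASK = 0xFF          # boundary when (hash & MASK) == 0 → ~256-byte average chunks
MIN_CHUNK = 64       # never cut before this many bytes
MAX_CHUNK = 1024     # always cut by this many bytes

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks covering data."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF   # cheap toy hash, not a real gear hash
        length = i - start + 1
        if ((h & MASK) == 0 and length >= MIN_CHUNK) or length >= MAX_CHUNK:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)               # trailing partial chunk

data = bytes(range(256)) * 16                # 4 KB of sample content
chunks = list(chunk_boundaries(data))
# The chunks always reassemble losslessly into the original bytes
assert b"".join(data[s:e] for s, e in chunks) == data
```

The key property is that boundaries depend only on local content, which is what lets identical regions of two different files map to identical chunks in the CAS.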

For older versions of huggingface_hub or huggingface.js, which don’t support chunk-based file transfers, you can still download and upload to Xet repos, but these bytes take a different route. When a Xet-backed file is requested from the Hub via the resolve endpoint, the Git LFS Bridge constructs and returns a single presigned URL, mimicking the LFS protocol. The Bridge then does the work of reconstructing the file from the content held in S3 and returning it to the requester.

Git LFS Bridge flow
Greatly simplified view of the Git LFS Bridge – in reality this path includes a few more API calls and components like the CDN fronting the Bridge, DynamoDB for file metadata, and S3 itself.

To see this in action, right-click the image above and open it in a new tab. The URL redirects from
https://huggingface.co/datasets/huggingface/documentation-images/resolve/essential/blog/migrating-the-hub-to-xet/bridge.png to one that begins with https://cas-bridge.xethub.hf.co/xet-bridge-us/.... You can also use curl -vL on the same URL to see the redirects in your terminal.

Meanwhile, when a non-Xet-aware client uploads a file, it is sent first to LFS storage and then migrated to Xet. This “background migration process,” only briefly mentioned in our docs, powers both the migrations to Xet and upload backward compatibility. It’s behind the migration of well over a dozen PBs of models and datasets and is keeping 500,000 repos in sync with Xet storage, all without missing a beat.

Each time a file needs to be migrated from LFS to Xet, a webhook is triggered, pushing the event to a distributed queue where it’s processed by an orchestrator. The orchestrator:

  • Enables Xet on the repo if the event calls for it
  • Fetches a list of LFS revisions for each LFS file in the repo
  • Batches the files into jobs based on size or number of files: either 1,000 files or 500 MB, whichever comes first
  • Places the jobs on another queue for migration worker pods
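The batching rule above (1,000 files or 500 MB, whichever comes first) can be sketched in a few lines of Python; batch_files and its inputs are hypothetical, not the actual orchestrator code:

```python
# Sketch of the orchestrator's batching rule: close a job once adding the next
# file would exceed 500 MB, or once it already holds 1,000 files. Names and
# structure here are illustrative, not the Hub's internals.
MAX_FILES = 1000
MAX_BYTES = 500 * 1024 * 1024  # 500 MB

def batch_files(files):
    """files: iterable of (path, size_bytes) pairs → list of jobs (lists of paths)."""
    jobs, current, current_bytes = [], [], 0
    for path, size in files:
        if current and (len(current) >= MAX_FILES or current_bytes + size > MAX_BYTES):
            jobs.append(current)           # close the current job
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        jobs.append(current)               # flush the final partial job
    return jobs

# 2,500 small files split purely on the 1,000-file limit
jobs = batch_files((f"file_{i}", 1024) for i in range(2500))
assert [len(j) for j in jobs] == [1000, 1000, 500]
```

Capping jobs on both dimensions keeps workers busy with predictable units of work: many tiny files don’t create one enormous job, and a handful of multi-gigabyte files don’t blow past the size budget.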

These migration workers then pick up the jobs, and each pod:

  • Downloads the LFS files listed in the batch
  • Uploads the LFS files to the Xet content-addressed store using xet-core
Migration flow
Migration flow triggered by a webhook event; starting at the orchestrator for brevity.
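The worker loop can be sketched as a simple queue consumer; download_lfs and upload_to_cas below are hypothetical stand-ins for the real LFS and xet-core clients:

```python
# Sketch of a migration worker: pull a job (batch of LFS file paths) off the
# queue, download each file from LFS storage, then push it into Xet storage.
# The two callables are stand-ins for the real LFS and xet-core clients.
from queue import Queue

def run_worker(job_queue: Queue, download_lfs, upload_to_cas):
    migrated = []
    while not job_queue.empty():
        job = job_queue.get()             # one batch of LFS file paths
        for path in job:
            data = download_lfs(path)     # fetch bytes from LFS storage
            upload_to_cas(path, data)     # chunk, dedupe, and store via xet-core
            migrated.append(path)
        job_queue.task_done()
    return migrated

# Tiny in-memory demo with dicts standing in for the two storage backends
lfs = {"model.bin": b"weights", "tokenizer.json": b"{}"}
cas = {}
q = Queue()
q.put(list(lfs))
done = run_worker(q, lfs.__getitem__, cas.__setitem__)
assert set(done) == set(lfs) and cas == lfs
```

Because each job is independent, pods can be scaled horizontally and a failed job can simply be re-queued without coordinating with other workers.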



Scaling migrations

In April, we tested the system’s limits by reaching out to bartowski and asking if they wanted to try out Xet. With nearly 500 TB across 2,000 repos, bartowski’s migration uncovered a few weak links:

  • Temporary shard files for global dedupe were first written to /tmp and then moved into the shard cache. On our worker pods, however, /tmp and the Xet cache sat on different mount points. The move failed, and the shard files were never removed. Eventually the disk filled, triggering a wave of No space left on device errors.
  • After supporting the launch of Llama 4, we’d scaled CAS for bursty downloads, but the migration workers flipped the script as hundreds of multi-gigabyte uploads pushed CAS beyond its resources
  • On paper, the migration workers were capable of significantly more throughput than we were seeing; profiling the pods revealed network and EBS I/O bottlenecks

Fixing this three-headed monster meant touching every layer – patching xet-core, resizing CAS, and beefing up the worker node specs. Fortunately, bartowski was game to work with us while every repo made its way to Xet. These same lessons powered the moves of the largest storage users on the Hub, like RichardErkhov (1.7 PB and 25,000 repos) and mradermacher (6.1 PB and 42,000 repos 🤯).

CAS throughput, meanwhile, has grown by an order of magnitude between the first and latest large-scale migrations:

  • Bartowski migration: CAS sustained ~35 Gb/s, with ~5 Gb/s coming from regular Hub traffic.
  • mradermacher and RichardErkhov migrations: CAS peaked around ~300 Gb/s, while still serving ~40 Gb/s of everyday load.
CAS throughput
CAS throughput; each spike corresponds to a large migration, with baseline throughput steadily increasing to just shy of 100 Gb/s as of July 2025.



Zero friction, faster transfers

When we began replacing LFS, we had two goals in mind:

  1. Do no harm
  2. Drive the most impact as fast as possible

Designing with our initial constraints and these goals allowed us to:

  • Introduce and harden hf-xet before including it in huggingface_hub as a required dependency
  • Support the community uploading to and downloading from Xet-enabled repos through whatever means they use today while our infrastructure handles the rest
  • Learn valuable lessons – from scale to how our client operated on distributed file systems – from incrementally migrating the Hub to Xet

Instead of waiting for all upload paths to become Xet-aware, forcing a hard cut-over, or pushing the community to adopt a specific workflow, we could begin migrating the Hub to Xet immediately with minimal user impact. In short: let teams keep their workflows and organically transition to Xet, with infrastructure supporting the long-term goal of a unified storage system.



Xet for everyone

In January and February, we onboarded power users to provide feedback and pressure-test the infrastructure. To get community feedback, we launched a waitlist to preview Xet-enabled repositories. Soon after, Xet became the default for new users on the Hub.

We now support some of the largest creators on the Hub (Meta Llama, Google, OpenAI, and Qwen) while the community keeps working uninterrupted.

What’s next?

Starting this month, we’re bringing Xet to everyone. Watch for an email providing access to Xet, and once you have it, update to the latest huggingface_hub (pip install -U huggingface_hub) to unlock faster transfers instantly. This will also mean:

  • All of your existing repositories will migrate from LFS to Xet
  • All newly created repos will be Xet-enabled by default

If you upload to or download from the Hub using your browser or Git, that’s fine too. Chunk-based support for both is coming soon. In the meantime, use whichever workflow you already have; no restrictions.

Next up: open-sourcing the Xet protocol and the entire infrastructure stack. The future of storing and moving bytes at a scale fit for AI workloads is on the Hub, and we’re aiming to bring it to everyone.

If you have any questions, drop us a line in the comments 👇 or open a discussion on the Xet team page.


