Introducing Storage Buckets on the Hugging Face Hub


Hugging Face Models and Datasets repos are great for publishing final artifacts. But production ML generates a continuous stream of intermediate files (checkpoints, optimizer states, processed shards, logs, traces, etc.) that change often, arrive from many sources at once, and rarely need version control.

Storage Buckets are built exactly for this: mutable, S3-like object storage you can browse on the Hub, script from Python, or manage with the hf CLI. And since they’re backed by Xet, they’re especially efficient for ML artifacts that share content across files.



Why we built Buckets

Git starts to feel like the wrong abstraction pretty quickly if you’re dealing with:

  • Training clusters writing checkpoints and optimizer states throughout a run
  • Data pipelines processing raw datasets iteratively
  • Agents storing traces, memory, and shared knowledge graphs

The storage need in all these cases is similar: write fast, overwrite when needed, sync directories, remove stale files, and keep things moving.

A Bucket is a non-versioned storage container on the Hub. It lives under a user or organization namespace, has standard Hugging Face permissions, can be private or public, has a page you can open in your browser, and can be addressed programmatically with a handle like hf://buckets/username/my-training-bucket.
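To make the handle format concrete, here is a tiny, hypothetical parser for hf:// bucket URLs. huggingface_hub has its own resolution logic; this sketch only illustrates the namespace/bucket/path layout of the handle:

```python
from urllib.parse import urlsplit

def parse_bucket_url(url: str):
    # Split an hf:// bucket URL into (namespace, bucket, path-within-bucket).
    # Illustrative only; the real client does its own URL resolution.
    parts = urlsplit(url)
    assert parts.scheme == "hf" and parts.netloc == "buckets"
    namespace, bucket, *rest = parts.path.lstrip("/").split("/")
    return namespace, bucket, "/".join(rest)

print(parse_bucket_url("hf://buckets/username/my-training-bucket/checkpoints"))
# ('username', 'my-training-bucket', 'checkpoints')
```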



Why Xet matters

Buckets are built on Xet, Hugging Face’s chunk-based storage backend, and this matters more than it might sound.

Instead of treating files as monolithic blobs, Xet breaks content into chunks and deduplicates across them. Upload a processed dataset that’s mostly similar to the raw one? Many chunks already exist. Store successive checkpoints where large parts of the model are frozen? Same story. Buckets skip the bytes that are already there, which means less bandwidth, faster transfers, and more efficient storage.

This is a natural fit for ML workloads. Training pipelines constantly produce families of related artifacts: raw and processed data, successive checkpoints, Agent traces and derived summaries. Xet is designed to take advantage of that overlap.
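As a rough illustration of why chunk-level deduplication pays off for families of related artifacts, here is a toy sketch. Note the simplifications: Xet uses content-defined chunk boundaries and its own hashing, while this sketch splits at fixed offsets purely to show the counting:

```python
import hashlib

def chunks(data: bytes, size: int = 4):
    # Split a blob into fixed-size chunks. (Xet uses content-defined
    # boundaries; fixed-size splitting here just illustrates the idea.)
    return [data[i:i + size] for i in range(0, len(data), size)]

def dedup_stats(blobs):
    # Count total vs. unique chunks across a family of related artifacts.
    seen = set()
    total = 0
    for blob in blobs:
        for c in chunks(blob):
            total += 1
            seen.add(hashlib.sha256(c).hexdigest())
    return total, len(seen)

# Two "checkpoints" that share most of their content, like successive
# saves of a model where large parts are frozen.
ckpt_1 = b"AAAABBBBCCCCDDDD"
ckpt_2 = b"AAAABBBBCCCCEEEE"  # only the last chunk changed

total, unique = dedup_stats([ckpt_1, ckpt_2])
print(total, unique)  # 8 chunks in total, but only 5 unique ones to store
```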

For Enterprise customers, billing is based on deduplicated storage, so shared chunks directly reduce the billed footprint. Deduplication helps with both speed and cost.



Pre-warming: bringing data near compute

Buckets live on the Hub, which means global storage by default. But not every workload can afford to pull data from wherever it happens to live: for distributed training and large-scale pipelines, storage location directly affects throughput.

Pre-warming lets you bring hot data closer to the cloud provider and region where your compute runs. Instead of data traveling across regions on every read, you declare where you want it and Buckets ensure it’s already there when your jobs start. This is especially useful for training clusters that need fast access to large datasets or checkpoints, and for multi-region setups where different parts of a pipeline run in different clouds.

We’re partnering with AWS and GCP to start, with more cloud providers coming in the future.



Getting started

You can get a bucket up and running in under two minutes with the hf CLI. First, install it and log in:

curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login

Create a bucket for your project:

hf buckets create my-training-bucket --private

Say your training job is writing checkpoints locally to ./checkpoints. Sync that directory into the Bucket:

hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints

For large transfers, it’s useful to see what will happen before anything moves. --dry-run prints the plan without executing anything:

hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints --dry-run

You can also save the plan to a file for review and apply it later:

hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints --plan sync-plan.jsonl
hf buckets sync --apply sync-plan.jsonl
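The actual plan format written by hf buckets sync --plan isn’t documented here, but the idea of a reviewable, replayable plan can be sketched as a diff between local and remote listings. Everything below (the field names, the JSONL shape, the size-only comparison) is an illustrative assumption, not the real format:

```python
import json

def make_plan(local_files, remote_files):
    # Diff a local listing ({path: size}) against the remote one and emit
    # one action per line, JSONL-style: upload new or changed files,
    # delete stale remote objects, skip unchanged paths.
    plan = []
    for path, size in local_files.items():
        if remote_files.get(path) != size:
            plan.append({"op": "upload", "path": path, "size": size})
    for path in remote_files:
        if path not in local_files:
            plan.append({"op": "delete", "path": path})
    return "\n".join(json.dumps(p) for p in plan)

local = {"checkpoints/step-100.pt": 512, "checkpoints/step-200.pt": 512}
remote = {"checkpoints/step-100.pt": 512, "checkpoints/old.pt": 256}

# step-200.pt is new (upload), old.pt is stale (delete),
# step-100.pt is unchanged (no action).
print(make_plan(local, remote))
```

Saving the plan to a file first means a human (or a CI check) can review exactly what will move before anything does.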

Once done, inspect the Bucket from the CLI:

hf buckets list username/my-training-bucket -h

or browse it directly on the Hub at https://huggingface.co/buckets/username/my-training-bucket.

That’s the whole loop. Create a bucket, sync your working data into it, check on it whenever you need to, and save the versioned repo for when something is worth publishing. For one-off operations, hf buckets cp copies individual files and hf buckets remove cleans up stale objects.



Using Buckets from Python

Everything above also works from Python via huggingface_hub (available since v1.5.0). The API follows the same pattern: create, sync, inspect.

from huggingface_hub import create_bucket, list_bucket_tree, sync_bucket

create_bucket("my-training-bucket", private=True, exist_ok=True)

sync_bucket(
    "./checkpoints",
    "hf://buckets/username/my-training-bucket/checkpoints",
)

for item in list_bucket_tree(
    "username/my-training-bucket",
    prefix="checkpoints",
    recursive=True,
):
    print(item.path, item.size)

This makes it straightforward to integrate Buckets into training scripts, data pipelines, or any service that manages artifacts programmatically. The Python client also supports batch uploads, selective downloads, deletes, and bucket moves for when you need finer control.
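For example, a training loop might sync checkpoints to a Bucket every N steps. The sketch below injects the sync function as a parameter so it runs without authentication; with the real client you would pass huggingface_hub.sync_bucket. The hook factory, its name, and its defaults are assumptions for illustration:

```python
def make_checkpoint_hook(sync_fn, every_n_steps=100,
                         local_dir="./checkpoints",
                         bucket_uri="hf://buckets/username/my-training-bucket/checkpoints"):
    # Return a callable to invoke once per training step; it syncs the
    # local checkpoint directory to the Bucket every `every_n_steps`.
    def hook(step):
        if step % every_n_steps == 0:
            sync_fn(local_dir, bucket_uri)
    return hook

# With the real client: make_checkpoint_hook(huggingface_hub.sync_bucket).
# Here we record calls instead of hitting the network.
calls = []
hook = make_checkpoint_hook(lambda src, dst: calls.append((src, dst)), every_n_steps=2)
for step in range(1, 5):
    hook(step)
print(len(calls))  # synced at steps 2 and 4
```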

Bucket support is also available in JavaScript via @huggingface/hub (since v2.10.5), so you can integrate Buckets into Node.js services and web applications as well.



Filesystem integration

Buckets also work through HfFileSystem, the fsspec-compatible filesystem in huggingface_hub. This means you can list, read, write, and glob Bucket contents using standard filesystem operations, and any library that supports fsspec can access Buckets directly.

from huggingface_hub import hffs

# List the files under a prefix
hffs.ls("buckets/username/my-training-bucket/checkpoints", detail=False)

# Glob across the whole bucket
hffs.glob("buckets/username/my-training-bucket/**/*.parquet")

# Read a file directly
with hffs.open("buckets/username/my-training-bucket/config.yaml", "r") as f:
    print(f.read())

Because fsspec is the usual Python interface for distant filesystems, libraries like pandas, Polars, and Dask can read from and write to Buckets using hf:// paths with no extra setup:

import pandas as pd

# Read a CSV straight out of a Bucket
df = pd.read_csv("hf://buckets/username/my-training-bucket/results.csv")

# Write results back to the same Bucket
df.to_csv("hf://buckets/username/my-training-bucket/summary.csv")

This makes it easy to plug Buckets into existing data workflows without changing how your code reads or writes files.



From Buckets to versioned repos

Buckets are the fast, mutable place where artifacts live while they’re still in motion. Once something becomes a stable deliverable, it often belongs to a versioned model or dataset repo.

On the roadmap, we plan to support direct transfers between Buckets and repos in both directions: promote final checkpoint weights into a model repo, or commit processed shards into a dataset repo once a pipeline completes. The working layer and the publishing layer stay separate, but fit into one continuous Hub-native workflow.



Trusted by launch partners

Before opening Buckets to everyone, we ran a private beta with a small group of launch partners.

A huge thanks to Jasper, Arcee, IBM, and PixAI for testing early versions, surfacing bugs, and sharing feedback that directly shaped this feature.



Conclusion and resources

Storage Buckets bring a missing storage layer to the Hub. They give you a Hub-native place for the mutable, high-throughput side of ML: checkpoints, processed data, Agent traces, logs, and everything else that is useful before it becomes final.

Because they’re built on Xet, Buckets aren’t just easier to use than forcing everything through Git; they are also more efficient for the kinds of related artifacts AI systems produce all the time. That means faster transfers, better deduplication, and, on Enterprise plans, billing that benefits from the deduplicated footprint.

If you already use the Hub, Buckets let you keep more of your workflow in one place. If you come from S3-style storage, they give you a familiar model with better alignment to AI artifacts and a clear path toward final publication on the Hub.

Buckets are included in existing Hub storage plans. Free accounts include storage to get started, and PRO and Enterprise plans offer higher limits. See the storage page for details.

Read more and try it yourself:


