Streaming datasets: 100x More Efficient



We boosted load_dataset('dataset', streaming=True): stream datasets without downloading them, with one line of code!

Start training on multi-TB datasets immediately, without complex setups, without downloading, and with no “disk out of space” or 429 “stop requesting!” errors.
It’s super fast! It outruns our local SSDs when training on 64xH100 with 256 workers downloading data.
We improved streaming: 100x fewer requests → 10x faster data file resolution → 2x samples/sec → 0 worker crashes at 256 concurrent workers.

Visualization of a dataset being streamed

Loading data, especially on the terabyte scale, is a serious pain in any machine learning workflow. We suffered this while training SmolLM3: at one point we had to wait 3 hours before each run to download enough data.

Streaming has always been possible in the datasets library, but large-scale training with massive datasets remained a challenge. That changes today 🔥. We spent a couple of months improving the backend, focusing on streaming datasets to make it faster and more efficient.

What did we do exactly? ⤵️



Streaming: The Same Easy API

First things first: our changes are backwards compatible. You can still stream any dataset from the Hub with the same simple streaming=True flag. It’s as easy as ever. 🚀

from datasets import load_dataset


dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

print(next(iter(dataset)))

Hundreds of AI developers around the world use datasets every day; they should simply get improved performance with zero extra work.



The Challenge: Streaming at Scale

Streaming was a lifesaver for quickly understanding a dataset, but to train models, people were often downloading the data locally, or using a cloud storage service such as S3. That is what we were doing for training SmolVLM: we had all of our data on S3 and were streaming directly from it.

We wanted to change that, so we decided to use streaming from the Hub while we were developing nanoVLM. We soon found a big issue: our test run generated over 100,000 requests in under a minute, which got our IP blocked by the Hub! 😅 This happened because every DataLoader worker was initializing the dataset independently. As we dug deeper, we found that this creates a storm of redundant requests, many of which are unnecessary. Our changes ultimately reduced startup requests by a factor of 100. In total, our improvements delivered:

  • Data file resolution time: 10x faster
  • Startup requests: Up to 100x more efficient
  • Streaming speed: Up to 2x faster
  • In-flight requests: Up to 2x more efficient



Under the Hood: What We Improved

So, what changed? We focused on two phases: startup and streaming.

1. Startup ⚡️
The initial resolution of data files was creating a ton of requests. We made two major changes:

  • Persistent Data Files Cache: We now cache the list of data files across all DataLoader workers. The first worker resolves the file list from the Hub; all other workers read directly from this local cache, virtually eliminating startup requests and slashing resolution time. No more request storms! (See the sketch after this list.)
  • Optimized Resolution Logic: We also minimized the number of API calls required for that initial worker to fetch the file list. We now bundle the necessary requests as efficiently as possible, reducing latency even further.
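To see what this means in practice, here is a minimal sketch of streaming the same dataset through a multi-worker torch DataLoader (the worker count, batch size, and collate function are illustrative, not part of the library changes); with the persistent cache, only the first worker resolves the file list from the Hub:

from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

# The first worker resolves the data files from the Hub; the other workers
# reuse the cached file list instead of sending their own requests.
dataloader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=16,
    collate_fn=lambda batch: batch,  # keep raw examples; a real run would tokenize/collate here
)

for batch in dataloader:
    ...  # training step
    break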

2. Streaming 🏎️
To improve throughput during streaming itself, we introduced two new features:

  • Prefetching for Parquet: We enabled prefetching for Parquet datasets. This means that while your model is processing the current chunk of data, the datasets library is already fetching the next chunk in the background. This keeps the data pipeline full and ensures your GPU is never left waiting for data.
  • Configurable Buffering: Advanced users can now fine-tune streaming performance for their specific hardware and network setup. We have exposed options to configure the buffer’s block size and the prefetch volume, giving you maximum control to optimize I/O.

This is how we can increase the minimum request size when streaming from 32 MiB (default) to 128 MiB and configure prefetching:

import pyarrow
import pyarrow.dataset
from datasets import load_dataset

# Raise the minimum request size from 32 MiB (default) to 128 MiB and
# prefetch the next block while the current one is being processed
fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
    cache_options=pyarrow.CacheOptions(
        prefetch_limit=1,
        range_size_limit=128 << 20,
    ),
)
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)

Together, these improvements can double your data throughput, allowing you to train faster and more efficiently.



How we are faster than plain S3: Xet

Hugging Face uses Xet: a deduplication-based storage system that enables fast, deduplicated uploads and downloads. Unlike traditional remote storage, data transfers are faster on Xet because duplicated data is only transferred once. For instance, uploading a large-scale dataset to Hugging Face leverages Xet, which accelerates uploads. Once the dataset is uploaded, it can be streamed instantly.

Deduplication for Parquet is enabled through Parquet Content Defined Chunking (CDC). Thanks to Parquet CDC and Xet deduplication, uploading datasets to Hugging Face is faster than to any traditional remote storage.
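As a concrete sketch of that workflow (the repo ids below are hypothetical), pushing a dataset with push_to_hub stores it on Xet-backed storage, and the resulting repository can be streamed right away:

from datasets import load_dataset

# Load a local Parquet dataset and push it to the Hub; the upload goes through
# Xet-backed storage, so duplicated chunks are only transferred once.
ds = load_dataset("parquet", data_files="data/*.parquet", split="train")
ds.push_to_hub("my-username/my-large-dataset")  # hypothetical repo id

# The uploaded dataset can be streamed immediately
streamed = load_dataset("my-username/my-large-dataset", split="train", streaming=True)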

This is supported by our pyspark_huggingface package, a Spark Data Source for reading and writing HF datasets. It includes Parquet CDC and Xet support, dramatically accelerating data transfers on HF.
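For reference, here is a small sketch of what that can look like, assuming a running SparkSession with pyspark_huggingface installed (the repo ids are placeholders):

from pyspark.sql import SparkSession
import pyspark_huggingface  # registers the "huggingface" Spark Data Source

spark = SparkSession.builder.appName("hf-datasets").getOrCreate()

# Read a Hub dataset as a Spark DataFrame
df = spark.read.format("huggingface").load("username/source-dataset")

# Write a DataFrame back to the Hub; uploads benefit from Parquet CDC + Xet
df.write.format("huggingface").mode("overwrite").save("username/processed-dataset")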



Need a custom streaming pipeline?

Some data file formats are not supported in datasets, and sometimes there is a need for more control, so we made it easy to build custom streaming pipelines. This has been battle-tested in the LeRobot library to sample video frames, and in the WebDataset library to stream TAR archives.

We improved the HfFileSystem in the huggingface_hub library to efficiently read files from remote Hugging Face dataset repositories and stream data:

from huggingface_hub import HfFileSystem

path = f"hf://datasets/{dataset_id}/{path_in_repo}"
with HfFileSystem().open(path) as f:
    # stream the remote file, e.g. read it in chunks
    chunk = f.read(1024 * 1024)

Passing an HfFileSystem to a torch DataLoader reuses the cached results from .ls() and .glob(), which eliminates the need for extra requests when listing data files.
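As an illustration, here is a minimal sketch of such a custom pipeline (the HubFileDataset class and its naive shard assignment are hypothetical, and dataset_id is a placeholder as above): the file list is resolved once, and every DataLoader worker reuses it.

from huggingface_hub import HfFileSystem
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class HubFileDataset(IterableDataset):
    def __init__(self, fs: HfFileSystem, pattern: str):
        self.fs = fs
        self.files = fs.glob(pattern)  # resolved once, cached on the filesystem object

    def __iter__(self):
        worker = get_worker_info()
        # Naive sharding: each worker takes every n-th file
        files = self.files if worker is None else self.files[worker.id :: worker.num_workers]
        for file in files:
            with self.fs.open(file, "rb") as f:
                yield f.read()  # replace with format-specific parsing

fs = HfFileSystem()
dataset = HubFileDataset(fs, f"datasets/{dataset_id}/**/*.tar")
dataloader = DataLoader(dataset, batch_size=None, num_workers=4)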



Push streaming to the limit

We’re now using these streaming enhancements in nanoVLM to train the next generation of SmolVLMs. With these tweaks, we achieve better performance from streaming than from training on our cluster’s hierarchical hard disk setup. In fact, streaming is now as fast as reading the data from local SSDs! Previously, transferring data to local SSDs was the step that used to delay our trainings by three hours. For more details, check out our GitHub.



Get Started and See the Difference

These powerful new features landed in the datasets and huggingface_hub libraries. To take advantage of them, simply update your libraries and check out the documentation:

pip install --upgrade datasets huggingface_hub

To celebrate this, we pre-concatenated and shuffled all the data sources in FineVision into FineVisionMax. You can use this single combined dataset to train your VLM – no need to handle multiple datasets manually!

from datasets import load_dataset


dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

print(next(iter(dataset)))

And you can see how we do it at scale in nanoVLM!

Happy streaming! 🤗


