Bringing large-scale datasets to `lerobot`




TL;DR Today we release LeRobotDataset:v3! In our previous LeRobotDataset:v2 release, we stored one episode per file, hitting file-system limitations when scaling datasets to millions of episodes. LeRobotDataset:v3 packs multiple episodes into a single file, using relational metadata to retrieve information at the per-episode level from multi-episode files. The new format also natively supports accessing datasets in streaming mode, allowing large datasets to be processed on the fly. We offer a one-liner utility to convert any dataset in the LeRobotDataset format to the new format, and we are very excited to share this milestone with the community ahead of our next stable release!






LeRobotDataset, v3.0

LeRobotDataset is a standardized dataset format designed for the particular needs of robot learning. It provides unified, convenient access to robotics data across modalities, including sensorimotor readings, multiple camera feeds, and teleoperation status.
The format also stores general information about how the data was collected (metadata), including a textual description of the task being performed, the type of robot used, and measurement details such as the frames per second at which the image and robot state streams are sampled.
Metadata are useful for indexing and searching across robotics datasets on the Hugging Face Hub!

Inside lerobot, the robotics library we’re developing at Hugging Face, LeRobotDataset provides a unified interface for working with multi-modal, time-series data, and it integrates seamlessly with both the Hugging Face and PyTorch ecosystems.
The dataset format is designed to be easily extensible and customizable, and it already supports openly available datasets from a wide range of embodiments, including manipulator platforms such as the SO-100 arms and the ALOHA-2 setup, real-world humanoid data, simulation datasets, and even self-driving car data!
You can explore the datasets contributed by the community using the dataset visualizer! 🔗

Besides scale, this new release of LeRobotDataset also adds streaming support, allowing batches of data from large datasets to be processed on the fly, without having to download prohibitively large collections of data to disk.
You can access and use any dataset in v3.0 in streaming mode through the dedicated StreamingLeRobotDataset interface!
Streaming datasets are a key milestone towards more accessible robot learning, and we’re excited to share it with the community 🤗

LeRobotDataset v3 diagram
From episode-based to file-based datasets
StreamingLeRobotDataset
We directly enable dataset streaming from the Hugging Face Hub for on-the-fly processing.



Install lerobot, and record a dataset

lerobot is the end-to-end robotics library developed at Hugging Face, supporting real-world robotics as well as cutting-edge robot learning algorithms.
The library lets you record datasets locally, directly on real-world robots, and store them on the Hugging Face Hub.
You can read more about the robots we currently support here!

LeRobotDataset:v3 is going to be part of the lerobot library starting from lerobot-v0.4.0, and we’re very excited to share it early with the community. You can install the latest lerobot-v0.3.x supporting this new dataset format directly from GitHub using:

pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"  

Follow the community’s progress towards a stable release of the library here 🤗

Once you have installed a version of lerobot that supports the new dataset format, you can record a dataset with our signature robot arm, the SO-101, using teleoperation and the following instructions:

lerobot-record \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem585A0076841 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=5 \
    --dataset.single_task="Grab the black cube"

Head to the official documentation to see how to record a dataset for your use case.

A core design choice behind LeRobotDataset is separating the underlying data storage from the user-facing API.
This allows for efficient serialization and storage while presenting the data in an intuitive, ready-to-use format. Datasets are organized into three main components:

  1. Tabular Data: Low-dimensional, high-frequency data such as joint states and actions are stored in efficient Apache Parquet files, typically offloaded to the more mature datasets library, which provides fast, memory-mapped or streaming-based access.
  2. Visual Data: To handle large volumes of camera data, frames are concatenated and encoded into MP4 files. Frames from the same episode are always grouped into the same video, and multiple videos are grouped together by camera. To reduce stress on the file system, groups of videos for the same camera view are also split across multiple subdirectories.
  3. Metadata: A collection of JSON files describing the dataset’s structure, serving as the relational counterpart to both the tabular and visual data. Metadata includes the different feature schemas, frame rates, normalization statistics, and episode boundaries.

To support datasets with potentially millions of episodes (resulting in hundreds of millions or billions of individual frames), we merge data from different episodes into the same high-level structure.
Concretely, this means that any given tabular collection or video does not contain information about a single episode only, but a concatenation of the information from multiple episodes.
This keeps the pressure on the file system manageable, both locally and on remote storage providers like the Hugging Face Hub.
We can then use the metadata to gather episode-specific information, e.g. the timestamp at which a given episode starts or ends in a certain video.
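As a sketch of this relational lookup (using hypothetical field names, not lerobot's actual schema), the per-episode metadata can be thought of as a small table mapping each episode to the file that holds it and the row range it occupies:

```python
# Illustrative sketch only: field names below are hypothetical stand-ins for
# the per-episode records stored under meta/episodes/, not the real schema.
episodes_meta = [
    {"episode_index": 0, "file_index": 0, "from_index": 0,   "to_index": 250},
    {"episode_index": 1, "file_index": 0, "from_index": 250, "to_index": 480},
    {"episode_index": 2, "file_index": 1, "from_index": 0,   "to_index": 310},
]

def locate_episode(episode_index):
    """Return the file and row range holding a given episode's frames."""
    record = next(m for m in episodes_meta if m["episode_index"] == episode_index)
    return record["file_index"], (record["from_index"], record["to_index"])

file_idx, (start, end) = locate_episode(1)
print(file_idx, start, end)  # episode 1 lives in file 0, rows 250..480
```

This is why multi-episode files stay cheap to index: retrieving one episode is a metadata lookup followed by a slice, rather than opening a dedicated file per episode.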

Datasets are organized as repositories containing:

  • meta/info.json: This is the central metadata file. It contains the complete dataset schema, defining all features (e.g., observation.state, action), their shapes, and data types. It also stores crucial information like the dataset’s frames per second (fps), the codebase version, and the path templates used to locate data and video files.
  • meta/stats.json: This file stores aggregated statistics (mean, std, min, max) for each feature across the entire dataset. These are used for data normalization and are accessible via dataset.meta.stats.
  • meta/tasks.jsonl: Contains the mapping from natural language task descriptions to integer task indices, which are used for task-conditioned policy training.
  • meta/episodes/: This directory contains metadata about each individual episode, such as its length, its corresponding task, and pointers to where its data is stored. For scalability, this information is stored in chunked Parquet files rather than a single large JSON file.
  • data/: Contains the core frame-by-frame tabular data in Parquet files. To improve performance and handle large datasets, data from multiple episodes are concatenated into larger files. These files are organized into chunked subdirectories to keep file sizes manageable. Therefore, a single file typically contains data for more than one episode.
  • videos/: Contains the MP4 video files for all visual observation streams. Similar to the data/ directory, video footage from multiple episodes is concatenated into single MP4 files. This significantly reduces the number of files in the dataset, which is more efficient for modern file systems. The chunked path structure allows the data loader to locate the correct video file and then seek to the exact timestamp for a given frame.
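The path templates stored in meta/info.json can be thought of as format strings that resolve to concrete chunked file paths. The template below is a hypothetical illustration; the exact template string and placeholder names shipped by lerobot may differ:

```python
# Hypothetical path template, similar in spirit to those in meta/info.json.
# The real template and placeholder names used by lerobot may differ.
video_path_template = "videos/{camera_key}/chunk-{chunk:03d}/file-{file:03d}.mp4"

path = video_path_template.format(
    camera_key="observation.images.front_left", chunk=0, file=12
)
print(path)  # videos/observation.images.front_left/chunk-000/file-012.mp4
```

Because the template is stored in the metadata rather than hard-coded, the loader can resolve any frame's file location without scanning the repository.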



Migrate your v2.1 dataset to v3.0

LeRobotDataset:v3.0 will be released with lerobot-v0.4.0, together with the ability to easily convert any dataset currently hosted on the Hugging Face Hub to the new v3.0 using:

python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=<repo-id>

We’re very excited to share this new format early with the community! While we develop lerobot-v0.4.0, you can already convert your dataset to the new version by installing the latest lerobot-v0.3.x supporting this new dataset format directly from GitHub using:

pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=<repo-id>

Note that this is a pre-release and generally unstable version. You can follow the status of development of our next stable release here!

The conversion script convert_dataset_v21_to_v30.py aggregates the multiple per-episode files (episode-0000.mp4, episode-0001.mp4, episode-0002.mp4, ... and episode-0000.parquet, episode-0001.parquet, episode-0002.parquet, ...) into single files (file-0000.mp4, file-0000.parquet) and updates the metadata accordingly, so that episode-specific information can be retrieved from the higher-level files.
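The aggregation idea can be sketched in a few lines (a toy illustration, not the actual conversion script): per-episode tables are concatenated into one file-level table while the per-episode row boundaries are recorded for the metadata:

```python
# Toy sketch of the aggregation step, not the real conversion script:
# two tiny "episodes" of 3 and 2 frames, merged into one table.
episode_tables = {
    0: [{"frame": i} for i in range(3)],
    1: [{"frame": i} for i in range(2)],
}

merged, boundaries, offset = [], {}, 0
for ep_idx, rows in episode_tables.items():
    # Record where this episode starts and ends in the merged table
    boundaries[ep_idx] = (offset, offset + len(rows))
    merged.extend(rows)
    offset += len(rows)

print(len(merged), boundaries)  # 5 {0: (0, 3), 1: (3, 5)}
```

The real script does the analogous operation on Parquet tables and MP4 streams, writing the boundaries into the chunked metadata under meta/episodes/.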



Code Example: Using LeRobotDataset with torch.utils.data.DataLoader

Every dataset on the Hugging Face Hub contains the three main pillars presented above (tabular and visual data, as well as relational metadata), and can be accessed with a single line of code.

Most robot learning algorithms, whether based on reinforcement learning (RL) or behavioral cloning (BC), tend to operate on a stack of observations and actions.
For instance, RL algorithms typically use a history of previous observations o_{t-H_o:t}, while
BC algorithms are typically trained to regress chunks of multiple actions.
To accommodate the specifics of robot learning training, LeRobotDataset provides a native windowing operation: we can retrieve the observations in the seconds before and after any given frame using the delta_timestamps argument.

Conveniently, by using LeRobotDataset with a PyTorch DataLoader, the individual sample dictionaries from the dataset are automatically collated into a single dictionary of batched tensors.

import torch

from lerobot.datasets.lerobot_dataset import LeRobotDataset

repo_id = "yaak-ai/L2D-v3"

# Load the dataset from the Hugging Face Hub
dataset = LeRobotDataset(repo_id)

# Access a single frame as a dictionary of tensors
sample = dataset[100]
print(sample)

# Request a small history of frames for a given camera:
# the frames 0.2s and 0.1s before the current one, plus the current frame
delta_timestamps = {
    "observation.images.front_left": [-0.2, -0.1, 0.0]
}
dataset = LeRobotDataset(
    repo_id,
    delta_timestamps=delta_timestamps
)

# The sample now contains a stack of 3 frames for the selected camera
sample = dataset[100]
print(sample['observation.images.front_left'].shape)

# Wrap the dataset in a DataLoader to collate samples into batched tensors
batch_size = 16

data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size
)

num_epochs = 1
device = "cuda" if torch.cuda.is_available() else "cpu"

for epoch in range(num_epochs):
    for batch in data_loader:
        # Move the batched tensors to the target device
        observations = batch['observation.state.vehicle'].to(device)
        actions = batch['action.continuous'].to(device)
        images = batch['observation.images.front_left'].to(device)

        # Training step goes here
        ...


Streaming

You can also use any dataset in v3.0 format in streaming mode, without downloading it locally, by using the StreamingLeRobotDataset class.

from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

repo_id = "yaak-ai/L2D-v3"
dataset = StreamingLeRobotDataset(repo_id)  # streams from the Hub, no local download



Conclusion

LeRobotDataset v3.0 is a stepping stone towards scaling up the robotics datasets supported in LeRobot. By providing a format to store and access large collections of robot data, we are making progress towards democratizing robotics, allowing the community to train on potentially millions of episodes without even downloading the data itself!

You can try the new dataset format by installing the latest lerobot-v0.3.x, and share any feedback on GitHub or on our Discord server! 🤗



Acknowledgements

We thank the incredible yaak.ai team for their valuable support and feedback while developing LeRobotDataset:v3.
Go ahead and follow their organization on the Hugging Face Hub!
We’re always looking to collaborate with the community and share early features. Reach out if you would like to collaborate 😊


