Over the past few years, text- and image-based models have seen dramatic performance improvements, primarily thanks to scaling up model weights and dataset sizes. While the web provides an extensive database of text and images for LLMs and image generation models, robotics lacks such a vast and diverse source of high-quality data, as well as efficient data formats. Despite efforts like Open X, we're still far from achieving the scale and variety seen with Large Language Models. Moreover, we lack the tools needed for this endeavor, such as dataset formats that are lightweight, fast to load from, and easy to share and visualize online. This gap is what 🤗 LeRobot aims to address.
What’s a dataset in robotics?
In their general form — at least the one we're interested in within an end-to-end learning framework — robotics datasets typically come in two modalities: the visual modality and the robot's proprioception / target positions modality (state/action vectors). Here's what this can look like in practice:
Until now, the most common way to store the visual modality was to save each frame as an individual PNG. This is very redundant, as there's a lot of repetition among frames. Practitioners didn't use videos because loading times could be orders of magnitude longer. These datasets are often released in various formats from academic papers (hdf5, zarr, pickle, tar, zip…). These days, modern video codecs can achieve impressive compression ratios — meaning the size of the encoded video compared to the original uncompressed frames — while still preserving excellent quality. This means that with a compression ratio of 1:20, or 5% for instance (which is easily achievable), you go from a 20GB dataset down to a single GB of data. Because of this, we decided to use video encoding to store the visual modalities of our datasets.
Contribution
We propose a LeRobotDataset format that is simple, lightweight, easy to share (with native integration with the Hub), and easy to visualize.
Our datasets are on average 14% the size of their original version (going as low as 0.2% in the best case) while preserving full training capabilities by maintaining a very good level of quality. Moreover, we observed decoding times of video frames to follow this pattern, depending on resolution:
- In the nominal case where we're decoding a single frame, our loading time is comparable to that of loading the frame from a compressed image (png).
- In the advantageous case where we're decoding multiple successive frames, our loading time is 25%-50% that of loading those frames from compressed images.
On top of this, we're building tools to easily understand and browse these datasets.
You can explore a few examples yourself in the following Spaces using our visualization tool (click the images):
But what’s a codec? And what’s video encoding & decoding actually doing?
At its core, video encoding reduces the size of videos by using mainly 2 ideas:
- Spatial Compression: This is the same principle used in a compressed image like JPEG or PNG. Spatial compression uses the self-similarities of an image to reduce its size. For instance, a single frame of a video showing a blue sky will have large areas of similar color. Spatial compression takes advantage of this to compress these areas without losing much quality.
- Temporal Compression: Rather than storing each frame as is, which takes up a lot of space, temporal compression computes the differences between frames and keeps only those differences (which are generally much smaller) in the encoded video stream. At decoding time, each frame is reconstructed by applying those differences back. Of course, this approach requires at least one reference frame to start computing differences from. In practice though, we use several, placed at regular intervals. There are several reasons for this, which are detailed in this article. These "reference frames" are called keyframes or I-frames (for Intra-coded frames).
Thanks to these 2 ideas, video encoding is able to reduce the size of videos down to something manageable. Knowing this, the encoding process roughly looks like this:
- Keyframes are determined based on the user's specifications and scene changes.
- Those keyframes are compressed spatially.
- The frames in between are then compressed temporally as "differences" (also called P-frames or B-frames, more on these in the article linked above).
- These differences themselves are then compressed spatially.
- This compressed data from I-frames, P-frames, and B-frames is encoded into a bitstream.
- That video bitstream is then packaged into a container format (MP4, MKV, AVI…) along with potentially other bitstreams (audio, subtitles) and metadata.
- At this point, additional processing may be applied to reduce any visual distortions caused by compression and to ensure the overall video quality meets desired standards.
Obviously, this is a high-level summary of what's happening, and there are many moving parts and configuration choices to make in this process. Logically, we wanted to evaluate the best way of doing it given our needs and constraints, so we built a benchmark to assess this according to a number of criteria.
Criteria
While size was the initial reason we decided to go with video encoding, we soon realized that there were other aspects to consider as well. Of course, decoding time is an important one for machine learning applications, as we want to maximize the amount of time spent training rather than loading data. Quality must also stay above a certain level so as not to degrade our policies' performance. Lastly, one less obvious but equally important aspect is the compatibility of our encoded videos, so that they can be easily decoded and played on the vast majority of media players, web browsers, devices, etc. Being able to easily and quickly visualize the content of any of our datasets was an important feature for us.
To summarize, these are the criteria we wanted to optimize:
- Size: Impacts storage disk space and download times.
- Decoding time: Impacts training time.
- Quality: Impacts training accuracy.
- Compatibility: Impacts the ability to easily decode the video and visualize it across devices and platforms.
Obviously, some of these criteria are in direct contradiction: you can hardly reduce the file size without degrading quality, for example, and vice versa. The goal was therefore to find the best compromise overall.
Note that because of our specific use case and needs, some encoding settings traditionally used for media consumption don't really apply to us. A good example of that is GOP (Group of Pictures) size. More on that in a bit.
Metrics
Given those criteria, we selected metrics accordingly.
- Size compression ratio (lower is better): as mentioned, this is the size of the encoded video over the size of its set of original, unencoded frames.
- Load times ratio (lower is better): this is the time it takes to decode a given frame from a video over the time it takes to load that frame from an individual image.
For quality, we looked at 3 commonly used metrics, illustrated with a small code sketch after the list:
- Average Mean Square Error (lower is better): the average mean square error between each decoded frame and its corresponding original image over all requested timestamps, additionally divided by the number of pixels in the image to be comparable across different image sizes.
- Average Peak Signal to Noise Ratio (higher is better): measures the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation. Higher PSNR indicates better quality.
- Average Structural Similarity Index Measure (higher is better): evaluates the perceived quality of images by comparing luminance, contrast, and structure. SSIM values range from -1 to 1, where 1 indicates perfect similarity.
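To make these definitions concrete, here's a minimal sketch of how such per-frame quality metrics can be computed with numpy and scikit-image. This is not the benchmark's actual code: the function name is hypothetical and, unlike the size-normalized error described above, the sketch uses the standard per-pixel definitions.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(original: np.ndarray, decoded: np.ndarray) -> dict:
    """Compare a decoded video frame against its original image.
    Both arrays are assumed to be HxWxC uint8 images in the same color space."""
    orig = original.astype(np.float64) / 255.0
    dec = decoded.astype(np.float64) / 255.0

    # Mean squared error, averaged over all pixels and channels.
    mse = float(np.mean((orig - dec) ** 2))

    # PSNR: ratio between the maximum possible signal power and the power of the noise.
    psnr = peak_signal_noise_ratio(orig, dec, data_range=1.0)

    # SSIM: perceptual similarity based on luminance, contrast and structure.
    ssim = structural_similarity(orig, dec, data_range=1.0, channel_axis=-1)

    return {"mse": mse, "psnr": psnr, "ssim": ssim}
```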
Additionally, we tried various levels of encoding quality to get a sense of what these metrics translate to visually. However, video encoding is designed to appeal to the human eye by taking advantage of several principles of how human visual perception works, tricking our brains into maintaining a level of perceived quality. This may have a different impact on a neural net. Therefore, besides these metrics and a visual check, it was important for us to also validate that the encoding didn't degrade our policies' performance, by A/B testing it.
For compatibility, we don't have a metric per se, but it essentially boils down to the video codec and the pixel format. For the video codec, the three we chose (h264, h265 and AV1) are common and don't pose an issue. However, the pixel format matters as well, and we found afterwards that on most browsers, for instance, yuv444p is not supported and the video cannot be decoded.
Variables
Image content & size
We don't expect the same optimal settings for a dataset of images from a simulation, from the real world in an apartment, in a factory, outdoors, or with lots of moving objects in the scene, etc. Similarly, loading times may not vary linearly with the image size (resolution).
For these reasons, we ran this benchmark on 4 representative datasets:
- lerobot/pusht_image: (96 x 96 pixels) simulation with simple geometric shapes, fixed camera.
- aliberts/aloha_mobile_shrimp_image: (480 x 640 pixels) real-world indoor, moving camera.
- aliberts/paris_street: (720 x 1280 pixels) real-world outdoor, moving camera.
- aliberts/kitchen: (1080 x 1920 pixels) real-world indoor, fixed camera.
Encoding parameters
We used FFmpeg for encoding our videos. Here are the main parameters we played with:
Video Codec (vcodec)
The codec (coder-decoder) is the algorithmic engine driving the video encoding. The codec defines a format used for encoding and decoding. Note that for a given codec, several implementations may exist. For example, for AV1: libaom (official implementation), libsvtav1 (faster, encoder only), libdav1d (decoder only).
Note that the rest of the encoding parameters are interpreted differently depending on the video codec used. In other words, the same crf value used with one codec doesn't necessarily translate into the same compression level with another codec. In fact, the default value (None) isn't the same across the different video codecs. Importantly, this is also the case for many other ffmpeg arguments like g, which specifies the frequency of keyframes.
Pixel Format (pix_fmt)
Pixel format specifies both the color space (YUV, RGB, Grayscale) and, for the YUV color space, the chroma subsampling, which determines the way chrominance (color information) and luminance (brightness information) are actually stored in the resulting encoded bitstream. For instance, yuv420p indicates YUV color space with 4:2:0 chroma subsampling. This is the most common format for web video and standard playback. For the RGB color space, this parameter specifies the number of bits per pixel (e.g. rgb24 means RGB color space with 24 bits per pixel).
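As a quick back-of-the-envelope illustration of why chroma subsampling matters, here's the raw (pre-encoding) size of a single frame for the formats mentioned above, using a hypothetical 640 x 480 camera:

```python
# Raw bytes per frame before any encoding, for a hypothetical 640 x 480 camera.
width, height = 640, 480
pixels = width * height

rgb24 = pixels * 3          # 3 channels, 8 bits each
yuv444p = pixels * 3        # full-resolution luma + 2 full-resolution chroma planes
yuv420p = pixels * 3 // 2   # full-resolution luma + 2 chroma planes subsampled 2x in both directions

print(f"rgb24:   {rgb24 / 1e6:.2f} MB/frame")    # 0.92 MB
print(f"yuv444p: {yuv444p / 1e6:.2f} MB/frame")  # 0.92 MB
print(f"yuv420p: {yuv420p / 1e6:.2f} MB/frame")  # 0.46 MB, half the data before encoding even starts
```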
Group of Pictures size (g)
GOP (Group of Pictures) size determines how frequently keyframes are placed throughout the encoded bitstream. The lower that value is, the more frequently keyframes are placed. One key thing to understand is that when requesting a frame at a given timestamp, unless that frame happens to be a keyframe itself, the decoder will look for the last keyframe before that timestamp and will need to decode each subsequent frame up to the requested one. This means that increasing the GOP size increases the average decoding time of a frame, as fewer keyframes are available to start from. For typical online content such as a video on YouTube or a movie on Netflix, a keyframe placed every 2 to 4 seconds of video — 2s corresponding to a GOP size of 48 for a 24 fps video — generally translates to a smooth viewer experience, as it makes loading times acceptable for that use case (depending on hardware). For training a policy, however, we need access to any frame as fast as possible, which means we'll probably need a much lower GOP value.
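To illustrate the effect of GOP size on random access, here's a small, simplified estimate of how many frames have to be decoded to reach a requested frame, assuming a keyframe every g frames and forward-only prediction (no B-frame reordering):

```python
def frames_to_decode(requested_index: int, g: int) -> int:
    """Frames the decoder must go through to return frame `requested_index`,
    assuming a keyframe every `g` frames and forward-only prediction."""
    last_keyframe = (requested_index // g) * g
    return requested_index - last_keyframe + 1

# Requesting an arbitrary frame roughly 10 seconds into a 30 fps video:
idx = 307
print(frames_to_decode(idx, g=2))   # 2  -> at most 2 frames to decode
print(frames_to_decode(idx, g=60))  # 8 here, but up to 60 in the worst case (2s GOP at 30 fps)
```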
Constant Rate Factor (crf)
The constant rate factor represents the amount of lossy compression applied. A value of 0 means that no information is lost, while a high value (around 50-60 depending on the codec used) is very lossy.
Using this parameter rather than specifying a target bitrate is preferable, as it aims for a constant visual quality level with a potentially variable bitrate, rather than the other way around.
This table summarizes the different values we tried in our study; an example encoding command using one such combination follows the table:
| parameter | values |
|---|---|
| vcodec | libx264, libx265, libsvtav1 |
| pix_fmt | yuv444p, yuv420p |
| g | 1, 2, 3, 4, 5, 6, 10, 15, 20, 40, None |
| crf | 0, 5, 10, 15, 20, 25, 30, 40, 50, None |
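As an illustration of how these parameters fit together, here's a hedged sketch of encoding a folder of frames with FFmpeg from Python, using one of the combinations from the table. The paths and frame naming scheme are hypothetical, and some flags depend on the FFmpeg build (e.g. -crf support for libsvtav1 requires a recent version):

```python
import subprocess

def encode_video(frames_dir: str, out_path: str, fps: int = 30) -> None:
    """Encode a sequence of PNG frames into an AV1 video."""
    cmd = [
        "ffmpeg",
        "-f", "image2",
        "-framerate", str(fps),
        "-i", f"{frames_dir}/frame_%06d.png",  # hypothetical frame naming scheme
        "-vcodec", "libsvtav1",                # video codec
        "-pix_fmt", "yuv420p",                 # color space + chroma subsampling
        "-g", "2",                             # a keyframe every 2 frames
        "-crf", "30",                          # constant rate factor (quality vs. size)
        out_path,
    ]
    subprocess.run(cmd, check=True)

encode_video("episode_000/frames", "episode_000.mp4")
```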
Decoding parameters
Decoder
We tested two video decoding backends from torchvision:
- pyav (default)
- video_reader
Timestamps scenarios
Given the way video decoding works, once a keyframe has been loaded, decoding the subsequent frames is fast.
This, of course, is affected by the -g parameter during encoding, which specifies the frequency of the keyframes. Given our typical use cases in robotics policies, which might request a few timestamps at different random locations, we want to replicate these use cases with the following scenarios:
- 1_frame: 1 frame
- 2_frames: 2 consecutive frames (e.g. [t, t + 1 / fps])
- 6_frames: 6 consecutive frames (e.g. [t + i / fps for i in range(6)])
Note that this differs significantly from a typical use case like watching a movie, in which every frame is loaded sequentially from beginning to end and it's acceptable to have large values for -g.
Additionally, because some policies might request single timestamps that are a few frames apart, we also have the following scenario:
- 2_frames_4_space: 2 frames with 4 consecutive frames of spacing in between (e.g. [t, t + 5 / fps])
However, due to how video decoding is implemented with pyav, we don't have access to an accurate seek, so in practice this scenario is essentially the same as 6_frames since all 6 frames between t and t + 5 / fps will be decoded.
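To make these scenarios concrete, here's a rough sketch (not LeRobot's actual loading code) of fetching frames around requested timestamps with torchvision's pyav backend; the helper name is hypothetical and, as noted above, exact-timestamp seeking is not guaranteed with pyav:

```python
import torch
import torchvision

torchvision.set_video_backend("pyav")  # or "video_reader"

def load_frames(video_path: str, timestamps: list[float]) -> torch.Tensor:
    """Decode the frames located at (or right after) the requested timestamps, in seconds."""
    reader = torchvision.io.VideoReader(video_path, "video")
    frames = []
    for ts in timestamps:
        reader.seek(ts)       # pyav seeks to the nearest keyframe, then decodes forward
        frame = next(reader)  # first decoded frame at or after the seek point
        frames.append(frame["data"])
    return torch.stack(frames)

fps = 30
t = 1.0
# "6_frames" scenario: 6 consecutive frames starting at t
frames = load_frames("episode_000.mp4", [t + i / fps for i in range(6)])
```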
Results
After running this study, we switched to a different encoding from codebase version v1.6 onwards.
| codebase version | v1.5 | v1.6 |
|---|---|---|
| vcodec | libx264 | libsvtav1 |
| pix_fmt | yuv444p | yuv420p |
| g | 2 | 2 |
| crf | None (=23) | 30 |
We managed to gain in quality thanks to AV1 encoding, while using the more compatible yuv420p pixel format.
Sizes
We achieved an average compression ratio of about 14% across the total dataset sizes. Most of our datasets are reduced to under 40% of their original size, with some going below 1%. These variations can be attributed to the various formats these datasets originate from. Datasets with the largest size reductions often contain uncompressed images, allowing the encoder's temporal and spatial compression to drastically reduce their sizes. On the other hand, datasets where images were already stored using a form of spatial compression (such as JPEG or PNG) saw less reduction in size. Other factors, such as image resolution, also affect the effectiveness of video compression.
Table 1: Dataset sizes comparison
| repo_id | raw | ours (v1.6) | ratio (ours/raw) |
|---|---|---|---|
| lerobot/nyu_rot_dataset | 5.3MB | 318.2KB | 5.8% |
| lerobot/pusht | 29.6MB | 7.5MB | 25.3% |
| lerobot/utokyo_saytap | 55.4MB | 6.5MB | 11.8% |
| lerobot/imperialcollege_sawyer_wrist_cam | 81.9MB | 3.8MB | 4.6% |
| lerobot/utokyo_xarm_bimanual | 138.5MB | 8.1MB | 5.9% |
| lerobot/unitreeh1_two_robot_greeting | 181.2MB | 79.0MB | 43.6% |
| lerobot/usc_cloth_sim | 254.5MB | 23.7MB | 9.3% |
| lerobot/unitreeh1_rearrange_objects | 283.3MB | 138.4MB | 48.8% |
| lerobot/tokyo_u_lsmo | 335.7MB | 22.8MB | 6.8% |
| lerobot/utokyo_pr2_opening_fridge | 360.6MB | 29.2MB | 8.1% |
| lerobot/aloha_static_pingpong_test | 480.9MB | 168.5MB | 35.0% |
| lerobot/cmu_franka_exploration_dataset | 602.3MB | 18.2MB | 3.0% |
| lerobot/unitreeh1_warehouse | 666.7MB | 236.9MB | 35.5% |
| lerobot/cmu_stretch | 728.1MB | 38.7MB | 5.3% |
| lerobot/asu_table_top | 737.6MB | 39.1MB | 5.3% |
| lerobot/xarm_push_medium | 808.5MB | 15.9MB | 2.0% |
| lerobot/xarm_push_medium_replay | 808.5MB | 17.8MB | 2.2% |
| lerobot/xarm_lift_medium_replay | 808.6MB | 18.4MB | 2.3% |
| lerobot/xarm_lift_medium | 808.6MB | 17.3MB | 2.1% |
| lerobot/utokyo_pr2_tabletop_manipulation | 829.4MB | 40.6MB | 4.9% |
| lerobot/utokyo_xarm_pick_and_place | 1.3GB | 54.6MB | 4.1% |
| lerobot/aloha_static_ziploc_slide | 1.3GB | 498.4MB | 37.2% |
| lerobot/ucsd_kitchen_dataset | 1.3GB | 46.5MB | 3.4% |
| lerobot/berkeley_gnm_cory_hall | 1.4GB | 85.6MB | 6.0% |
| lerobot/aloha_static_thread_velcro | 1.5GB | 1.1GB | 73.2% |
| lerobot/austin_buds_dataset | 1.5GB | 87.8MB | 5.7% |
| lerobot/aloha_static_screw_driver | 1.5GB | 507.8MB | 33.1% |
| lerobot/aloha_static_cups_open | 1.6GB | 486.3MB | 30.4% |
| lerobot/aloha_static_towel | 1.6GB | 565.3MB | 34.0% |
| lerobot/dlr_sara_grid_clamp | 1.7GB | 93.6MB | 5.5% |
| lerobot/unitreeh1_fold_clothes | 2.0GB | 922.0MB | 44.5% |
| lerobot/droid_100* | 2.0GB | 443.0MB | 21.2% |
| lerobot/aloha_static_battery | 2.3GB | 770.5MB | 33.0% |
| lerobot/aloha_static_tape | 2.5GB | 829.6MB | 32.5% |
| lerobot/aloha_static_candy | 2.6GB | 833.4MB | 31.5% |
| lerobot/conq_hose_manipulation | 2.7GB | 634.9MB | 23.4% |
| lerobot/columbia_cairlab_pusht_real | 2.8GB | 84.8MB | 3.0% |
| lerobot/dlr_sara_pour | 2.9GB | 153.1MB | 5.1% |
| lerobot/dlr_edan_shared_control | 3.1GB | 138.4MB | 4.4% |
| lerobot/aloha_static_vinh_cup | 3.1GB | 1.0GB | 32.3% |
| lerobot/aloha_static_vinh_cup_left | 3.5GB | 1.1GB | 32.1% |
| lerobot/ucsd_pick_and_place_dataset | 3.5GB | 125.8MB | 3.5% |
| lerobot/aloha_mobile_elevator | 3.7GB | 558.5MB | 14.8% |
| lerobot/aloha_mobile_shrimp | 3.9GB | 1.3GB | 34.6% |
| lerobot/aloha_mobile_wash_pan | 4.0GB | 1.1GB | 26.5% |
| lerobot/aloha_mobile_wipe_wine | 4.3GB | 1.2GB | 28.0% |
| lerobot/aloha_static_fork_pick_up | 4.6GB | 1.4GB | 31.6% |
| lerobot/berkeley_cable_routing | 4.7GB | 309.3MB | 6.5% |
| lerobot/aloha_static_coffee | 4.7GB | 1.5GB | 31.3% |
| lerobot/nyu_franka_play_dataset* | 5.2GB | 192.1MB | 3.6% |
| lerobot/aloha_static_coffee_new | 6.1GB | 1.9GB | 31.5% |
| lerobot/austin_sirius_dataset | 6.5GB | 428.7MB | 6.4% |
| lerobot/cmu_play_fusion | 6.7GB | 470.2MB | 6.9% |
| lerobot/berkeley_gnm_sac_son* | 7.0GB | 501.4MB | 7.0% |
| lerobot/aloha_mobile_cabinet | 7.0GB | 1.6GB | 23.2% |
| lerobot/nyu_door_opening_surprising_effectiveness | 7.1GB | 378.4MB | 5.2% |
| lerobot/aloha_mobile_chair | 7.4GB | 2.0GB | 27.2% |
| lerobot/berkeley_fanuc_manipulation | 8.9GB | 312.8MB | 3.5% |
| lerobot/jaco_play | 9.2GB | 411.1MB | 4.3% |
| lerobot/viola | 10.4GB | 873.6MB | 8.2% |
| lerobot/kaist_nonprehensile | 11.7GB | 203.1MB | 1.7% |
| lerobot/berkeley_mvp | 12.3GB | 127.0MB | 1.0% |
| lerobot/uiuc_d3field* | 15.8GB | 1.4GB | 9.1% |
| lerobot/umi_cup_in_the_wild | 16.8GB | 2.9GB | 17.6% |
| lerobot/aloha_sim_transfer_cube_human | 17.9GB | 66.7MB | 0.4% |
| lerobot/aloha_sim_insertion_scripted | 17.9GB | 67.6MB | 0.4% |
| lerobot/aloha_sim_transfer_cube_scripted | 17.9GB | 68.5MB | 0.4% |
| lerobot/berkeley_gnm_recon* | 18.7GB | 29.3MB | 0.2% |
| lerobot/austin_sailor_dataset | 18.8GB | 1.1GB | 6.0% |
| lerobot/utaustin_mutex | 20.8GB | 1.4GB | 6.6% |
| lerobot/aloha_static_pro_pencil | 21.1GB | 504.0MB | 2.3% |
| lerobot/aloha_sim_insertion_human | 21.5GB | 87.3MB | 0.4% |
| lerobot/stanford_kuka_multimodal_dataset | 32.0GB | 269.9MB | 0.8% |
| lerobot/berkeley_rpt | 40.6GB | 1.1GB | 2.7% |
| lerobot/roboturk* | 45.4GB | 1.9GB | 4.1% |
| lerobot/iamlab_cmu_pickup_insert | 50.3GB | 1.8GB | 3.6% |
| lerobot/stanford_hydra_dataset | 72.5GB | 2.9GB | 4.0% |
| lerobot/berkeley_autolab_ur5* | 76.4GB | 14.4GB | 18.9% |
| lerobot/stanford_robocook* | 124.6GB | 3.8GB | 3.1% |
| lerobot/toto | 127.7GB | 5.3GB | 4.1% |
| lerobot/fmb* | 356.5GB | 4.2GB | 1.2% |
*These datasets contain depth maps which weren’t included in our format.
Loading times
Thanks to video encoding, our loading times scale significantly better with resolution. This is especially true in advantageous scenarios where we decode multiple successive frames.
| 1 frame | 2 frames | 6 frames |
|---|---|---|
| ![]() | ![]() | ![]() |
Summary
The full results of our study are available in this spreadsheet. The tables below show the averaged results for g=2 and crf=30, using backend=pyav, across all timestamp modes (1_frame, 2_frames, 6_frames).
Table 2: Ratio of video size to images size (lower is better)
| repo_id | Mega Pixels | libx264 (yuv420p) | libx264 (yuv444p) | libx265 (yuv420p) | libx265 (yuv444p) | libsvtav1 (yuv420p) |
|---|---|---|---|---|---|---|
| lerobot/pusht_image | 0.01 | 16.97% | 17.58% | 18.57% | 18.86% | 22.06% |
| aliberts/aloha_mobile_shrimp_image | 0.31 | 2.14% | 2.11% | 1.38% | 1.37% | 5.59% |
| aliberts/paris_street | 0.92 | 2.12% | 2.13% | 1.54% | 1.54% | 4.43% |
| aliberts/kitchen | 2.07 | 1.40% | 1.39% | 1.00% | 1.00% | 2.52% |
Table 3: Ratio of video to images loading times (lower is better)
| repo_id | Mega Pixels | libx264 (yuv420p) | libx264 (yuv444p) | libx265 (yuv420p) | libx265 (yuv444p) | libsvtav1 (yuv420p) |
|---|---|---|---|---|---|---|
| lerobot/pusht_image | 0.01 | 25.04 | 29.14 | 4.16 | 4.66 | 4.52 |
| aliberts/aloha_mobile_shrimp_image | 0.31 | 63.56 | 58.18 | 1.60 | 2.04 | 1.00 |
| aliberts/paris_street | 0.92 | 3.89 | 3.76 | 0.51 | 0.71 | 0.48 |
| aliberts/kitchen | 2.07 | 2.68 | 1.94 | 0.36 | 0.58 | 0.38 |
Table 4: Quality (mse: lower is better, psnr & ssim: higher is better)
| repo_id | Mega Pixels | Values | libx264 (yuv420p) | libx264 (yuv444p) | libx265 (yuv420p) | libx265 (yuv444p) | libsvtav1 (yuv420p) |
|---|---|---|---|---|---|---|---|
| lerobot/pusht_image | 0.01 | mse | 2.93E-04 | 2.09E-04 | 3.84E-04 | 3.02E-04 | 2.23E-04 |
| | | psnr | 35.42 | 36.97 | 35.06 | 36.69 | 37.12 |
| | | ssim | 98.29% | 98.83% | 98.17% | 98.69% | 98.70% |
| aliberts/aloha_mobile_shrimp_image | 0.31 | mse | 3.19E-04 | 3.02E-04 | 5.30E-04 | 5.17E-04 | 2.18E-04 |
| | | psnr | 35.80 | 36.10 | 35.01 | 35.23 | 39.83 |
| | | ssim | 95.20% | 95.20% | 94.51% | 94.56% | 97.52% |
| aliberts/paris_street | 0.92 | mse | 5.34E-04 | 5.16E-04 | 9.18E-03 | 9.17E-03 | 3.09E-04 |
| | | psnr | 33.55 | 33.75 | 29.96 | 30.06 | 35.41 |
| | | ssim | 93.94% | 93.93% | 83.11% | 83.11% | 95.50% |
| aliberts/kitchen | 2.07 | mse | 2.32E-04 | 2.06E-04 | 6.87E-04 | 6.75E-04 | 1.32E-04 |
| | | psnr | 36.77 | 37.38 | 35.27 | 35.50 | 39.20 |
| | | ssim | 95.47% | 95.58% | 95.11% | 95.13% | 96.84% |
Policies
We validated that this new format didn't impact the performance of trained policies by training some of them on our format. Their performance was on par with policies trained on the image versions.
Policies have also been trained and evaluated on AV1-encoded datasets and compared against our previous reference (h264):
- Diffusion on pusht:
- ACT on aloha_sim_transfer_cube_human:
- ACT on aloha_sim_insertion_scripted:
Future work
Video encoding/decoding is a vast and complex subject, and we're only scratching the surface here. Here are some of the things we left out of this experiment:
For the encoding, additional encoding parameters exist that aren't included in this benchmark. In particular:
- `preset`, which allows choosing encoding presets. This represents a collection of options providing a certain encoding speed to compression ratio tradeoff. By leaving this parameter unspecified, it is considered to be `medium` for libx264 and libx265 and `8` for libsvtav1 (the snippet after this list shows how such flags would be passed).
- `tune`, which allows optimizing the encoding for certain aspects (e.g. film quality, live streaming, etc.). In particular, a `fastdecode` option is available to optimize the encoded bitstream for faster decoding.
- Two-pass encoding would also be interesting to look at, as it increases quality, although it is likely to increase encoding time significantly. Note that since we're primarily interested in decoding performance (as encoding is only done once before uploading a dataset), we didn't measure encoding times nor gather any metrics on encoding. Using 1-pass encoding didn't pose any issue and didn't take a significant amount of time during this benchmark (provided we used `libsvtav1` instead of `libaom` for AV1 encoding).
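Under the same assumptions as the encoding sketch above, these options would simply be extra flags on the FFmpeg command, for instance:

```python
# Hypothetical extra flags for the encoding command sketched earlier.
# Note that valid values are codec-specific (e.g. libsvtav1 presets are numeric).
extra_args = [
    "-preset", "slower",    # encoding speed vs. compression tradeoff
    "-tune", "fastdecode",  # optimize the bitstream for faster decoding (libx264/libx265)
]
```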
A more detailed and comprehensive list of these parameters and others is available in the codecs' documentation:
Similarly, on the decoding side, other decoders exist but aren't implemented in our current benchmark. To name a few:
- torchcodec
- torchaudio
- ffmpegio
- decord
- nvc
Finally, we didn’t look into video encoding with depth maps. Although we did port datasets that include depth maps images, we aren’t using that modality for now.







