Over the past few years, text- and image-based models have seen dramatic performance improvements, primarily thanks to scaling up model weights and dataset sizes. While the web provides an extensive database of text and images for LLMs and image generation models, robotics lacks such a vast and diverse source of high-quality data, as well as efficient data formats. Despite efforts like Open X, we're still far from achieving the scale and variety seen with Large Language Models. Moreover, we lack the tools needed for this endeavor, such as dataset formats that are lightweight, fast to load from, and easy to share and visualize online. This gap is what 🤗 LeRobot aims to address.
What’s a dataset in robotics?
In their general form — at least the one we're interested in within an end-to-end learning framework — robotics datasets typically come in two modalities: the visual modality and the robot's proprioception / target positions modality (state/action vectors). Here's what this can look like in practice:
Until now, the most common way to store the visual modality was to save each frame as an individual PNG. This is very redundant, as there's a lot of repetition among frames. Practitioners didn't use videos because loading times could be orders of magnitude longer. These datasets are often released in various formats from academic papers (hdf5, zarr, pickle, tar, zip…). These days, modern video codecs can achieve impressive compression ratios — meaning the size of the encoded video compared to the original uncompressed frames — while still preserving excellent quality. This means that with a compression ratio of 1:20, or 5% for instance (which is easily achievable), you go from a 20GB dataset down to a single GB of data. Because of this, we decided to use video encoding to store the visual modalities of our datasets.
Contribution
We propose a LeRobotDataset format that is simple, lightweight, easy to share (with native integration with the Hub), and easy to visualize.
Our datasets are on average 14% the size of their original version (going as low as 0.2% in the best case) while preserving full training capabilities by maintaining a very good level of quality. Moreover, we observed decoding times of video frames to follow this pattern, depending on resolution:
- In the nominal case where we're decoding a single frame, our loading time is comparable to that of loading the frame from a compressed image (png).
- In the advantageous case where we're decoding multiple successive frames, our loading time is 25%-50% that of loading those frames from compressed images.
On top of this, we're building tools to easily understand and browse these datasets.
You can explore a few examples yourself in the following Spaces using our visualization tool (click the images):
But what’s a codec? And what’s video encoding & decoding actually doing?
At its core, video encoding reduces the size of videos by using mainly 2 ideas:
- Spatial Compression: This is the same principle used in a compressed image like JPEG or PNG. Spatial compression uses the self-similarities of an image to reduce its size. For instance, a single frame of a video showing a blue sky will have large areas of similar color. Spatial compression takes advantage of this to compress these areas without losing much quality.
- Temporal Compression: Rather than storing each frame as is, which takes up a lot of space, temporal compression computes the differences between frames and keeps only those differences (which are generally much smaller) in the encoded video stream. At decoding time, each frame is reconstructed by applying those differences back. Of course, this approach requires at least one reference frame to start computing differences from. In practice though, we use several, placed at regular intervals. There are several reasons for this, which are detailed in this article. These "reference frames" are called keyframes or I-frames (for Intra-coded frames).
Thanks to these 2 ideas, video encoding is able to reduce the size of videos down to something manageable. Knowing this, the encoding process roughly looks like this:
- Keyframes are determined based on the user's specifications and scene changes.
- Those keyframes are compressed spatially.
- The frames in between are then compressed temporally as "differences" (also called P-frames or B-frames, more on these in the article linked above).
- These differences themselves are then compressed spatially.
- This compressed data from I-frames, P-frames, and B-frames is encoded into a bitstream.
- That video bitstream is then packaged into a container format (MP4, MKV, AVI…) along with potentially other bitstreams (audio, subtitles) and metadata.
- At this point, additional processing may be applied to reduce any visual distortions caused by compression and to ensure the overall video quality meets desired standards.
Obviously, this is a high-level summary of what's happening, and there are many moving parts and configuration choices to make in this process. Logically, we wanted to evaluate the best way of doing it given our needs and constraints, so we built a benchmark to assess this according to a number of criteria.
Criteria
While size was the initial reason we decided to go with video encoding, we soon realized that there were other aspects to consider as well. Of course, decoding time is an important one for machine learning applications, as we want to maximize the amount of time spent training rather than loading data. Quality must also stay above a certain level so as not to degrade our policies' performance. Lastly, one less obvious but equally important aspect is the compatibility of our encoded videos, so that they can be easily decoded and played on the vast majority of media players, web browsers, devices, etc. Being able to easily and quickly visualize the content of any of our datasets was an important feature for us.
To summarize, these are the criteria we wanted to optimize:
- Size: Impacts storage disk space and download times.
- Decoding time: Impacts training time.
- Quality: Impacts training accuracy.
- Compatibility: Impacts the ability to easily decode the video and visualize it across devices and platforms.
Obviously, some of these criteria are in direct contradiction: you can hardly reduce the file size without degrading quality, for example, and vice versa. The goal was therefore to find the best compromise overall.
Note that because of our specific use case and needs, some encoding settings traditionally used for media consumption don't really apply to us. A good example of that is GOP (Group of Pictures) size. More on that in a bit.
Metrics
Given those criteria, we selected metrics accordingly.
- Size compression ratio (lower is better): as mentioned, this is the size of the encoded video over the size of its set of original, unencoded frames.
- Load times ratio (lower is better): this is the time it takes to decode a given frame from a video over the time it takes to load that frame from an individual image.
For quality, we looked at 3 commonly used metrics, illustrated with a small code sketch after the list:
- Average Mean Square Error (lower is better): the average mean square error between each decoded frame and its corresponding original image over all requested timestamps, additionally divided by the number of pixels in the image to be comparable across different image sizes.
- Average Peak Signal to Noise Ratio (higher is better): measures the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation. Higher PSNR indicates better quality.
- Average Structural Similarity Index Measure (higher is better): evaluates the perceived quality of images by comparing luminance, contrast, and structure. SSIM values range from -1 to 1, where 1 indicates perfect similarity.
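To make these definitions concrete, here's a minimal sketch of how such per-frame quality metrics can be computed with numpy and scikit-image. This is not the benchmark's actual code: the function name is hypothetical and, unlike the size-normalized error described above, the sketch uses the standard per-pixel definitions.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(original: np.ndarray, decoded: np.ndarray) -> dict:
    """Compare a decoded video frame against its original image.
    Both arrays are assumed to be HxWxC uint8 images in the same color space."""
    orig = original.astype(np.float64) / 255.0
    dec = decoded.astype(np.float64) / 255.0

    # Mean squared error, averaged over all pixels and channels.
    mse = float(np.mean((orig - dec) ** 2))

    # PSNR: ratio between the maximum possible signal power and the power of the noise.
    psnr = peak_signal_noise_ratio(orig, dec, data_range=1.0)

    # SSIM: perceptual similarity based on luminance, contrast and structure.
    ssim = structural_similarity(orig, dec, data_range=1.0, channel_axis=-1)

    return {"mse": mse, "psnr": psnr, "ssim": ssim}
```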
Additionally, we tried various levels of encoding quality to get a sense of what these metrics translate to visually. However, video encoding is designed to appeal to the human eye by taking advantage of several principles of how human visual perception works, tricking our brains into maintaining a level of perceived quality. This may have a different impact on a neural net. Therefore, besides these metrics and a visual check, it was important for us to also validate that the encoding didn't degrade our policies' performance, by A/B testing it.
For compatibility, we don't have a metric per se, but it essentially boils down to the video codec and the pixel format. For the video codec, the three we chose (h264, h265 and AV1) are common and don't pose an issue. However, the pixel format matters as well, and we found afterwards that on most browsers, for instance, yuv444p is not supported and the video cannot be decoded.
Variables
Image content & size
We don't expect the same optimal settings for a dataset of images from a simulation, from the real world in an apartment, in a factory, outdoors, or with lots of moving objects in the scene, etc. Similarly, loading times may not vary linearly with the image size (resolution).
For these reasons, we ran this benchmark on 4 representative datasets:
- lerobot/pusht_image: (96 x 96 pixels) simulation with simple geometric shapes, fixed camera.
- aliberts/aloha_mobile_shrimp_image: (480 x 640 pixels) real-world indoor, moving camera.
- aliberts/paris_street: (720 x 1280 pixels) real-world outdoor, moving camera.
- aliberts/kitchen: (1080 x 1920 pixels) real-world indoor, fixed camera.
Encoding parameters
We used FFmpeg for encoding our videos. Here are the main parameters we played with:
Video Codec (vcodec)
The codec (coder-decoder) is the algorithmic engine driving the video encoding. The codec defines a format used for encoding and decoding. Note that for a given codec, several implementations may exist. For example, for AV1: libaom (official implementation), libsvtav1 (faster, encoder only), libdav1d (decoder only).
Note that the rest of the encoding parameters are interpreted differently depending on the video codec used. In other words, the same crf value used with one codec doesn't necessarily translate into the same compression level with another codec. In fact, the default value (None) isn't the same across the different video codecs. Importantly, this is also the case for many other ffmpeg arguments like g, which specifies the frequency of keyframes.
Pixel Format (pix_fmt)
Pixel format specifies both the color space (YUV, RGB, Grayscale) and, for the YUV color space, the chroma subsampling, which determines the way chrominance (color information) and luminance (brightness information) are actually stored in the resulting encoded bitstream. For instance, yuv420p indicates YUV color space with 4:2:0 chroma subsampling. This is the most common format for web video and standard playback. For the RGB color space, this parameter specifies the number of bits per pixel (e.g. rgb24 means RGB color space with 24 bits per pixel).
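As a quick back-of-the-envelope illustration of why chroma subsampling matters, here's the raw (pre-encoding) size of a single frame for the formats mentioned above, using a hypothetical 640 x 480 camera:

```python
# Raw bytes per frame before any encoding, for a hypothetical 640 x 480 camera.
width, height = 640, 480
pixels = width * height

rgb24 = pixels * 3          # 3 channels, 8 bits each
yuv444p = pixels * 3        # full-resolution luma + 2 full-resolution chroma planes
yuv420p = pixels * 3 // 2   # full-resolution luma + 2 chroma planes subsampled 2x in both directions

print(f"rgb24:   {rgb24 / 1e6:.2f} MB/frame")    # 0.92 MB
print(f"yuv444p: {yuv444p / 1e6:.2f} MB/frame")  # 0.92 MB
print(f"yuv420p: {yuv420p / 1e6:.2f} MB/frame")  # 0.46 MB, half the data before encoding even starts
```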
Group of Pictures size (g)
GOP (Group of Pictures) size determines how frequently keyframes are placed throughout the encoded bitstream. The lower that value is, the more frequently keyframes are placed. One key thing to understand is that when requesting a frame at a given timestamp, unless that frame happens to be a keyframe itself, the decoder will look for the last keyframe before that timestamp and will need to decode each subsequent frame up to the requested one. This means that increasing the GOP size increases the average decoding time of a frame, as fewer keyframes are available to start from. For typical online content such as a video on YouTube or a movie on Netflix, a keyframe placed every 2 to 4 seconds of video — 2s corresponding to a GOP size of 48 for a 24 fps video — generally translates to a smooth viewer experience, as it makes loading times acceptable for that use case (depending on hardware). For training a policy, however, we need access to any frame as fast as possible, which means we'll probably need a much lower GOP value.
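To illustrate the effect of GOP size on random access, here's a small, simplified estimate of how many frames have to be decoded to reach a requested frame, assuming a keyframe every g frames and forward-only prediction (no B-frame reordering):

```python
def frames_to_decode(requested_index: int, g: int) -> int:
    """Frames the decoder must go through to return frame `requested_index`,
    assuming a keyframe every `g` frames and forward-only prediction."""
    last_keyframe = (requested_index // g) * g
    return requested_index - last_keyframe + 1

# Requesting an arbitrary frame roughly 10 seconds into a 30 fps video:
idx = 307
print(frames_to_decode(idx, g=2))   # 2  -> at most 2 frames to decode
print(frames_to_decode(idx, g=60))  # 8 here, but up to 60 in the worst case (2s GOP at 30 fps)
```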
Constant Rate Factor (crf)
The constant rate factor represents the amount of lossy compression applied. A value of 0 means that no information is lost, while a high value (around 50-60 depending on the codec used) is very lossy.
Using this parameter rather than specifying a target bitrate is preferable, as it aims for a constant visual quality level with a potentially variable bitrate, rather than the other way around.
This table summarizes the different values we tried in our study; an example encoding command using one such combination follows the table:
| parameter | values |
|---|---|
| vcodec | libx264, libx265, libsvtav1 |
| pix_fmt | yuv444p, yuv420p |
| g | 1, 2, 3, 4, 5, 6, 10, 15, 20, 40, None |
| crf | 0, 5, 10, 15, 20, 25, 30, 40, 50, None |
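As an illustration of how these parameters fit together, here's a hedged sketch of encoding a folder of frames with FFmpeg from Python, using one of the combinations from the table. The paths and frame naming scheme are hypothetical, and some flags depend on the FFmpeg build (e.g. -crf support for libsvtav1 requires a recent version):

```python
import subprocess

def encode_video(frames_dir: str, out_path: str, fps: int = 30) -> None:
    """Encode a sequence of PNG frames into an AV1 video."""
    cmd = [
        "ffmpeg",
        "-f", "image2",
        "-framerate", str(fps),
        "-i", f"{frames_dir}/frame_%06d.png",  # hypothetical frame naming scheme
        "-vcodec", "libsvtav1",                # video codec
        "-pix_fmt", "yuv420p",                 # color space + chroma subsampling
        "-g", "2",                             # a keyframe every 2 frames
        "-crf", "30",                          # constant rate factor (quality vs. size)
        out_path,
    ]
    subprocess.run(cmd, check=True)

encode_video("episode_000/frames", "episode_000.mp4")
```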
Decoding parameters
Decoder
We tested two video decoding backends from torchvision:
- pyav (default)
- video_reader
Timestamps scenarios
Given the way video decoding works, once a keyframe has been loaded, decoding the subsequent frames is fast.
This, of course, is affected by the -g parameter during encoding, which specifies the frequency of the keyframes. Given our typical use cases in robotics policies, which might request a few timestamps at different random locations, we want to replicate these use cases with the following scenarios:
- 1_frame: 1 frame
- 2_frames: 2 consecutive frames (e.g. [t, t + 1 / fps])
- 6_frames: 6 consecutive frames (e.g. [t + i / fps for i in range(6)])
Note that this differs significantly from a typical use case like watching a movie, in which every frame is loaded sequentially from beginning to end and it's acceptable to have large values for -g.
Additionally, because some policies might request single timestamps that are a few frames apart, we also have the following scenario:
- 2_frames_4_space: 2 frames with 4 consecutive frames of spacing in between (e.g. [t, t + 5 / fps])
However, due to how video decoding is implemented with pyav, we don't have access to an accurate seek, so in practice this scenario is essentially the same as 6_frames since all 6 frames between t and t + 5 / fps will be decoded.
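To make these scenarios concrete, here's a rough sketch (not LeRobot's actual loading code) of fetching frames around requested timestamps with torchvision's pyav backend; the helper name is hypothetical and, as noted above, exact-timestamp seeking is not guaranteed with pyav:

```python
import torch
import torchvision

torchvision.set_video_backend("pyav")  # or "video_reader"

def load_frames(video_path: str, timestamps: list[float]) -> torch.Tensor:
    """Decode the frames located at (or right after) the requested timestamps, in seconds."""
    reader = torchvision.io.VideoReader(video_path, "video")
    frames = []
    for ts in timestamps:
        reader.seek(ts)       # pyav seeks to the nearest keyframe, then decodes forward
        frame = next(reader)  # first decoded frame at or after the seek point
        frames.append(frame["data"])
    return torch.stack(frames)

fps = 30
t = 1.0
# "6_frames" scenario: 6 consecutive frames starting at t
frames = load_frames("episode_000.mp4", [t + i / fps for i in range(6)])
```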
Results
After running this study, we switched to a different encoding from codebase version v1.6 onwards.
| codebase version | v1.5 | v1.6 |
|---|---|---|
| vcodec | libx264 | libsvtav1 |
| pix_fmt | yuv444p | yuv420p |
| g | 2 | 2 |
| crf | None (=23) | 30 |
We managed to gain in quality thanks to AV1 encoding, while using the more compatible yuv420p pixel format.
Sizes
We achieved an average compression ratio of about 14% across the total dataset sizes. Most of our datasets are reduced to under 40% of their original size, with some going below 1%. These variations can be attributed to the various formats these datasets originate from. Datasets with the largest size reductions often contain uncompressed images, allowing the encoder's temporal and spatial compression to drastically reduce their sizes. On the other hand, datasets where images were already stored using a form of spatial compression (such as JPEG or PNG) saw less reduction in size. Other factors, such as image resolution, also affect the effectiveness of video compression.
Table 1: Dataset sizes comparison
| repo_id | raw | ours (v1.6) | ratio (ours/raw) |
|---|---|---|---|
| lerobot/nyu_rot_dataset | 5.3MB | 318.2KB | 5.8% |
| lerobot/pusht | 29.6MB | 7.5MB | 25.3% |
| lerobot/utokyo_saytap | 55.4MB | 6.5MB | 11.8% |
| lerobot/imperialcollege_sawyer_wrist_cam | 81.9MB | 3.8MB | 4.6% |
| lerobot/utokyo_xarm_bimanual | 138.5MB | 8.1MB | 5.9% |
| lerobot/unitreeh1_two_robot_greeting | 181.2MB | 79.0MB | 43.6% |
| lerobot/usc_cloth_sim | 254.5MB | 23.7MB | 9.3% |
| lerobot/unitreeh1_rearrange_objects | 283.3MB | 138.4MB | 48.8% |
| lerobot/tokyo_u_lsmo | 335.7MB | 22.8MB | 6.8% |
| lerobot/utokyo_pr2_opening_fridge | 360.6MB | 29.2MB | 8.1% |
| lerobot/aloha_static_pingpong_test | 480.9MB | 168.5MB | 35.0% |
| lerobot/cmu_franka_exploration_dataset | 602.3MB | 18.2MB | 3.0% |
| lerobot/unitreeh1_warehouse | 666.7MB | 236.9MB | 35.5% |
| lerobot/cmu_stretch | 728.1MB | 38.7MB | 5.3% |
| lerobot/asu_table_top | 737.6MB | 39.1MB | 5.3% |
| lerobot/xarm_push_medium | 808.5MB | 15.9MB | 2.0% |
| lerobot/xarm_push_medium_replay | 808.5MB | 17.8MB | 2.2% |
| lerobot/xarm_lift_medium_replay | 808.6MB | 18.4MB | 2.3% |
| lerobot/xarm_lift_medium | 808.6MB | 17.3MB | 2.1% |
| lerobot/utokyo_pr2_tabletop_manipulation | 829.4MB | 40.6MB | 4.9% |
| lerobot/utokyo_xarm_pick_and_place | 1.3GB | 54.6MB | 4.1% |
| lerobot/aloha_static_ziploc_slide | 1.3GB | 498.4MB | 37.2% |
| lerobot/ucsd_kitchen_dataset | 1.3GB | 46.5MB | 3.4% |
| lerobot/berkeley_gnm_cory_hall | 1.4GB | 85.6MB | 6.0% |
| lerobot/aloha_static_thread_velcro | 1.5GB | 1.1GB | 73.2% |
| lerobot/austin_buds_dataset | 1.5GB | 87.8MB | 5.7% |
| lerobot/aloha_static_screw_driver | 1.5GB | 507.8MB | 33.1% |
| lerobot/aloha_static_cups_open | 1.6GB | 486.3MB | 30.4% |
| lerobot/aloha_static_towel | 1.6GB | 565.3MB | 34.0% |
| lerobot/dlr_sara_grid_clamp | 1.7GB | 93.6MB | 5.5% |
| lerobot/unitreeh1_fold_clothes | 2.0GB | 922.0MB | 44.5% |
| lerobot/droid_100* | 2.0GB | 443.0MB | 21.2% |
| lerobot/aloha_static_battery | 2.3GB | 770.5MB | 33.0% |
| lerobot/aloha_static_tape | 2.5GB | 829.6MB | 32.5% |
| lerobot/aloha_static_candy | 2.6GB | 833.4MB | 31.5% |
| lerobot/conq_hose_manipulation | 2.7GB | 634.9MB | 23.4% |
| lerobot/columbia_cairlab_pusht_real | 2.8GB | 84.8MB | 3.0% |
| lerobot/dlr_sara_pour | 2.9GB | 153.1MB | 5.1% |
| lerobot/dlr_edan_shared_control | 3.1GB | 138.4MB | 4.4% |
| lerobot/aloha_static_vinh_cup | 3.1GB | 1.0GB | 32.3% |
| lerobot/aloha_static_vinh_cup_left | 3.5GB | 1.1GB | 32.1% |
| lerobot/ucsd_pick_and_place_dataset | 3.5GB | 125.8MB | 3.5% |
| lerobot/aloha_mobile_elevator | 3.7GB | 558.5MB | 14.8% |
| lerobot/aloha_mobile_shrimp | 3.9GB | 1.3GB | 34.6% |
| lerobot/aloha_mobile_wash_pan | 4.0GB | 1.1GB | 26.5% |
| lerobot/aloha_mobile_wipe_wine | 4.3GB | 1.2GB | 28.0% |
| lerobot/aloha_static_fork_pick_up | 4.6GB | 1.4GB | 31.6% |
| lerobot/berkeley_cable_routing | 4.7GB | 309.3MB | 6.5% |
| lerobot/aloha_static_coffee | 4.7GB | 1.5GB | 31.3% |
| lerobot/nyu_franka_play_dataset* | 5.2GB | 192.1MB | 3.6% |
| lerobot/aloha_static_coffee_new | 6.1GB | 1.9GB | 31.5% |
| lerobot/austin_sirius_dataset | 6.5GB | 428.7MB | 6.4% |
| lerobot/cmu_play_fusion | 6.7GB | 470.2MB | 6.9% |
| lerobot/berkeley_gnm_sac_son* | 7.0GB | 501.4MB | 7.0% |
| lerobot/aloha_mobile_cabinet | 7.0GB | 1.6GB | 23.2% |
| lerobot/nyu_door_opening_surprising_effectiveness | 7.1GB | 378.4MB | 5.2% |
| lerobot/aloha_mobile_chair | 7.4GB | 2.0GB | 27.2% |
| lerobot/berkeley_fanuc_manipulation | 8.9GB | 312.8MB | 3.5% |
| lerobot/jaco_play | 9.2GB | 411.1MB | 4.3% |
| lerobot/viola | 10.4GB | 873.6MB | 8.2% |
| lerobot/kaist_nonprehensile | 11.7GB | 203.1MB | 1.7% |
| lerobot/berkeley_mvp | 12.3GB | 127.0MB | 1.0% |
| lerobot/uiuc_d3field* | 15.8GB | 1.4GB | 9.1% |
| lerobot/umi_cup_in_the_wild | 16.8GB | 2.9GB | 17.6% |
| lerobot/aloha_sim_transfer_cube_human | 17.9GB | 66.7MB | 0.4% |
| lerobot/aloha_sim_insertion_scripted | 17.9GB | 67.6MB | 0.4% |
| lerobot/aloha_sim_transfer_cube_scripted | 17.9GB | 68.5MB | 0.4% |
| lerobot/berkeley_gnm_recon* | 18.7GB | 29.3MB | 0.2% |
| lerobot/austin_sailor_dataset | 18.8GB | 1.1GB | 6.0% |
| lerobot/utaustin_mutex | 20.8GB | 1.4GB | 6.6% |
| lerobot/aloha_static_pro_pencil | 21.1GB | 504.0MB | 2.3% |
| lerobot/aloha_sim_insertion_human | 21.5GB | 87.3MB | 0.4% |
| lerobot/stanford_kuka_multimodal_dataset | 32.0GB | 269.9MB | 0.8% |
| lerobot/berkeley_rpt | 40.6GB | 1.1GB | 2.7% |
| lerobot/roboturk* | 45.4GB | 1.9GB | 4.1% |
| lerobot/iamlab_cmu_pickup_insert | 50.3GB | 1.8GB | 3.6% |
| lerobot/stanford_hydra_dataset | 72.5GB | 2.9GB | 4.0% |
| lerobot/berkeley_autolab_ur5* | 76.4GB | 14.4GB | 18.9% |
| lerobot/stanford_robocook* | 124.6GB | 3.8GB | 3.1% |
| lerobot/toto | 127.7GB | 5.3GB | 4.1% |
| lerobot/fmb* | 356.5GB | 4.2GB | 1.2% |
*These datasets contain depth maps which weren’t included in our format.
Loading times
Thanks to video encoding, our loading times scale significantly better with resolution. This is especially true in advantageous scenarios where we decode multiple successive frames.
| 1 frame | 2 frames | 6 frames |
|---|---|---|
| ![]() | ![]() | ![]() |
Summary
The full results of our study are available in this spreadsheet. The tables below show the averaged results for g=2 and crf=30, using backend=pyav, across all timestamp modes (1_frame, 2_frames, 6_frames).
Table 2: Ratio of video size to images size (lower is better)
| repo_id | Mega Pixels | libx264 (yuv420p) | libx264 (yuv444p) | libx265 (yuv420p) | libx265 (yuv444p) | libsvtav1 (yuv420p) |
|---|---|---|---|---|---|---|
| lerobot/pusht_image | 0.01 | 16.97% | 17.58% | 18.57% | 18.86% | 22.06% |
| aliberts/aloha_mobile_shrimp_image | 0.31 | 2.14% | 2.11% | 1.38% | 1.37% | 5.59% |
| aliberts/paris_street | 0.92 | 2.12% | 2.13% | 1.54% | 1.54% | 4.43% |
| aliberts/kitchen | 2.07 | 1.40% | 1.39% | 1.00% | 1.00% | 2.52% |
Table 3: Ratio of video to images loading times (lower is better)
| repo_id | Mega Pixels | libx264 (yuv420p) | libx264 (yuv444p) | libx265 (yuv420p) | libx265 (yuv444p) | libsvtav1 (yuv420p) |
|---|---|---|---|---|---|---|
| lerobot/pusht_image | 0.01 | 25.04 | 29.14 | 4.16 | 4.66 | 4.52 |
| aliberts/aloha_mobile_shrimp_image | 0.31 | 63.56 | 58.18 | 1.60 | 2.04 | 1.00 |
| aliberts/paris_street | 0.92 | 3.89 | 3.76 | 0.51 | 0.71 | 0.48 |
| aliberts/kitchen | 2.07 | 2.68 | 1.94 | 0.36 | 0.58 | 0.38 |
Table 4: Quality (mse: lower is better, psnr & ssim: higher is better)
| repo_id | Mega Pixels | Values | libx264 (yuv420p) | libx264 (yuv444p) | libx265 (yuv420p) | libx265 (yuv444p) | libsvtav1 (yuv420p) |
|---|---|---|---|---|---|---|---|
| lerobot/pusht_image | 0.01 | mse | 2.93E-04 | 2.09E-04 | 3.84E-04 | 3.02E-04 | 2.23E-04 |
| | | psnr | 35.42 | 36.97 | 35.06 | 36.69 | 37.12 |
| | | ssim | 98.29% | 98.83% | 98.17% | 98.69% | 98.70% |
| aliberts/aloha_mobile_shrimp_image | 0.31 | mse | 3.19E-04 | 3.02E-04 | 5.30E-04 | 5.17E-04 | 2.18E-04 |
| | | psnr | 35.80 | 36.10 | 35.01 | 35.23 | 39.83 |
| | | ssim | 95.20% | 95.20% | 94.51% | 94.56% | 97.52% |
| aliberts/paris_street | 0.92 | mse | 5.34E-04 | 5.16E-04 | 9.18E-03 | 9.17E-03 | 3.09E-04 |
| | | psnr | 33.55 | 33.75 | 29.96 | 30.06 | 35.41 |
| | | ssim | 93.94% | 93.93% | 83.11% | 83.11% | 95.50% |
| aliberts/kitchen | 2.07 | mse | 2.32E-04 | 2.06E-04 | 6.87E-04 | 6.75E-04 | 1.32E-04 |
| | | psnr | 36.77 | 37.38 | 35.27 | 35.50 | 39.20 |
| | | ssim | 95.47% | 95.58% | 95.11% | 95.13% | 96.84% |
Policies
We validated that this new format didn't impact the performance of trained policies by training some of them on our format. Their performance was on par with policies trained on the image versions.
Policies have also been trained and evaluated on AV1-encoded datasets and compared against our previous reference (h264):
- Diffusion on pusht:
- ACT on aloha_sim_transfer_cube_human:
- ACT on aloha_sim_insertion_scripted:
Future work
Video encoding/decoding is a vast and complex subject, and we're only scratching the surface here. Here are some of the things we left out of this experiment:
For the encoding, additional encoding parameters exist that aren't included in this benchmark. In particular:
- `preset`, which allows choosing encoding presets. This represents a collection of options providing a certain encoding speed to compression ratio tradeoff. By leaving this parameter unspecified, it is considered to be `medium` for libx264 and libx265 and `8` for libsvtav1 (the snippet after this list shows how such flags would be passed).
- `tune`, which allows optimizing the encoding for certain aspects (e.g. film quality, live streaming, etc.). In particular, a `fastdecode` option is available to optimize the encoded bitstream for faster decoding.
- Two-pass encoding would also be interesting to look at, as it increases quality, although it is likely to increase encoding time significantly. Note that since we're primarily interested in decoding performance (as encoding is only done once before uploading a dataset), we didn't measure encoding times nor gather any metrics on encoding. Using 1-pass encoding didn't pose any issue and didn't take a significant amount of time during this benchmark (provided we used `libsvtav1` instead of `libaom` for AV1 encoding).
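Under the same assumptions as the encoding sketch above, these options would simply be extra flags on the FFmpeg command, for instance:

```python
# Hypothetical extra flags for the encoding command sketched earlier.
# Note that valid values are codec-specific (e.g. libsvtav1 presets are numeric).
extra_args = [
    "-preset", "slower",    # encoding speed vs. compression tradeoff
    "-tune", "fastdecode",  # optimize the bitstream for faster decoding (libx264/libx265)
]
```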
A more detailed and comprehensive list of these parameters and others is available in the codecs' documentation:
Similarly, on the decoding side, other decoders exist but aren't implemented in our current benchmark. To name a few:
- torchcodec
- torchaudio
- ffmpegio
- decord
- nvc
Finally, we didn’t look into video encoding with depth maps. Although we did port datasets that include depth maps images, we aren’t using that modality for now.







