Update (29/08/2023): A benchmark on H100 was added to this blog post. Also, all performance numbers have been updated with newer versions of software.
Optimum Habana v1.7 on Habana Gaudi2 achieves x2.5 speedups compared to A100 and x1.4 compared to H100 when fine-tuning BridgeTower, a state-of-the-art vision-language model. This performance improvement relies on hardware-accelerated data loading to make the most of your devices.
These techniques apply to any other workloads constrained by data loading, which is frequently the case for many types of vision models. This post will take you through the process and benchmark we used to compare BridgeTower fine-tuning on Habana Gaudi2, Nvidia H100 and Nvidia A100 80GB. It also demonstrates how easy it is to take advantage of these features in transformers-based models.
BridgeTower
In the recent past, Vision-Language (VL) models have gained tremendous importance and shown dominance in a variety of VL tasks. The most common approaches leverage uni-modal encoders to extract representations from their respective modalities. Then those representations are either fused together, or fed into a cross-modal encoder. To efficiently handle some of the performance limitations and restrictions in VL representation learning, BridgeTower introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations at different semantic levels in the cross-modal encoder.
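To make the idea concrete, here is a conceptual sketch of a bridge layer. It is a simplification (the actual implementation in Transformers supports several fusion variants), showing only the basic "add & norm" fusion of a uni-modal encoder layer's output into the cross-modal stream:

```python
import torch.nn as nn

class BridgeLayer(nn.Module):
    """Simplified sketch of a BridgeTower bridge layer, not the exact implementation."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, cross_modal_states, uni_modal_states):
        # element-wise addition followed by normalization ("add & norm")
        return self.norm(cross_modal_states + uni_modal_states)
```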
Pre-trained with only 4M images (see details below), BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, BridgeTower achieves an accuracy of 78.73% on the VQAv2 test-std set, outperforming the previous state-of-the-art model (METER) by 1.09% using the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.
Hardware
NVIDIA H100 Tensor Core GPU is the latest and fastest generation of Nvidia GPUs. It includes a dedicated Transformer Engine that enables performing fp8 mixed-precision runs. One device has 80GB of memory.
Nvidia A100 Tensor Core GPU includes the third generation of the Tensor Core technology. This is still the fastest GPU that you will find at most cloud providers. We use here the 80GB-memory variant, which also offers faster memory bandwidth than the 40GB one.
Habana Gaudi2 is the second-generation AI hardware accelerator designed by Habana Labs. A single server contains 8 accelerator devices called HPUs with 96GB of memory each. Check out our previous blog post for a more in-depth introduction and a guide showing how to access it through the Intel Developer Cloud. Unlike many AI accelerators on the market, advanced features are very easy to use on Gaudi2 thanks to Optimum Habana, which enables users to port Transformers-compatible scripts to Gaudi with just a 2-line change.
Benchmark
To benchmark training, we are going to fine-tune a BridgeTower Large checkpoint consisting of 866M parameters. This checkpoint was pretrained on English language using masked language modeling, image-text matching and image-text contrastive loss on Conceptual Captions, SBU Captions, MSCOCO Captions and Visual Genome.
We will further fine-tune this checkpoint on the New Yorker Caption Contest dataset, which consists of cartoons from The New Yorker together with the most voted captions.
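As a quick sketch of what the benchmark works with, here is how you could load the checkpoint and the dataset yourself (the training script used below takes care of this for you):

```python
from datasets import load_dataset
from transformers import BridgeTowerProcessor, BridgeTowerModel

# The "matching" config pairs each cartoon with candidate captions
dataset = load_dataset("jmhessel/newyorker_caption_contest", "matching")

checkpoint = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(checkpoint)
model = BridgeTowerModel.from_pretrained(checkpoint)
```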
Hyperparameters are the same for all accelerators. We used a batch size of 48 samples for each device. You can check out the hyperparameters here for Gaudi2 and there for A100.
When dealing with datasets involving images, data loading is frequently a bottleneck because many costly operations are computed on CPU (image decoding, image augmentations) and then full images are sent to the training devices. Ideally, we would like to send only raw bytes to devices and then perform decoding and the various image transformations on device. But let's first see how to easily allocate more resources to data loading to accelerate your runs.
Making use of dataloader_num_workers
When image loading is done on CPU, a quick way to speed it up is to allocate more subprocesses for data loading. This is very easy to do with Transformers' TrainingArguments (or its Optimum Habana counterpart GaudiTrainingArguments): you can use the dataloader_num_workers=N argument to set the number of subprocesses (N) allocated on CPU for data loading.
The default is 0, which means that data is loaded in the main process. This may not be optimal as the main process has many things to manage. We can set it to 1 to have one fully dedicated subprocess for data loading. When several subprocesses are allocated, each one of them will be responsible for preparing a batch. This means that RAM consumption will increase with the number of workers. One recommendation would be to set it to the number of CPU cores, but those cores may not be fully free, so you will have to try it out to find the best configuration.
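As a minimal sketch (argument values are illustrative), this is all it takes:

```python
from transformers import TrainingArguments

# Dedicate 2 CPU subprocesses to data loading; GaudiTrainingArguments
# accepts the exact same argument on Gaudi.
training_args = TrainingArguments(
    output_dir="/tmp/bridgetower-test",
    per_device_train_batch_size=48,
    dataloader_num_workers=2,  # default is 0: everything in the main process
)
```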
Let’s run the three following experiments:
- a mixed-precision (bfloat16/float32) run distributed across 8 devices where data loading is performed by the same process as everything else (i.e. dataloader_num_workers=0)
- a mixed-precision (bfloat16/float32) run distributed across 8 devices with 1 dedicated subprocess for data loading (i.e. dataloader_num_workers=1)
- the same run with dataloader_num_workers=2
Here are the throughputs we got on Gaudi2, H100 and A100:
| Device | dataloader_num_workers=0 | dataloader_num_workers=1 | dataloader_num_workers=2 |
|---|---|---|---|
| Gaudi2 HPU | 601.5 samples/s | 747.4 samples/s | 768.7 samples/s |
| H100 GPU | 336.5 samples/s | 580.1 samples/s | 602.1 samples/s |
| A100 GPU | 227.5 samples/s | 339.7 samples/s | 345.4 samples/s |
We first see that Gaudi2 is x1.28 faster than H100 with dataloader_num_workers=2, x1.29 faster with dataloader_num_workers=1 and x1.79 faster with dataloader_num_workers=0. Gaudi2 is also much faster than the previous generation, as it is x2.23 faster than A100 with dataloader_num_workers=2, x2.20 faster with dataloader_num_workers=1 and x2.64 faster with dataloader_num_workers=0, which is even better than the speedups we previously reported!
Second, we see that allocating more resources for data loading can lead to easy speedups: x1.28 on Gaudi2, x1.79 on H100 and x1.52 on A100.
We also ran experiments with several dedicated subprocesses for data loading, but performance was not better than with dataloader_num_workers=2 for all accelerators.
Thus, using dataloader_num_workers>0 is usually a good first way of accelerating your runs involving images!
Tensorboard logs can be visualized here for Gaudi2 and there for A100.
Hardware-accelerated data loading with Optimum Habana
For even larger speedups, we are now going to move as many data loading operations as possible from the CPU to the accelerator devices (i.e. HPUs on Gaudi2 or GPUs on A100/H100). This can be done on Gaudi2 using Habana's media pipeline.
Given a dataset, most dataloaders follow the next recipe:
- Fetch data (e.g. where your JPEG images are stored on disk)
- The CPU reads encoded images
- The CPU decodes images
- The CPU applies image transformations to augment images
- Finally, images are sent to devices (although this is usually not done by the dataloader itself)
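As a sketch, here is what this classic recipe typically looks like with PIL and torchvision (the file name and crop size are illustrative):

```python
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(288),  # CPU: augmentation
    transforms.ToTensor(),
])

image = Image.open("image.jpg")  # CPU: read and decode
tensor = transform(image)        # CPU: transform the full image
tensor = tensor.to("cuda")       # only now is the image sent to the device
```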
Instead of doing the whole process on CPU and sending ready-to-train data to devices, a more efficient workflow is to send encoded images to devices first and then perform image decoding and augmentations:
- Same as before
- Same as before
- Encoded images are sent to devices
- Devices decode images
- Devices apply image transformations to augment images
That way, we can benefit from the computing power of our devices to speed up image decoding and transformations.
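On Nvidia GPUs, this idea can be sketched with torchvision's nvJPEG-backed decoder (assuming a recent torchvision and a CUDA device; on Gaudi2, the equivalent is Habana's media pipeline presented below, not this API):

```python
from torchvision.io import read_file, decode_jpeg
from torchvision.transforms import v2

raw_bytes = read_file("image.jpg")                    # CPU: read encoded bytes only
image = decode_jpeg(raw_bytes, device="cuda")         # device: hardware-accelerated decoding
image = v2.RandomResizedCrop(size=(288, 288))(image)  # device: augmentation
```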
Note that there are two caveats to be aware of when doing this:
- Device memory consumption will increase, so you may have to reduce your batch size if there is not enough free memory. This may mitigate the speedup brought by this approach.
- If devices are intensively used (100% or close to it) when doing data loading on CPU, do not expect any speedup when doing it on devices as they already have their hands full.
To implement this on Gaudi2, we have got you covered: the contrastive image-text example in Optimum Habana now provides a ready-to-use media pipeline that you can use with COCO-like datasets that contain text and images! You will just have to add --mediapipe_dataloader to your command to use it.
For interested readers, a lower-level overview is given in the documentation of Gaudi here and the list of all supported operators is available there.
We are now going to re-run the previous experiments, adding the mediapipe_dataloader argument since it is compatible with dataloader_num_workers:
| Device | dataloader_num_workers=0 | dataloader_num_workers=2 | dataloader_num_workers=2 + mediapipe_dataloader |
|---|---|---|---|
| Gaudi2 HPU | 601.5 samples/s | 768.7 samples/s | 847.7 samples/s |
| H100 GPU | 336.5 samples/s | 602.1 samples/s | / |
| A100 GPU | 227.5 samples/s | 345.4 samples/s | / |
We got an additional x1.10 speedup compared to the previous run with dataloader_num_workers=2 only.
This final run is thus x1.41 faster than our base run on Gaudi2, simply by adding 2 ready-to-use training arguments. It is also x1.41 faster than H100 and x2.45 faster than A100 with dataloader_num_workers=2!
Reproducing this benchmark
To reproduce this benchmark, you first need to get access to Gaudi2 through the Intel Developer Cloud (see this guide for more information).
Then, you need to install the latest version of Optimum Habana and run run_bridgetower.py, which you can find here. Here is how to do it:
```bash
pip install optimum[habana]
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana/examples/contrastive-image-text
pip install -r requirements.txt
```
The base command line to run the script is:
```bash
python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \
  --output_dir /tmp/bridgetower-test \
  --model_name_or_path BridgeTower/bridgetower-large-itm-mlm-itc \
  --dataset_name jmhessel/newyorker_caption_contest --dataset_config_name matching \
  --dataset_revision 3c6c4f6c0ff7e902833d3afa5f8f3875c2b036e6 \
  --image_column image --caption_column image_description \
  --remove_unused_columns=False \
  --do_train --do_eval --do_predict \
  --per_device_train_batch_size="40" --per_device_eval_batch_size="16" \
  --num_train_epochs 5 \
  --learning_rate="1e-5" \
  --push_to_hub --report_to tensorboard --hub_model_id bridgetower \
  --overwrite_output_dir \
  --use_habana --use_lazy_mode --use_hpu_graphs_for_inference --gaudi_config_name Habana/clip \
  --throughput_warmup_steps 3 \
  --logging_steps 10
```
which corresponds to the case --dataloader_num_workers 0. You can then add --dataloader_num_workers N and --mediapipe_dataloader to test other configurations.
To push your model and Tensorboard logs to the Hugging Face Hub, you will have to log in to your account beforehand with:
```bash
huggingface-cli login
```
For A100 and H100, you can use the same run_bridgetower.py script with a few small changes, as sketched after this list:
- Replace `GaudiTrainer` and `GaudiTrainingArguments` with `Trainer` and `TrainingArguments` from Transformers
- Remove references to `GaudiConfig`, `gaudi_config` and `HabanaDataloaderTrainer`
- Import `set_seed` directly from Transformers: `from transformers import set_seed`
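Concretely, the import substitution looks like this (the commented Gaudi-side imports are assumptions about the script's current form):

```python
# Before (Gaudi / Optimum Habana), assumed form:
# from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
# from optimum.habana.utils import set_seed

# After (A100/H100, plain Transformers):
from transformers import Trainer, TrainingArguments, set_seed
```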
The results displayed in this benchmark were obtained with a Nvidia H100 Lambda instance and a Nvidia A100 80GB GCP instance, each with 8 devices, using Nvidia's Docker images.
Note that --mediapipe_dataloader is compatible with Gaudi2 only and will not work with A100/H100.
Regarding fp8 results on H100 using Transformer Engine, they are not available because the code crashes and would require modifying the modeling of BridgeTower in Transformers. We will revisit this comparison when fp8 is supported on Gaudi2.
Conclusion
When dealing with images, we presented two solutions to speed up your training workflows: allocating more resources to the dataloader, and decoding and augmenting images directly on accelerator devices rather than on CPU.
We showed that this leads to dramatic speedups when training a SOTA vision-language model like BridgeTower: Habana Gaudi2 with Optimum Habana is about x1.4 faster than Nvidia H100 and x2.5 faster than Nvidia A100 80GB with Transformers!
And this is really easy to use as you only need to provide a few additional training arguments.
To go further, we’re looking forward to using HPU graphs for training models even faster and to presenting the way to use DeepSpeed ZeRO-3 on Gaudi2 to speed up the training of your LLMs. Stay tuned!
If you are interested in accelerating your Machine Learning training and inference workflows using the latest AI hardware accelerators and software libraries, check out our Expert Acceleration Program. To learn more about Habana solutions, read about our partnership and contact them here. To learn more about Hugging Face efforts to make AI hardware accelerators easy to use, check out our Hardware Partner Program.
