🤗 Transformers provides many of the latest state-of-the-art (SoTA) models across domains and tasks. To get the best performance from these models, they need to be optimized for inference speed and memory usage.
The 🤗 Hugging Face ecosystem offers precisely such ready-and-easy-to-use optimization tools that can be applied across the board to all the models in the library. This makes it easy to reduce memory footprint and improve inference with just a few extra lines of code.
In this hands-on tutorial, I’ll demonstrate how you can optimize Bark, a Text-To-Speech (TTS) model supported by 🤗 Transformers, using three simple optimizations. These optimizations rely solely on the Transformers, Optimum and Accelerate libraries from the 🤗 ecosystem.
This tutorial is also a demonstration of how one can benchmark a non-optimized model and its various optimizations.
For a more streamlined version of the tutorial, with fewer explanations but all the code, see the accompanying Google Colab.
This blog post is organized as follows:
Table of Contents
- A reminder of Bark architecture
- An overview of various optimization techniques and their benefits
- A presentation of benchmark results
Bark Architecture
Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark. It is capable of generating a wide range of audio outputs, including speech, music, background noise, and simple sound effects. Additionally, it can produce nonverbal communication sounds such as laughter, sighs, and sobs.
Bark has been available in 🤗 Transformers since v4.31.0!
You can play around with Bark and discover its abilities here.
Bark is made up of 4 main models:
- `BarkSemanticModel` (also referred to as the ‘text’ model): a causal auto-regressive transformer model that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
- `BarkCoarseModel` (also referred to as the ‘coarse acoustics’ model): a causal autoregressive transformer that takes as input the results of the `BarkSemanticModel`. It aims at predicting the first two audio codebooks necessary for EnCodec.
- `BarkFineModel` (the ‘fine acoustics’ model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebook embeddings.
- Having predicted all the codebook channels from the `EncodecModel`, Bark uses it to decode the output audio array.
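To see these four sub-models in a loaded checkpoint, you can inspect the model’s children. Below is a minimal sketch (not from the original post); the sub-module names are discovered generically via `named_children()` rather than hardcoded, since they depend on the Transformers implementation.

```python
from transformers import BarkModel

# Load the small checkpoint and list its direct sub-modules together with
# their parameter counts.
model = BarkModel.from_pretrained("suno/bark-small")

for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```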
At the time of writing, two Bark checkpoints are available, a smaller and a larger version.
Load the Model and its Processor
The pre-trained Bark small and large checkpoints can be loaded from the pre-trained weights on the Hugging Face Hub. You can change the repo id to the checkpoint size you wish to use.
We’ll default to the small checkpoint to keep things fast, but you can try the large checkpoint by using "suno/bark" instead of "suno/bark-small".
from transformers import BarkModel
model = BarkModel.from_pretrained("suno/bark-small")
Place the model on an accelerator device to get the most out of the optimization techniques:
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)
Load the processor, which will take care of tokenization and optional speaker embeddings.
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("suno/bark-small")
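As an aside, if you want to condition the generation on a specific speaker, the processor accepts an optional voice preset. A minimal sketch, assuming the `"v2/en_speaker_6"` preset name (any other available Bark preset can be substituted):

```python
# Hedged example: pass a voice preset to obtain speaker embeddings alongside
# the tokenized text. "v2/en_speaker_6" is assumed to be one of the presets
# shipped with Bark.
inputs_with_speaker = processor(
    "Hello, this is a test.",
    voice_preset="v2/en_speaker_6",
)
```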
Optimization techniques
In this section, we’ll explore how to use off-the-shelf features from the 🤗 Optimum and 🤗 Accelerate libraries to optimize the Bark model, with minimal changes to the code.
Some setup
Let’s prepare the inputs and define a function to measure the latency and GPU memory footprint of the Bark generation method.
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt).to(device)
Measuring latency and GPU memory footprint requires the use of specific CUDA methods. We define a utility function that measures both the latency and GPU memory footprint of the model at inference time. To ensure we get an accurate picture of these metrics, we average over a specified number of runs, nb_loops:
import torch
from transformers import set_seed


def measure_latency_and_memory_use(model, inputs, nb_loops=5):
    # CUDA events give accurate GPU-side timings
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    # Reset memory statistics and clear the cache before measuring
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

    start_event.record()
    for _ in range(nb_loops):
        set_seed(0)
        output = model.generate(**inputs, do_sample=True, fine_temperature=0.4, coarse_temperature=0.8)
    end_event.record()
    torch.cuda.synchronize()

    max_memory = torch.cuda.max_memory_allocated(device)
    elapsed_time = start_event.elapsed_time(end_event) * 1.0e-3

    print('Execution time:', elapsed_time / nb_loops, 'seconds')
    print('Max memory footprint', max_memory * 1e-9, 'GB')

    return output
Base case
Before incorporating any optimizations, let’s measure the baseline model’s performance and listen to a generated example. We’ll benchmark the model over five iterations and report an average of the metrics:
with torch.inference_mode():
    speech_output = measure_latency_and_memory_use(model, inputs, nb_loops=5)
Output:
Execution time: 9.3841625 seconds
Max memory footprint 1.914612224 GB
Now, listen to the output:
from IPython.display import Audio
sampling_rate = model.generation_config.sample_rate
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
The output sounds like this (download audio):
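If you are not working in a notebook, you can write the waveform to disk instead of playing it inline. A quick sketch using scipy (any WAV writer would do, and the file name is arbitrary):

```python
import scipy.io.wavfile

# Write the generated waveform to a WAV file (float samples).
scipy.io.wavfile.write(
    "bark_generation.wav",
    rate=sampling_rate,
    data=speech_output[0].cpu().numpy(),
)
```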
Important note:
Here, the number of iterations is actually quite low. To accurately measure and compare results, it should be increased to at least 100.
One of the main reasons why it’s important to increase nb_loops is that the speech lengths generated vary greatly between iterations, even with a fixed input.
One consequence of this is that the latency measured by measure_latency_and_memory_use may not accurately reflect the actual performance of the optimization techniques! The benchmark at the end of the blog post reports the results averaged over 100 iterations, which gives a true indication of the model’s performance.
1. 🤗 Better Transformer
Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. This means that certain model operations will be better optimized on the GPU and that the model will ultimately be faster.
To be more specific, most models supported by 🤗 Transformers rely on attention, which enables them to selectively focus on certain parts of the input when generating output. This allows the models to effectively handle long-range dependencies and capture complex contextual relationships in the data.
The naive attention implementation can be greatly optimized via a technique called Flash Attention, proposed by Dao et al. in 2022.
Flash Attention is a faster and more efficient algorithm for attention computations that combines traditional methods (such as tiling and recomputation) to minimize memory usage and increase speed. Unlike previous algorithms, Flash Attention reduces memory usage from quadratic to linear in sequence length, making it particularly useful for applications where memory efficiency is important.
It turns out that Flash Attention is supported by 🤗 Better Transformer out of the box! A single line of code is all it takes to export the model to 🤗 Better Transformer and enable Flash Attention:
model = model.to_bettertransformer()
with torch.inference_mode():
    speech_output = measure_latency_and_memory_use(model, inputs, nb_loops=5)
Output:
Execution time: 5.43284375 seconds
Max memory footprint 1.9151841280000002 GB
The output sounds like this (download audio):
What does it bring to the table?
There’s no performance degradation, which means you get exactly the same result as without this feature, while gaining 20% to 30% in speed! Want to know more? See this blog post.
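One practical note: the transformation targets inference. If you later want to save or share the model in its canonical 🤗 Transformers format, you should first convert it back. A short sketch, assuming the reverse_bettertransformer method available in recent Transformers releases:

```python
# Assumed API: convert back to the canonical Transformers implementation
# before saving the model. "bark-small-local" is an arbitrary output path.
model = model.reverse_bettertransformer()
model.save_pretrained("bark-small-local")
```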
2. Half-precision
Most AI models typically use a storage format called single-precision floating point, i.e. fp32. What does this mean in practice? Each number is stored using 32 bits.
You can instead choose to encode the numbers using 16 bits, with what is called half-precision floating point, i.e. fp16, and use half as much storage as before! Better still, you also get an inference speed-up!
Of course, this comes with a small performance degradation, since the operations inside the model won’t be as precise as in fp32.
You can load a 🤗 Transformers model in half-precision by simply adding torch_dtype=torch.float16 to the BarkModel.from_pretrained(...) line!
In other words:
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
with torch.inference_mode():
    speech_output = measure_latency_and_memory_use(model, inputs, nb_loops=5)
Output:
Execution time: 7.00045390625 seconds
Max memory footprint 2.7436124160000004 GB
The output sounds like this (download audio):
What does it bring to the table?
With a slight degradation in performance, you benefit from a memory footprint reduced by 50% and a speed gain of 5%.
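You can also sanity-check the storage saving directly. Here is a short sketch (not part of the original benchmark) comparing the two precisions with get_memory_footprint:

```python
import torch
from transformers import BarkModel

# Compare the parameter storage of the fp32 and fp16 versions of the model.
model_fp32 = BarkModel.from_pretrained("suno/bark-small")
model_fp16 = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16)

print(f"fp32 footprint: {model_fp32.get_memory_footprint() / 1e9:.2f} GB")
print(f"fp16 footprint: {model_fp16.get_memory_footprint() / 1e9:.2f} GB")
```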
3. CPU offload
As mentioned in the first section of this blog post, Bark comprises 4 sub-models, which are called sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.
Why is this a problem? GPU memory is precious in AI, since it’s where operations are fastest, and it’s often a bottleneck.
A simple solution is to offload sub-models from the GPU when they’re inactive. This operation is called CPU offload.
Good news: CPU offload for Bark has been integrated into 🤗 Transformers, and you can use it with just one line of code.
You only need to make sure 🤗 Accelerate is installed!
model = BarkModel.from_pretrained("suno/bark-small")
model.enable_cpu_offload()
with torch.inference_mode():
    speech_output = measure_latency_and_memory_use(model, inputs, nb_loops=5)
Output:
Execution time: 8.97633828125 seconds
Max memory footprint 1.3231160320000002 GB
The output sounds like this (download audio):
What does it bring to the table?
With a slight degradation in speed (10%), you benefit from a huge memory footprint reduction (60% 🤯).
With this feature enabled, bark-large’s footprint is now only 2GB instead of 5GB.
That’s the same memory footprint as bark-small!
Want more? With fp16 enabled, it’s even down to 1GB. We’ll see this in practice in the next section!
4. Combine
Let’s bring it all together. The good news is that you can combine optimization techniques, which means you can use CPU offload as well as half-precision and 🤗 Better Transformer!
from optimum.bettertransformer import BetterTransformer

model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
model = BetterTransformer.transform(model, keep_original_model=False)
model.enable_cpu_offload()
with torch.inference_mode():
    speech_output = measure_latency_and_memory_use(model, inputs, nb_loops=5)
Output:
Execution time: 7.4496484375000005 seconds
Max memory footprint 0.46871091200000004 GB
The output sounds like this (download audio):
What does it bring to the table?
Ultimately, you get a 23% speed-up and an enormous 80% memory saving!
Using batching
Want more?
Taken together, the three optimization techniques bring even better results when batching.
Batching means combining operations for multiple samples so that the overall time spent generating them is lower than generating the samples one by one.
Here’s a quick example of how you can use it:
text_prompt = [
    "Let's try generating speech, with Bark, a text-to-speech model",
    "Wow, batching is so great!",
    "I love Hugging Face, it's so cool.",
]

inputs = processor(text_prompt).to(device)
with torch.inference_mode():
    speech_output = model.generate(**inputs, do_sample=True, fine_temperature=0.4, coarse_temperature=0.8)
The output sounds like this (download first, second, and last audio):
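Each row of speech_output is one generated waveform, so you can listen to (or save) the samples one by one. A small sketch for a notebook:

```python
from IPython.display import Audio, display

sampling_rate = model.generation_config.sample_rate

# Play each sample of the batch in turn. Note that shorter generations may be
# padded to the length of the longest one in the batch.
for waveform in speech_output:
    display(Audio(waveform.cpu().numpy(), rate=sampling_rate))
```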
Benchmark results
As mentioned above, the little experiment we’ve carried out is more of a thought exercise and needs to be extended for a better measure of performance. One also needs to warm up the GPU with a few blank iterations before properly measuring performance.
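A warm-up can be as simple as running a few untimed generations before the measured ones. Here is a minimal sketch (an assumption about how you might do it, reusing the model and inputs defined earlier):

```python
import torch

# Run a few untimed generations so that CUDA kernels and memory pools are
# initialized before the real measurements.
with torch.inference_mode():
    for _ in range(3):
        _ = model.generate(**inputs, do_sample=True, fine_temperature=0.4, coarse_temperature=0.8)
torch.cuda.synchronize()
```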
Here are the results of a 100-sample benchmark extending the measurements, using the large version of Bark.
The benchmark was run on an NVIDIA TITAN RTX 24GB with a maximum of 256 new tokens.
How to read the results?
Latency
It measures the duration of a single call to the generation method, regardless of batch size.
In other words, it’s equal to `elapsed_time / nb_loops`.
A lower latency is preferred.
Maximum memory footprint
It measures the maximum memory used during a single call to the generation method.
A lower footprint is preferred.
Throughput
It measures the number of samples generated per second. This time, the batch size is taken into account.
In other words, it’s equal to `nb_loops * batch_size / elapsed_time`.
A higher throughput is preferred.
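To make these two definitions concrete, here is a small helper (a sketch, using the same variable names as measure_latency_and_memory_use above):

```python
def latency(elapsed_time: float, nb_loops: int) -> float:
    # Average duration of a single call to generate(), regardless of batch size.
    return elapsed_time / nb_loops


def throughput(elapsed_time: float, nb_loops: int, batch_size: int) -> float:
    # Number of samples generated per second; the batch size is taken into account.
    return nb_loops * batch_size / elapsed_time
```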
No batching
Here are the results with batch_size=1.
| Absolute values | Latency (s) | Memory |
|---|---|---|
| no optimization | 10.48 | 5025.0M |
| bettertransformer only | 7.70 | 4974.3M |
| offload + bettertransformer | 8.90 | 2040.7M |
| offload + bettertransformer + fp16 | 8.10 | 1010.4M |
| Relative value | Latency | Memory |
|---|---|---|
| no optimization | 0% | 0% |
| bettertransformer only | -27% | -1% |
| offload + bettertransformer | -15% | -59% |
| offload + bettertransformer + fp16 | -23% | -80% |
Comment
As expected, CPU offload greatly reduces memory footprint while barely increasing latency.
However, combined with bettertransformer and fp16, we get the best of both worlds, a huge latency and memory decrease!
Batch size set to 8
And here are the benchmark results with batch_size=8 and throughput measurement.
Note that bettertransformer is a free optimization: it does exactly the same operation and has the same memory footprint as the non-optimized model, while being faster. The benchmark was therefore run with this optimization enabled by default.
| Absolute values | Latency (s) | Memory | Throughput (samples/s) |
|---|---|---|---|
| base case (bettertransformer) | 19.26 | 8329.2M | 0.42 |
| + fp16 | 10.32 | 4198.8M | 0.78 |
| + offload | 20.46 | 5172.1M | 0.39 |
| + offload + fp16 | 10.91 | 2619.5M | 0.73 |
| Relative value | Latency | Memory | Throughput |
|---|---|---|---|
| base case (bettertransformer) | 0% | 0% | 0% |
| + fp16 | -46% | -50% | 87% |
| + offload | 6% | -38% | -6% |
| + offload + fp16 | -43% | -69% | 77% |
Comment
This is where we can see the potential of combining all three optimization features!
The impact of fp16 on latency is less marked at batch_size = 1, but here it is of huge interest as it can reduce latency by almost half and almost double throughput!
Concluding remarks
This blog post showcased a few simple optimization tricks bundled in the 🤗 ecosystem. Using any one of these techniques, or a combination of all three, can greatly improve Bark’s inference speed and memory footprint:
- You can use the large version of Bark without any performance degradation and with a footprint of just 2GB instead of 5GB, while running 15% faster, by using 🤗 Better Transformer and CPU offload.
- Do you prefer high throughput? Batch by 8 with 🤗 Better Transformer and half-precision.
- You can get the best of both worlds by using fp16, 🤗 Better Transformer and CPU offload!