Optimizing Bark using 🤗 Transformers

By Yoach Lacombe

🤗 Transformers provides many of the latest state-of-the-art (SoTA) models across domains and tasks. To get the best performance from these models, they need to be optimized for inference speed and memory usage.

The 🤗 Hugging Face ecosystem offers precisely such ready-to-use optimization tools that can be applied across the board to all the models in the library. This makes it easy to reduce memory footprint and improve inference with just a few extra lines of code.

In this hands-on tutorial, I'll demonstrate how you can optimize Bark, a Text-To-Speech (TTS) model supported by 🤗 Transformers, based on three simple optimizations. These optimizations rely solely on the Transformers, Optimum and Accelerate libraries from the 🤗 ecosystem.

This tutorial is also a demonstration of how to benchmark a non-optimized model and its various optimizations.

For a more streamlined version of the tutorial, with fewer explanations but all the code, see the accompanying Google Colab.

This blog post is organized as follows:



Table of Contents

  1. A reminder of Bark architecture
  2. An overview of various optimization techniques and their benefits
  3. A presentation of benchmark results



Bark Architecture

Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark. It is capable of generating a wide range of audio outputs, including speech, music, background noise, and simple sound effects. Moreover, it can produce nonverbal communication sounds such as laughter, sighs, and sobs.

Bark has been available in 🤗 Transformers since v4.31.0!

You can play around with Bark and discover its abilities here.

Bark is made of 4 main models:

  • BarkSemanticModel (also known as the ‘text’ model): a causal auto-regressive transformer model that takes as input tokenized text, and predicts semantic text tokens that capture the meaning of the text.
  • BarkCoarseModel (also known as the ‘coarse acoustics’ model): a causal autoregressive transformer that takes as input the results of the BarkSemanticModel. It aims at predicting the first two audio codebooks necessary for EnCodec.
  • BarkFineModel (the ‘fine acoustics’ model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebook embeddings.
  • having predicted all of the codebook channels, Bark uses the EncodecModel to decode the output audio array.
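
To get a feel for the relative sizes of these components, here is a minimal sketch that prints the parameter count of each sub-model. The attribute names used below (semantic, coarse_acoustics, fine_acoustics, codec_model) are an assumption based on the current 🤗 Transformers implementation of BarkModel and may change between versions:

from transformers import BarkModel

model = BarkModel.from_pretrained("suno/bark-small")

# Assumed sub-model attribute names of BarkModel (may vary across versions).
for name in ["semantic", "coarse_acoustics", "fine_acoustics", "codec_model"]:
    sub_model = getattr(model, name)
    num_params = sum(p.numel() for p in sub_model.parameters())
    print(f"{name}: {num_params / 1e6:.1f}M parameters")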

At the time of writing, two Bark checkpoints are available, a smaller and a larger version.



Load the Model and its Processor

The pre-trained Bark small and large checkpoints can be loaded from the pre-trained weights on the Hugging Face Hub. You can change the repo id to the checkpoint size you wish to use.

We’ll default to the small checkpoint to keep it fast, but you can try the large checkpoint by using "suno/bark" instead of "suno/bark-small".

from transformers import BarkModel

model = BarkModel.from_pretrained("suno/bark-small")

Place the model on an accelerator device to get the most out of the optimization techniques:

import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

Load the processor, which will take care of tokenization and optional speaker embeddings.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("suno/bark-small")
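
As a side note, the processor can optionally condition generation on a speaker via a voice preset. A minimal sketch, assuming one of the presets shipped with Bark ("v2/en_speaker_6"), looks like this:

# Optionally pass a voice preset to condition generation on a specific speaker.
inputs = processor("Hello, my name is Suno.", voice_preset="v2/en_speaker_6")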



Optimization techniques

In this section, we’ll explore how to use off-the-shelf features from the 🤗 Optimum and 🤗 Accelerate libraries to optimize the Bark model, with minimal changes to the code.



Some set-ups

Let’s prepare the inputs and define a function to measure the latency and GPU memory footprint of the Bark generation method.

text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt).to(device)

Measuring the latency and GPU memory footprint requires the use of specific CUDA methods. We define a utility function that measures both the latency and GPU memory footprint of the model at inference time. To make sure we get an accurate picture of these metrics, we average over a specified number of runs nb_loops:

import torch
from transformers import set_seed


def measure_latency_and_memory_use(model, inputs, nb_loops = 5):

  # define CUDA events to measure elapsed time on the GPU
  start_event = torch.cuda.Event(enable_timing=True)
  end_event = torch.cuda.Event(enable_timing=True)

  # reset the peak memory counter and clear the cache before measuring
  torch.cuda.reset_peak_memory_stats(device)
  torch.cuda.empty_cache()
  torch.cuda.synchronize()

  # start the timer
  start_event.record()

  # generate nb_loops times with a fixed seed for reproducibility
  for _ in range(nb_loops):
        set_seed(0)
        output = model.generate(**inputs, do_sample = True, fine_temperature = 0.4, coarse_temperature = 0.8)

  # stop the timer and wait for all pending GPU work to finish
  end_event.record()
  torch.cuda.synchronize()

  # collect the metrics
  max_memory = torch.cuda.max_memory_allocated(device)
  elapsed_time = start_event.elapsed_time(end_event) * 1.0e-3

  print('Execution time:', elapsed_time/nb_loops, 'seconds')
  print('Max memory footprint', max_memory*1e-9, ' GB')

  return output



Base case

Before incorporating any optimizations, let’s measure the performance of the baseline model and listen to a generated example. We’ll benchmark the model over five iterations and report an average of the metrics:


with torch.inference_mode():
  speech_output = measure_latency_and_memory_use(model, inputs, nb_loops = 5)

Output:

Execution time: 9.3841625 seconds
Max memory footprint 1.914612224  GB

Now, listen to the output:

from IPython.display import Audio


sampling_rate = model.generation_config.sample_rate
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The output sounds like this (download audio):
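
If you’d rather save the generated audio to disk than play it in a notebook, a minimal sketch using scipy (writing a WAV file at Bark’s native sampling rate) could look like this:

from scipy.io import wavfile

# Write the generated waveform to a WAV file at Bark's native sampling rate.
wavfile.write("bark_output.wav", rate=sampling_rate, data=speech_output[0].cpu().numpy())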



Important note:

Here, the number of iterations is actually quite low. To accurately measure and compare results, one should increase it to at least 100.

One of the main reasons for increasing nb_loops is that the generated speech lengths vary greatly between iterations, even with a fixed input.

One consequence of this is that the latency measured by measure_latency_and_memory_use may not actually reflect the real performance of the optimization techniques! The benchmark at the end of the blog post reports results averaged over 100 iterations, which gives a true indication of the model's performance.



1. 🤗 Better Transformer

Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. This means that certain model operations will be better optimized on the GPU and that the model will ultimately be faster.

To be more specific, most models supported by 🤗 Transformers rely on attention, which allows them to selectively focus on certain parts of the input when generating output. This enables the models to effectively handle long-range dependencies and capture complex contextual relationships in the data.

The naive attention implementation can be greatly optimized via a technique called Flash Attention, proposed by Dao et al. in 2022.

Flash Attention is a faster and more efficient algorithm for attention computations that combines traditional methods (such as tiling and recomputation) to minimize memory usage and increase speed. Unlike previous algorithms, Flash Attention reduces memory usage from quadratic to linear in sequence length, making it particularly useful for applications where memory efficiency is important.

It turns out that Flash Attention is supported by 🤗 Better Transformer out of the box! It requires one line of code to export the model to 🤗 Better Transformer and enable Flash Attention:

model = model.to_bettertransformer()

with torch.inference_mode():
  speech_output = measure_latency_and_memory_use(model, inputs, nb_loops = 5)

Output:

Execution time: 5.43284375 seconds
Max memory footprint 1.9151841280000002  GB

The output sounds like this (download audio):

What does it bring to the table?

There is no performance degradation, which means you get exactly the same result as without this feature, while gaining 20% to 30% in speed! Want to know more? See this blog post.



2. Half-precision

Most AI models typically use a storage format called single-precision floating point, i.e. fp32. What does this mean in practice? Each number is stored using 32 bits.

You can thus decide to encode the numbers using 16 bits, with what is called half-precision floating point, i.e. fp16, and use half as much storage as before! More than that, you also get an inference speed-up!

Of course, it also comes with a small performance degradation since operations inside the model won't be as precise as with fp32.

You can load a 🤗 Transformers model in half-precision by simply adding torch_dtype=torch.float16 to the BarkModel.from_pretrained(...) line!

In other words:

model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)

with torch.inference_mode():
  speech_output = measure_latency_and_memory_use(model, inputs, nb_loops = 5)

Output:

Execution time: 7.00045390625 seconds
Max memory footprint 2.7436124160000004  GB

The output sounds like this (download audio):

What does it bring to the table?

With a slight degradation in performance, you benefit from a memory footprint reduced by 50% and a speed gain of 5%.



3. CPU offload

As mentioned in the first section of this blog post, Bark comprises 4 sub-models, which are called sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.

Why is this an issue? GPU memory is precious in AI, since it's where operations are fastest, and it's often a bottleneck.

A simple solution is to offload sub-models from the GPU when they are inactive. This operation is called CPU offload.

Good news: CPU offload for Bark has been integrated into 🤗 Transformers and you can use it with just one line of code.

You only need to make sure 🤗 Accelerate is installed!

model = BarkModel.from_pretrained("suno/bark-small")


model.enable_cpu_offload()

with torch.inference_mode():
  speech_output = measure_latency_and_memory_use(model, inputs, nb_loops = 5)

Output:

Execution time: 8.97633828125 seconds
Max memory footprint 1.3231160320000002  GB

The output sounds like this (download audio):

What does it bring to the table?

With a slight degradation in speed (10%), you benefit from a huge memory footprint reduction (60% 🤯).

With this feature enabled, bark-large's footprint is now only 2GB instead of 5GB.
That is the same memory footprint as bark-small!

Want more? With fp16 enabled, it even goes down to 1GB. We'll see this in practice in the next section!



4. Combine

Let’s bring it all together. The good news is that you can combine optimization techniques, which means you can use CPU offload, as well as half-precision and 🤗 Better Transformer!


model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)


model = BetterTransformer.transform(model, keep_original_model=False)


model.enable_cpu_offload()

with torch.inference_mode():
  speech_output = measure_latency_and_memory_use(model, inputs, nb_loops = 5)

Output:

Execution time: 7.4496484375000005 seconds
Max memory footprint 0.46871091200000004  GB

The output sounds like this (download audio):

What does it bring to the table?

Ultimately, you get a 23% speed-up and a huge 80% memory saving!



Using batching

Want more?

Altogether, the three optimization techniques bring even better results when combined with batching.
Batching means combining the operations for multiple samples to make the overall time spent generating them lower than generating the samples one by one.

Here’s a quick example of how you can use it:

text_prompt = [
    "Let's try generating speech, with Bark, a text-to-speech model",
    "Wow, batching is so great!",
    "I love Hugging Face, it's so cool."]

inputs = processor(text_prompt).to(device)


with torch.inference_mode():
  
  speech_output = model.generate(**inputs, do_sample = True, fine_temperature = 0.4, coarse_temperature = 0.8)
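
Each sample of the batch can then be listened to individually, reusing the notebook setup from the base case; for example, for the second prompt:

from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate

# Listen to the second sample of the batch (index 1).
Audio(speech_output[1].cpu().numpy(), rate=sampling_rate)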

The output sounds like this (download the first, second, and last audio):



Benchmark results

As mentioned above, the little experiment we carried out is a thought exercise and needs to be extended for a better measure of performance. One also needs to warm up the GPU with a few blank iterations before properly measuring performance.
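
For illustration, a minimal warm-up sketch (the number of warm-up iterations is an arbitrary choice here) could look like this:

# Run a few unmeasured generations so CUDA kernels are compiled and cached
# before the actual benchmark starts.
n_warmup = 3
with torch.inference_mode():
    for _ in range(n_warmup):
        model.generate(**inputs, do_sample=True, fine_temperature=0.4, coarse_temperature=0.8)
torch.cuda.synchronize()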

Here are the results of a 100-sample benchmark extending the measurements, using the large version of Bark.

The benchmark was run on an NVIDIA TITAN RTX 24GB with a maximum of 256 new tokens.



How to read the results?



Latency

It measures the duration of a single call to the generation method, regardless of batch size.

In other words, it's equal to $\frac{elapsedTime}{nbLoops}$.

A lower latency is preferred.


Maximum memory footprint

It measures the maximum memory used during a single call to the generation method.

A lower footprint is preferred.



Throughput

It measures the number of samples generated per second. This time, the batch size is taken into account.

In other words, it's equal to $\frac{nbLoops \times batchSize}{elapsedTime}$.

A higher throughput is preferred.
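
As a quick sanity check, reading ahead to the batched results below: with batch_size = 8 and a measured latency of 10.32 seconds per call, the throughput is $\frac{8}{10.32} \approx 0.78$ samples per second, which matches the + fp16 row of the batched benchmark table.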



No batching

Here are the results with batch_size=1.

Absolute values                      | Latency (s) | Memory
no optimization                      | 10.48       | 5025.0M
bettertransformer only               | 7.70        | 4974.3M
offload + bettertransformer          | 8.90        | 2040.7M
offload + bettertransformer + fp16   | 8.10        | 1010.4M

Relative values                      | Latency | Memory
no optimization                      | 0%      | 0%
bettertransformer only               | -27%    | -1%
offload + bettertransformer          | -15%    | -59%
offload + bettertransformer + fp16   | -23%    | -80%


Comment

As expected, CPU offload greatly reduces memory footprint while barely increasing latency.

However, combined with bettertransformer and fp16, we get the best of both worlds: a huge latency and memory decrease!



Batch size set to 8

And here are the benchmark results with batch_size=8 and throughput measurement.

Note that bettertransformer is a free optimization: it performs the exact same operation and has the same memory footprint as the non-optimized model while being faster. Therefore, the benchmark was run with this optimization enabled by default.

Absolute values                  | Latency (s) | Memory  | Throughput (samples/s)
base case (bettertransformer)    | 19.26       | 8329.2M | 0.42
+ fp16                           | 10.32       | 4198.8M | 0.78
+ offload                        | 20.46       | 5172.1M | 0.39
+ offload + fp16                 | 10.91       | 2619.5M | 0.73

Relative values                  | Latency | Memory | Throughput
base case (bettertransformer)    | 0%      | 0%     | 0%
+ fp16                           | -46%    | -50%   | +87%
+ offload                        | +6%     | -38%   | -6%
+ offload + fp16                 | -43%    | -69%   | +77%


Comment

This is where we can see the potential of combining all three optimization features!

The impact of fp16 on latency is less marked at batch_size = 1, but here it is of huge interest as it can reduce latency by almost half and almost double the throughput!


Concluding remarks

This blog post showcased a few simple optimization tricks bundled in the 🤗 ecosystem. Using any one of these techniques, or a combination of all three, can greatly improve Bark's inference speed and memory footprint.

  • You can use the large version of Bark without any performance degradation and with a footprint of just 2GB instead of 5GB, 15% faster, using 🤗 Better Transformer and CPU offload.

  • Do you prefer high throughput? Batch by 8 with 🤗 Better Transformer and half-precision.

  • You can get the best of both worlds by using fp16, 🤗 Better Transformer and CPU offload!


