Quantizing OpenAI’s Whisper with the Huggingface Optimum Library → >30% Faster Inference, 64% Lower Memory
tl;dr
Introduction
Step 1: Install requirements
Step 2: Quantize the model
Step 3: Compare original and quantized model
Results

tl;dr

Save more than 30% inference time and 64% memory when transcribing audio with OpenAI's Whisper model by running the code below.

Get in contact with us if you are interested in learning more.

Figure: Memory savings achieved by quantizing whisper-large-v2.

Introduction

With all of the large pre-trained models being applicable to a broad range of tasks and data, at first sight it often looks like there is no need to build your own model anymore. However, in practice, this often turns out to be an illusion!

Concretely, you can quickly face several issues such as:

  1. The model has high resource requirements, requiring expensive hardware, such as GPU instances, to run fast enough.
  2. The model has a latency that is simply too high for your task.
  3. The model is only accessible as an external service, and you are at the model provider's mercy in terms of availability and parameters such as pricing (e.g., the ChatGPT API).

Depending on your use case and domain, the importance of these points will vary. For instance, suppose you are working in an environment of microcontroller-based systems. In that case, you will certainly feel the pain from the start. In contrast, in other domains, it is more about giving the user a more responsive experience or keeping costs down when scaling your service.

So what are some concrete ways to deal with these issues?

  1. Training a completely new model for your narrower domain
  2. Quantization (see the short sketch after this list)
  3. Pruning
  4. Low-rank approximation
  5. Knowledge distillation / Teacher-Student approaches
  6. Hardware-specific optimizations
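
To make the quantization option a bit more concrete: the idea is to replace 32-bit floating-point weights (and optionally activations) with lower-precision representations such as 8-bit integers, trading a small amount of accuracy for lower memory use and faster CPU inference. Here is a minimal, generic sketch using plain PyTorch dynamic quantization on a toy model (just an illustration of the idea, not the Whisper setup shown later; the model and variable names are made up for this example):

import torch
import torch.nn as nn

# Toy float32 model standing in for a real network
model_fp32 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Dynamic quantization: weights of the listed layer types are stored as int8,
# activations are quantized on the fly during inference
model_int8 = torch.quantization.quantize_dynamic(model_fp32, {nn.Linear}, dtype=torch.qint8)

# Both models expose the same interface; the quantized one has smaller weights
x = torch.randn(1, 512)
print(model_fp32(x).shape, model_int8(x).shape)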

As you can tell, these techniques differ fundamentally in several dimensions, e.g., whether they require training a model from scratch or whether they can be applied to an existing model without any additional training.

And trust me when I tell you that it is worth thinking twice about which technique you pick and how much it adds to the complexity of your overall pipeline! I recently thought it would be a good idea to improve OpenAI's Whisper model via a knowledge distillation technique, and setting up the training added a lot to the overall complexity. You will not only have to write the training loop but also potentially onboard additional tooling and infrastructure around it.

Below I set up a quick example of how to optimize the large version of OpenAI's Whisper model (Huggingface Model Hub) by exporting it to the ONNX format and running a quantized version of it, leveraging the features of Huggingface's Optimum library.

Step 1: Install requirements

!pip install -U optimum[exporters,onnxruntime] transformers torch

Step 2: Quantize the model

from pathlib import Path
from optimum.onnxruntime import (
    AutoQuantizationConfig,
    ORTModelForSpeechSeq2Seq,
    ORTQuantizer,
)

# Configure base model and save directory for compressed model
model_id = "openai/whisper-large-v2"
save_dir = "whisper-large"

# Export model to ONNX
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
model_dir = model.model_save_dir

# Run quantization for all ONNX files of exported model
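# (the export typically produces separate encoder / decoder / decoder-with-past ONNX files, hence the loop over all of them)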
onnx_models = list(Path(model_dir).glob("*.onnx"))
print(onnx_models)
quantizers = [ORTQuantizer.from_pretrained(model_dir, file_name=onnx_model.name) for onnx_model in onnx_models]

# Dynamic (weight-only) int8 quantization config for CPUs with AVX512-VNNI support;
# AutoQuantizationConfig also offers variants such as avx2 or arm64 for other CPUs
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

for quantizer in quantizers:
    # Apply dynamic quantization and save the resulting model
    quantizer.quantize(save_dir=save_dir, quantization_config=qconfig)
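
As a quick sanity check of what the quantization does, you can compare the on-disk size of the exported ONNX files with their quantized counterparts. This is a small sketch continuing directly from the snippet above (reusing model_dir and save_dir; the helper onnx_size_mb is made up for this example, and file size is only a rough proxy for runtime memory usage):

from pathlib import Path

def onnx_size_mb(directory):
    # Sum the size of all ONNX files (including possible external data files) in a directory
    return sum(f.stat().st_size for f in Path(directory).glob("*.onnx*")) / 1e6

print(f"Exported ONNX size:  {onnx_size_mb(model_dir):.0f} MB")
print(f"Quantized ONNX size: {onnx_size_mb(save_dir):.0f} MB")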

Step 3: Compare original and quantized model

from datetime import datetime
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import pipeline, AutoTokenizer, AutoFeatureExtractor

# Number of inferences for comparing timings
num_inferences = 4
save_dir = "whisper-large"
inference_file = "test2.wav"

# Create pipeline based on quantized ONNX model
model = ORTModelForSpeechSeq2Seq.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
feature_extractor = AutoFeatureExtractor.from_pretrained(save_dir)
cls_pipeline_onnx = pipeline("automatic-speech-recognition", model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)

# Create pipeline with original model as baseline
cls_pipeline_original = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")

# Measure inference of quantized model
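# (note: no warm-up run is performed, so the first inference may include one-off initialization overhead)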
start_quantized = datetime.now()
for i in range(num_inferences):
    cls_pipeline_onnx(inference_file)
end_quantized = datetime.now()

# Measure inference of original model
start_original = datetime.now()
for i in range(num_inferences):
    cls_pipeline_original(inference_file)
end_original = datetime.now()

original_inference_time = (end_original - start_original).total_seconds() / num_inferences
print(f"Original inference time: {original_inference_time}")

quantized_inference_time = (end_quantized - start_quantized).total_seconds() / num_inferences
print(f"Quantized inference time: {quantized_inference_time}")

Results

When running the quantized model on my machine (on CPU), it needs around 64% less memory and is more than 30% faster, while delivering comparable transcription results.

Note that this is based on naively running inference on a single sample without any batching or similar optimizations. Nevertheless, I think it is already impressive considering that, for the 30% time and 64% memory savings, you did not have to do much on your own and could avoid setting up any training loops or similar.
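
If you want to reproduce the memory numbers on your own machine, one rough but simple approach is to run each pipeline in a fresh process and record the increase in resident set size, e.g. with psutil. This is a minimal sketch under the assumption that psutil is installed (`pip install psutil`) and that RSS is an acceptable approximation of memory usage:

import os
import psutil
from transformers import pipeline

def current_rss_mb():
    # Resident set size of this process in MB (a rough proxy for memory usage)
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

before = current_rss_mb()
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")
pipe("test2.wav")
after = current_rss_mb()
print(f"Approximate memory used: {after - before:.0f} MB")

For the quantized variant, build the pipeline from the quantized model as in step 3 (in a separate, fresh process) and compare the two numbers.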

If you want to learn more about these topics, consider having a look at our workshops, which you can find at https://renumics.com/solutions/workshop-dcai/. Also, of course, feel free to get in contact anytime for a quick exchange.
