Today we’re pleased to introduce a new, blazing-fast OpenAI Whisper deployment option on Inference Endpoints. It delivers up to 8x better performance compared with the previous version and puts everyone one click away from deploying dedicated, powerful transcription models in a cost-effective way, leveraging the amazing work done by the AI community.
With this release, we would like to make Inference Endpoints more community-centric and allow anyone to come and contribute to creating incredible inference deployments on the Hugging Face Platform. Together with the community, we want to propose optimized deployments for a wide range of tasks, using awesome, readily available open-source technologies.
The unique position of Hugging Face, at the heart of the open-source AI community and working hand in hand with individuals, institutions, and industrial partners, makes it the most heterogeneous platform when it comes to deploying AI models for inference on a wide variety of hardware and software.
Inference Stack
The new Whisper endpoint leverages amazing open-source community projects. Inference is powered by the vLLM project, which provides efficient ways of running AI models on various hardware families, especially, but not limited to, NVIDIA GPUs. We use vLLM's implementation of OpenAI's Whisper model, which allows us to enable further, lower-level optimizations down the software stack.
In this initial release, we are targeting NVIDIA GPUs with compute capability 8.9 or higher (Ada Lovelace), such as the L4 and L40S, which unlocks a wide range of software optimizations:
- PyTorch compilation (torch.compile)
- CUDA graphs
- float8 KV cache
Compilation with torch.compile generates optimized kernels in a just-in-time (JIT) fashion; the compiler can modify the computational graph, reorder operations, call specialized methods, and more.
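For illustration, here is a minimal, self-contained sketch of what torch.compile does for a toy module; the shapes loosely mirror Whisper's encoder width, but this is not the endpoint's actual code.

import torch

# Toy block standing in for a model layer: torch.compile JIT-compiles it into
# optimized (possibly fused) kernels the first time it sees a given input shape.
class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(1280, 1280)

    def forward(self, x):
        return torch.nn.functional.gelu(self.proj(x))

block = Block().eval()
compiled_block = torch.compile(block)

x = torch.randn(1, 1500, 1280)
with torch.no_grad():
    y = compiled_block(x)  # first call triggers compilation; subsequent calls reuse the kernels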
CUDA graphs record the flow of sequential operations, or kernels, executed on the GPU and attempt to group them into larger chunks of work to launch at once. This grouping reduces data movement, synchronization, and GPU scheduling overhead by executing a single, much larger work unit rather than many smaller ones.
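The same idea can be sketched with PyTorch's CUDA graph API: capture a fixed sequence of kernels once, then replay it with new input data. This is a simplified sketch assuming an NVIDIA GPU and static shapes, not the code running in the endpoint.

import torch

assert torch.cuda.is_available()  # CUDA graphs require an NVIDIA GPU

model = torch.nn.Sequential(
    torch.nn.Linear(1280, 1280), torch.nn.GELU(), torch.nn.Linear(1280, 1280)
).cuda().eval()
static_input = torch.randn(1, 1500, 1280, device="cuda")

# Warm up on a side stream so one-time initializations are not captured in the graph
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Record the whole sequence of kernels as a single replayable graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy fresh data into the captured input buffer and launch everything at once
static_input.copy_(torch.randn(1, 1500, 1280, device="cuda"))
graph.replay()
print(static_output.shape)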
Last but not least, we dynamically quantize activations to reduce the memory footprint of the KV cache(s). Computations are done in half precision, bfloat16 in this case, and the outputs are stored in reduced precision (1 byte for float8 versus 2 bytes for bfloat16), which lets us keep more elements in the KV cache and increases the cache hit rate.
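In vLLM, these choices map to engine arguments. Here is a hedged sketch of the corresponding configuration; the argument names exist in recent vLLM releases, but the exact values behind the endpoint may differ.

from vllm import LLM

# Hedged sketch: run Whisper through vLLM with bfloat16 compute and a float8 KV cache.
# Values are illustrative and may not match the endpoint's production configuration.
llm = LLM(
    model="openai/whisper-large-v3",
    dtype="bfloat16",      # computations in half precision
    kv_cache_dtype="fp8",  # KV cache entries stored in 1 byte instead of 2
)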
There are many ways to keep pushing this further, and we are gearing up to work hand in hand with the community to improve it!
Benchmarks
Whisper Large V3 shows a nearly 8x improvement in RTFx, enabling much faster inference with no loss in transcription quality.
We evaluated the transcription quality and runtime efficiency of several Whisper-based models (Whisper Large V3, Whisper Large V3-Turbo, and Distil-Whisper Large V3.5) and compared them against their implementations in the Transformers library, to assess both accuracy and decoding speed under identical conditions.
We computed Word Error Rate (WER) across 8 standard datasets from the Open ASR Leaderboard: AMI, GigaSpeech, LibriSpeech (Clean and Other), SPGISpeech, Tedlium, VoxPopuli, and Earnings22. These datasets span diverse domains and recording conditions, ensuring a robust evaluation of generalization and real-world transcription quality. WER measures transcription accuracy by calculating the percentage of words that are incorrectly predicted (via insertions, deletions, or substitutions); lower WER indicates better performance. All three Whisper variants maintain WER performance comparable to their Transformers baselines.
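As a concrete reference, WER can be computed with the evaluate library; the strings below are a toy example, not part of our evaluation harness.

import evaluate

# WER = (substitutions + insertions + deletions) / number of reference words
wer_metric = evaluate.load("wer")

references = ["the cat sat on the mat"]
predictions = ["the cat sit on mat"]  # one substitution ("sit") and one deletion ("the")

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2%}")  # 2 errors over 6 reference words, roughly 33%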

To evaluate inference efficiency, we sampled from the rev16 long-form dataset, which contains audio segments over 45 minutes in length, representative of real transcription workloads such as meetings, podcasts, or interviews. We measured the Real-Time Factor (RTFx), defined as the ratio of audio duration to transcription time, and averaged it across samples. All models were evaluated in bfloat16 on a single L4 GPU, using consistent decoding settings (language, beam size, and batch size).
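Expressed in code, RTFx is just audio seconds divided by wall-clock transcription seconds; the transcribe callable below is a placeholder for whatever client code you use.

import time

def rtfx(audio_duration_s: float, transcribe) -> float:
    """Real-Time Factor: seconds of audio processed per second of compute."""
    start = time.perf_counter()
    transcribe()  # placeholder: run the actual transcription request here
    elapsed = time.perf_counter() - start
    return audio_duration_s / elapsed

# Example: a 45-minute (2700 s) recording transcribed in 20 s gives RTFx = 135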

How to deploy
You can deploy your own ASR inference pipeline via Hugging Face Inference Endpoints. Inference Endpoints lets anyone deploy AI models into production-ready environments by filling in just a few parameters.
It also features the most complete fleet of AI hardware available on the market, to suit your cost and performance needs.
All of this, directly from where the AI community is being built.
To get started, simply select the model you want to deploy:
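If you would rather script the deployment than use the UI, the huggingface_hub client exposes create_inference_endpoint. This is a hedged sketch: the instance, framework, and type values below are illustrative placeholders to adapt to your account, not the exact configuration behind the one-click catalog entry.

from huggingface_hub import create_inference_endpoint

# Hedged sketch: parameter values are illustrative placeholders, not the
# exact configuration used by the catalog's one-click deployment.
endpoint = create_inference_endpoint(
    name="whisper-large-v3-demo",
    repository="openai/whisper-large-v3",
    framework="pytorch",
    task="automatic-speech-recognition",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-l4",
    type="protected",
)
endpoint.wait()  # block until the endpoint is up
print(endpoint.url)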
Inference
Running inference against the deployed endpoint can be done in just a few lines of Python; you can also use the same structure in JavaScript or any other language you are comfortable with.
Here’s a small snippet to quickly test the deployed checkpoint.
import requests

ENDPOINT_URL = "https://.cloud/api/v1/audio/transcriptions"  # replace with your endpoint URL
HF_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # replace with your Hugging Face token
AUDIO_FILE = "sample.wav"  # local audio file to transcribe

# Authenticate with the Hugging Face token
headers = {"Authorization": f"Bearer {HF_TOKEN}"}

# Send the audio file as multipart/form-data and request a transcription
with open(AUDIO_FILE, "rb") as f:
    files = {"file": f.read()}
    response = requests.post(ENDPOINT_URL, headers=headers, files=files)

response.raise_for_status()
print("Transcript:", response.json()["text"])
FastRTC Demo
With this blazing-fast endpoint, it’s possible to build real-time transcription apps. Try out this example built with FastRTC: simply speak into your microphone and see your speech transcribed in real time!
Spaces can easily be duplicated, so please feel free to duplicate away. Everything above is made available for community use on the Hugging Face Hub, in our dedicated HF Endpoints organization. Open issues, suggest use cases, and contribute here: hfendpoints-images (Inference Endpoints Images) 🚀
