Trends and Insights with the New Multilingual & Long-Form Tracks




While everyone (and their grandma 👵) is spinning up new ASR models, picking the right one for your use case can feel more overwhelming than choosing your next Netflix show. As of 21 Nov 2025, there are 150 Audio-Text-to-Text and 27K ASR models on the Hub 🤯
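
If you want to check the current counts yourself, here's a minimal sketch using the huggingface_hub client (assuming the Hub pipeline tags audio-text-to-text and automatic-speech-recognition; the numbers will drift as new models land, and counting pages through the full listing, so it takes a moment):

```python
# Count Hub models per pipeline tag with huggingface_hub.
from huggingface_hub import list_models

for task in ("audio-text-to-text", "automatic-speech-recognition"):
    count = sum(1 for _ in list_models(filter=task))  # pages through the listing
    print(f"{task}: {count} models")
```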

Most benchmarks focus on short-form English transcription (<30s) and overlook other essential considerations, such as (1) multilingual performance and (2) model throughput, which can be a deciding factor for long-form audio like meetings and podcasts.

Over the past two years, the Open ASR Leaderboard has become a standard for comparing open and closed-source models on both accuracy and efficiency. Recently, multilingual and long-form transcription tracks have been added to the leaderboard 🎉



TL;DR – Open ASR Leaderboard

  • 📝 New preprint on ASR trends from the leaderboard: https://hf.co/papers/2510.06961
  • 🧠 Best accuracy: Conformer encoder + LLM decoders (open-source ftw 🥳)
  • Fastest: CTC / TDT decoders
  • 🌍 Multilingual: Comes at the cost of single-language performance
  • Long-form: Closed-source systems still lead (for now 😉)
  • 🧑‍💻 Fine-tuning guides (Parakeet, Voxtral, Whisper): to continue pushing performance

As of 21 Nov 2025, the Open ASR Leaderboard compares 60+ open and closed-source models from 18 organizations, across 11 datasets.

In a recent preprint, we dive into the technical setup and highlight some key trends in modern ASR. Here are the big takeaways 👇



1. Conformer encoder 🤝 LLM decoder tops the charts 📈


Models combining Conformer encoders with large language model (LLM) decoders currently lead in English transcription accuracy. For instance, NVIDIA's Canary-Qwen-2.5B, IBM's Granite-Speech-3.3-8B, and Microsoft's Phi-4-Multimodal-Instruct achieve the lowest word error rates (WER), showing that integrating LLM reasoning can significantly boost ASR accuracy.
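
As a quick refresher on the metric, here's how WER can be computed with the open-source jiwer package (the transcript pair below is made up for illustration):

```python
# Word error rate: (substitutions + insertions + deletions) / reference words.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions out of nine reference words -> ~22.22%
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```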

💡 Pro-tip: NVIDIA introduced Fast Conformer, a 2× faster variant of the Conformer that's used in their Canary and Parakeet suites of models.



2. Speed–accuracy tradeoffs ⚖️


While highly accurate, these LLM decoders tend to be slower than simpler approaches. On the Open ASR Leaderboard, efficiency is measured using inverse real-time factor (RTFx), where higher is better.
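
Concretely, RTFx is just seconds of audio transcribed per second of compute, as in this small sketch:

```python
def rtfx(audio_seconds: float, compute_seconds: float) -> float:
    """Inverse real-time factor: audio duration / transcription time."""
    return audio_seconds / compute_seconds

# An RTFx of 100 means an hour of audio is transcribed in 36 seconds.
print(rtfx(3600, 36))  # 100.0
```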

For even faster inference, CTC and TDT decoders deliver 10–100× faster throughput, albeit with slightly higher error rates. This makes them ideal for real-time, offline, or batch transcription tasks (such as meetings, lectures, or podcasts).
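
Part of why CTC decoding is so fast is that it can be a single greedy pass over frame-wise logits, with no autoregressive loop. A minimal sketch with toy tensors (not any particular model's decoder):

```python
import torch

def ctc_greedy_decode(logits: torch.Tensor, blank_id: int = 0) -> list[int]:
    """Greedy CTC: best token per frame, collapse repeats, drop blanks."""
    ids = logits.argmax(dim=-1).tolist()  # one pass over all frames at once
    collapsed = [cur for prev, cur in zip([None] + ids[:-1], ids) if cur != prev]
    return [tok for tok in collapsed if tok != blank_id]

frame_logits = torch.randn(50, 32)  # (time_frames, vocab_size) from a CTC head
print(ctc_greedy_decode(frame_logits))
```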



3. Multilingual 🌍


OpenAI’s Whisper Large v3 remains a strong multilingual baseline, supporting 99 languages. However, fine-tuned or distilled variants like Distil-Whisper and CrisperWhisper often outperform the original on English-only tasks, showing how targeted fine-tuning can improve specialization (how to fine-tune? Check out the guides for Whisper, Parakeet, and Voxtral).
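
For example, a distilled English checkpoint can be dropped into the standard transformers pipeline; a sketch assuming a local sample.wav file:

```python
from transformers import pipeline

# Distilled Whisper variant specialized for English transcription.
asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v3")
print(asr("sample.wav")["text"])  # hypothetical local audio file
```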

That said, specializing in English tends to reduce multilingual coverage 👉 a classic case of the tradeoff between specialization and generalization. Similarly, while self-supervised systems like Meta’s Massively Multilingual Speech (MMS) and Omnilingual ASR can support 1K+ languages, they trail behind language-specific encoders in accuracy.

⭐ While just five languages are currently benchmarked, we’re planning to expand to more, and we welcome new dataset and model contributions to multilingual ASR through GitHub pull requests.

🎯 Alongside multilingual benchmarks, several community-driven leaderboards focus on individual languages. For example, the Open Universal Arabic ASR Leaderboard compares models across Modern Standard Arabic and regional dialects, highlighting how speech variation and diglossia challenge current systems. Similarly, the Russian ASR Leaderboard provides a growing hub for evaluating encoder-decoder and CTC models on Russian-specific phonology and morphology. These localized efforts mirror the broader multilingual leaderboard’s mission to encourage dataset sharing, fine-tuned checkpoints, and transparent model comparisons, especially in languages with fewer established ASR resources.



4. Long-form transcription is a special game ⏳


For long-form audio (e.g., podcasts, lectures, meetings), closed-source systems still edge out open ones. This could be due to domain tuning, custom chunking, or production-grade optimization.

Among open models, OpenAI’s Whisper Large v3 performs best. But for throughput, CTC-based Conformers shine 👉 for example, NVIDIA’s Parakeet CTC 1.1B achieves an RTFx of 2793.75, compared to 68.56 for Whisper Large v3, with only a moderate WER degradation (6.68 and 6.43 respectively).
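
For open models, one common way to handle long-form audio is chunked inference, where the file is split into windows that are transcribed in parallel; a sketch using the transformers pipeline (podcast_episode.wav is a placeholder):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,  # split long audio into ~30s windows
    batch_size=8,       # transcribe several windows in parallel
)
print(asr("podcast_episode.wav")["text"])  # hypothetical local file
```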

The tradeoff? Parakeet is English-only, again reminding us of the tradeoff between multilinguality and specialization 🫠.

While closed systems still lead, there’s huge potential for open-source innovation here. Long-form ASR remains one of the most exciting frontiers for the community to tackle next!

Given how fast ASR is evolving, we’re excited to see which new architectures push performance and efficiency next, and how the Open ASR Leaderboard continues to serve as a transparent, community-driven benchmark for the field, and as a reference for other leaderboards (Russian, Arabic, and Speech DeepFake Detection).

We’ll keep expanding the Open ASR Leaderboard with more models, more languages, and more datasets, so stay tuned 👀

👉 Want to contribute? Head over to the GitHub repo to open a pull request 🚀


