Voxtral transcribes at the speed of sound.


Today, we’re releasing Voxtral Transcribe 2, two next-generation speech-to-text models with state-of-the-art transcription quality, diarization, and ultra-low latency. The family includes Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. Voxtral Realtime is open-weights under the Apache 2.0 license.

We’re also launching an audio playground in Mistral Studio to test transcription instantly, powered by Voxtral Transcribe 2, with diarization and timestamps.

Highlights.

  • Voxtral Mini Transcribe V2: State-of-the-art transcription with speaker diarization, context biasing, and word-level timestamps in 13 languages.

  • Voxtral Realtime: Purpose-built for live transcription with latency configurable down to sub-200ms, enabling voice agents and real-time applications.

  • Best-in-class efficiency: Industry-leading accuracy at a fraction of the cost, with Voxtral Mini Transcribe V2 achieving the lowest word error rate at the lowest price point.

  • Open weights: Voxtral Realtime ships under Apache 2.0, deployable on edge for privacy-first applications.

Voxtral Realtime.

Voxtral Realtime is purpose-built for applications where latency matters. Unlike approaches that adapt offline models by processing audio in chunks, Realtime uses a novel streaming architecture that transcribes audio as it arrives. The model delivers transcriptions with delay configurable down to sub-200ms, unlocking a new class of voice-first applications.

Figure: Word error rate (lower is better) across languages in the FLEURS transcription benchmark.

At 2.4 seconds delay, ideal for subtitling, Realtime matches Voxtral Mini Transcribe V2, our latest batch model. At 480ms delay, it stays within 1-2% word error rate, enabling voice agents with near-offline accuracy.

The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.

We’re releasing the model weights under Apache 2.0 on the Hugging Face Hub.

Voxtral Mini Transcribe V2.

Figure: Average diarization error rate (lower is better), plotted against price per minute, across five English benchmarks (Switchboard, CallHome, AMI-IHM, AMI-SDM, SBCSAE) and the TalkBank multilingual benchmark (German, Spanish, English, Chinese, Japanese).

Figure: Average word error rate (lower is better), plotted against price per minute, across the top 10 languages in the FLEURS transcription benchmark.

Voxtral Mini Transcribe V2 delivers significant improvements in transcription and diarization quality across languages and domains. At roughly 4% word error rate on FLEURS and $0.003/min, Voxtral offers one of the best price-performance ratios of any transcription API. It outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova on accuracy, and processes audio roughly 3x faster than ElevenLabs’ Scribe v2 while matching its quality at one-fifth the cost.

Enterprise-ready features.

Voxtral Mini Transcribe V2 introduces key capabilities for enterprise deployments.

Speaker diarization.

Generate transcriptions with speaker labels and precise start/end times. Ideal for meeting transcription, interview evaluation, and multi-party call processing. Note: with overlapping speech, the model typically transcribes one speaker.
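
As a rough illustration of how diarized output might be consumed for meeting notes, the sketch below merges consecutive segments from the same speaker into a single turn. The `Segment` shape (speaker label, start/end in seconds, text) is a hypothetical stand-in for whatever the API returns, not its documented schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical diarized segment; not the documented API schema."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the open turn instead of starting a new one.
            turns[-1] = Segment(last.speaker, last.start,
                                max(last.end, seg.end),
                                f"{last.text} {seg.text}")
        else:
            turns.append(seg)
    return turns


def format_turn(turn: Segment) -> str:
    """Render a turn as '[start-end] speaker: text'."""
    return f"[{turn.start:.1f}-{turn.end:.1f}s] {turn.speaker}: {turn.text}"
```

Because overlapping speech is typically transcribed for one speaker only, a turn list built this way should be read as a best-effort attribution rather than a complete record of who spoke.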

Context biasing.

Provide up to 100 words or phrases to guide the model toward correct spellings of names, technical terms, or domain-specific vocabulary. Particularly useful for proper nouns or industry terminology that standard models often miss. Context biasing is optimized for English; support for other languages is experimental.
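
Since the bias list is capped at 100 entries, it can help to clean it before sending. The helper below is a minimal sketch of that step; `prepare_bias_terms` is a hypothetical name, and how the terms are then passed to the API is deliberately left out because the announcement does not specify it.

```python
MAX_BIAS_TERMS = 100  # limit stated in the announcement


def prepare_bias_terms(terms: list[str]) -> list[str]:
    """Collapse whitespace, drop empties and exact duplicates, enforce the limit."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in terms:
        normalized = " ".join(term.split())
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    if len(cleaned) > MAX_BIAS_TERMS:
        raise ValueError(
            f"{len(cleaned)} terms given, at most {MAX_BIAS_TERMS} allowed")
    return cleaned
```

Duplicates are matched exactly rather than case-insensitively, on the assumption that casing can matter when the goal is correct spelling.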

Word-level timestamps.

Generate precise start and end timestamps for every word, enabling applications like subtitle generation, audio search, and content alignment.
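
To make the subtitle use case concrete, here is a small sketch that turns word-level timestamps into SRT cues. The `Word` shape is an assumption for illustration, and the fixed seven-words-per-cue grouping is a simplification; a production subtitler would also break on pauses and punctuation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word-level entry: text plus start/end in seconds."""
    text: str
    start: float
    end: float


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into fixed-size cues and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        index = i // max_words + 1
        line = " ".join(w.text for w in chunk)
        # Each cue spans from its first word's start to its last word's end.
        cues.append(f"{index}\n{srt_time(chunk[0].start)} --> "
                    f"{srt_time(chunk[-1].end)}\n{line}\n")
    return "\n".join(cues)
```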

Expanded language support.

Like Realtime, this model now supports 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. Non-English performance significantly outpaces competitors.

Noise robustness.

Maintains transcription accuracy in challenging acoustic environments, such as factory floors, busy call centers, and field recordings.

Longer audio support.

Process recordings up to 3 hours long in a single request.
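
For recordings longer than the 3-hour limit, one option is to split the audio into several requests. The sketch below only plans the time windows; `plan_requests` is a hypothetical helper, and it cuts at hard boundaries, whereas a real splitter would probably prefer to cut at silences so that words are not split across requests.

```python
MAX_REQUEST_SECONDS = 3 * 60 * 60  # 3-hour per-request limit from the announcement


def plan_requests(
    total_seconds: float,
    max_seconds: float = MAX_REQUEST_SECONDS,
) -> list[tuple[float, float]]:
    """Return (start, end) windows covering the recording, each within the limit."""
    if total_seconds <= 0:
        return []
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        windows.append((start, end))
        start = end
    return windows
```

A 7-hour recording, for example, yields three windows: two of 3 hours and one of 1 hour.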

Figure: Word error rate (lower is better) across languages in the FLEURS transcription benchmark.

Audio playground.

Test Voxtral Transcribe 2 directly in Mistral Studio. Upload up to 10 audio files, toggle diarization, select timestamp granularity, and add context bias terms for domain-specific vocabulary. Supports .mp3, .wav, .m4a, .flac, and .ogg files up to 1GB each.
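
The playground limits above are easy to check locally before uploading. The following is a minimal sketch, assuming "1GB" means 10^9 bytes (the announcement does not say which definition applies); `check_batch` is a hypothetical helper, not part of any Mistral tooling.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
MAX_FILES = 10
MAX_BYTES = 10**9  # "1GB" read as 10^9 bytes; this reading is an assumption


def check_batch(files: dict[str, int]) -> list[str]:
    """Return problems for a {filename: size_in_bytes} batch; empty means OK."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size in files.items():
        ext = Path(name).suffix.lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {ext or '(none)'}")
        if size > MAX_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the 1GB limit")
    return problems
```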

Transforming voice applications.

Voxtral powers voice workflows in diverse applications and industries.

  • Meeting intelligence.

    Transcribe multilingual recordings with speaker diarization that clearly attributes who said what and when. At Voxtral’s price point, annotate large volumes of meeting content at industry-leading cost efficiency.

  • Voice agents and virtual assistants.

    Build conversational AI with sub-200ms transcription latency. Connect Voxtral Realtime to your LLM and TTS pipeline for responsive voice interfaces that feel natural.

  • Contact center automation.

    Transcribe calls in real time, enabling AI systems to analyze sentiment, suggest responses, and populate CRM fields while conversations are still happening. Speaker diarization ensures clear attribution between agents and customers.

  • Media and broadcast.

    Generate live multilingual subtitles with minimal latency. Context biasing handles proper nouns and technical terminology that trip up generic transcription services.

  • Compliance and documentation.

    Monitor and transcribe interactions for regulatory compliance, with diarization providing clear speaker attribution and timestamps enabling precise audit trails.

Both models support GDPR and HIPAA-compliant deployments through secure on-premise or private cloud setups.

Get started.

Voxtral Mini Transcribe V2 is available now via API at $0.003 per minute. Try it now in the new Mistral Studio audio playground or in Le Chat.

Voxtral Realtime is available via API at $0.006 per minute and as open weights on Hugging Face.
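
To put the per-minute prices in perspective, the sketch below estimates API cost for a given volume of audio. The dictionary keys are illustrative labels rather than official API model identifiers, and the calculation assumes simple linear per-minute pricing with no rounding or minimum charge, which the announcement does not confirm.

```python
from decimal import Decimal

# Per-minute API prices quoted in the announcement, in US dollars.
# Keys are illustrative labels, not official model identifiers.
PRICE_PER_MINUTE = {
    "voxtral-mini-transcribe-v2": Decimal("0.003"),
    "voxtral-realtime": Decimal("0.006"),
}


def estimate_cost(model: str, minutes: float) -> Decimal:
    """Estimate cost in USD, assuming linear per-minute billing."""
    return PRICE_PER_MINUTE[model] * Decimal(str(minutes))


# 1,000 hours of batch transcription is 60,000 minutes.
batch_cost = estimate_cost("voxtral-mini-transcribe-v2", 1000 * 60)
```

Under these assumptions, 1,000 hours of batch audio comes to $180, and the same volume through Voxtral Realtime to $360.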

Explore documentation on Mistral’s audio and transcription capabilities.

We’re hiring.

If you’re excited about building world-class speech AI and putting frontier models into the hands of developers everywhere, we’d love to hear from you. Apply to join our team.


