Introducing Whisper

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

Other existing approaches frequently use smaller, more closely paired audio-text training datasets,[^reference-1][^reference-2][^reference-3] or use broad but unsupervised audio pretraining.[^reference-4][^reference-5][^reference-6] Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.
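
Word error rate (WER) is the standard metric behind comparisons like the 50% figure above. As a minimal sketch of how such errors are counted, the snippet below uses the third-party `jiwer` package; the transcripts are hypothetical examples, not data from our evaluation.

```python
# pip install jiwer
from jiwer import wer

# Hypothetical reference transcript and two model outputs; a real
# evaluation averages WER over every utterance in a dataset.
reference = "the quick brown fox jumps over the lazy dog"
hyp_specialized = "the quick brown fox jumps over a lazy dog"  # one substitution
hyp_zero_shot = "the quick brown fox jumps over the lazy dog"  # exact match

# WER = (substitutions + insertions + deletions) / words in reference
print(f"specialized: {wer(reference, hyp_specialized):.2%}")  # 11.11%
print(f"zero-shot:   {wer(reference, hyp_zero_shot):.2%}")    # 0.00%
```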

A third of Whisper’s audio dataset is non-English, and the model is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation, and it outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.
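
Since one model handles both tasks, switching between them is a matter of the requested decoding task. Here is a minimal sketch using the open-source `whisper` Python package; the checkpoint name and the input file `speech_fr.mp3` are illustrative choices, not fixed requirements.

```python
# pip install openai-whisper
import whisper

# Load one of the released checkpoints; "base" trades accuracy for speed.
model = whisper.load_model("base")

# Task 1: transcribe in the original language (auto-detected here).
transcription = model.transcribe("speech_fr.mp3")
print(transcription["language"], transcription["text"])

# Task 2: the same model, asked to translate the speech into English.
translation = model.transcribe("speech_fr.mp3", task="translate")
print(translation["text"])
```

When the source language is known in advance, passing `language="fr"` to `transcribe` skips the detection step.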
