Benchmarking Text-to-Speech Models in the Wild




Automatically measuring the quality of text-to-speech (TTS) models is very difficult. Assessing the naturalness and inflection of a voice is a trivial task for humans, but it is much harder for AI. That is why today, we're thrilled to announce the TTS Arena. Inspired by LMSys's Chatbot Arena for LLMs, we developed a tool that enables anyone to easily compare TTS models side-by-side. Just submit some text, listen to two different models speak it aloud, and vote on which model you think sounds best. The results will be organized into a leaderboard that displays the community's highest-rated models.



Motivation

The field of speech synthesis has long lacked an accurate method for measuring the quality of different models. Objective metrics like WER (word error rate) are unreliable measures of model quality, and subjective measures such as MOS (mean opinion score) are typically small-scale experiments conducted with few listeners. As a result, these measurements are generally not useful for comparing two models of roughly similar quality. To address these drawbacks, we're inviting the community to rank models in an easy-to-use interface. By opening this tool and disseminating results to the public, we aim to democratize how models are ranked and to make model comparison and selection accessible to everyone.
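To illustrate why WER falls short, here is a minimal sketch using the open-source jiwer package; the sample sentences are hypothetical. WER only counts word-level substitutions, deletions, and insertions between an ASR transcript and the input text, so a flat, robotic voice and a natural-sounding one can score identically.

```python
# Minimal WER sketch (assumes `pip install jiwer`); sentences are illustrative only.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # text sent to the TTS model
hypothesis = "the quick brown fox jumps over a lazy dog"    # ASR transcript of the synthesized audio

# One substitution over nine reference words -> WER of about 0.11,
# regardless of how natural the synthesized voice actually sounds.
print(jiwer.wer(reference, hypothesis))
```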



The TTS Arena

Human rating of AI systems is not a novel approach. Recently, LMSys applied this method in their Chatbot Arena with great results, collecting over 300,000 rankings so far. Because of its success, we adopted a similar framework for our leaderboard, inviting anyone to rank synthesized audio.

The leaderboard allows a user to enter text, which is then synthesized by two models. After listening to each sample, the user votes on which model sounds more natural. To reduce the risks of human bias and abuse, model names are revealed only after a vote is submitted.



Chosen Models

We selected several SOTA (state-of-the-art) models for our leaderboard. While most are open-source models, we also included several proprietary models so developers can compare the state of open-source development with proprietary offerings.

The models available at launch are:

  • ElevenLabs (proprietary)
  • MetaVoice
  • OpenVoice
  • Pheme
  • WhisperSpeech
  • XTTS

Although there are many other open and closed source models available, we selected these because they are generally accepted as the highest-quality publicly available models.



The TTS Leaderboard

The results from Arena voting will be made publicly available in a dedicated leaderboard. Note that it will initially be empty; once sufficient votes are collected, models will gradually appear. As raters submit new votes, the leaderboard will automatically update.

Like the Chatbot Arena, models will be ranked using an algorithm similar to the Elo rating system, commonly used in chess and other games.
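For intuition, here is a minimal sketch of an Elo-style update after a single head-to-head vote. The K-factor of 32 and the starting ratings of 1200 are illustrative assumptions, not the Arena's actual parameters.

```python
# Illustrative Elo-style update; K=32 and the 1200 starting ratings are assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one vote between the two models."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return (rating_a + k * (s_a - e_a),
            rating_b + k * ((1.0 - s_a) - (1.0 - e_a)))

# Two models start at 1200; model A wins one vote and gains 16 points.
print(update_elo(1200.0, 1200.0, a_won=True))  # -> (1216.0, 1184.0)
```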



Conclusion

We hope the TTS Arena proves to be a helpful resource for all developers. We would love to hear your feedback! Please don't hesitate to let us know if you have any questions or suggestions by sending us an X/Twitter DM, or by opening a discussion in the community tab of the Space.



Credits

Special thanks to all the people who helped make this possible, including Clémentine Fourrier, Lucian Pouget, Yoach Lacombe, Main Horse, and the Hugging Face team. Specifically, I'd like to thank VB for his time and technical assistance. I'd also like to thank Sanchit Gandhi and Apolinário Passos for their feedback and support during the development process.




