Jean-Louis Quéguiner is the Founder and CEO of Gladia. He previously served as Group Vice President of Data, AI, and Quantum Computing at OVHcloud, one of Europe’s leading cloud providers. He holds a Master’s Degree in Symbolic AI from the University of Québec in Canada and Arts et Métiers ParisTech in Paris. Over the course of his career, he has held significant positions across various industries, including financial data analytics, machine learning applications for real-time digital advertising, and the development of speech AI APIs.
Gladia provides advanced audio transcription and real-time AI solutions for seamless integration into products across industries, languages, and technology stacks. By optimizing state-of-the-art ASR and generative AI models, it ensures accurate, lag-free speech and language processing. Gladia’s platform also enables real-time extraction of insights and metadata from calls and meetings, supporting key enterprise use cases such as sales assistance and automated customer support.
What inspired you to tackle the challenges in speech-to-text (STT) technology, and what gaps did you see in the market?
When I founded Gladia, the initial goal was broad—an AI company that would make complex technology accessible. But as we delved deeper, it became clear that voice technology was the most broken and yet most crucial area to address.
Voice is central to our daily lives, and most of our communication happens through speech. Yet, the tools available for developers to work with voice data were inadequate in terms of speed, accuracy, and cost—especially across languages.
I wanted to fix that, to unpack the complexity of voice technology and repackage it into something simple, efficient, powerful, and accessible. Developers shouldn’t have to worry about the intricacies of AI models or the nuances of context length in speech recognition. My goal was to create an enterprise-grade speech-to-text API that worked seamlessly, regardless of the underlying model or technology—a true plug-and-play solution.
What are some of the unique challenges you encountered while building a transcription solution for enterprise use?
When it comes to speech recognition, speed and accuracy—the two key performance indicators in this field—are inversely proportional by design. This means that improving one will compromise the other, at least to some extent. The cost factor, to a large extent, results from the provider’s choice between speed and quality.
When building Gladia, our goal was to find the right balance between these two aspects, all while ensuring the technology remains accessible to startups and SMEs. In the process, we also realized that foundational ASR models like OpenAI’s Whisper, which we worked with extensively, are biased, skewing heavily towards English due to their training data, which leaves many languages under-represented.
So, in addition to solving the speed-accuracy tradeoff, it was important to us—as a European, multilingual team—to optimize and fine-tune our core models to build a truly global API that helps businesses operate across languages.
How does Gladia differentiate itself in the crowded AI transcription market? What makes your Whisper-Zero ASR unique?
Our new real-time engine (Gladia Real Time) achieves an industry-leading 300 ms latency. In addition to that, it’s able to extract insights from a call or meeting with the so-called “audio intelligence” add-ons, like named entity recognition (NER) or sentiment analysis.
To our knowledge, very few competitors are able to provide both transcription and insights at such low latency (less than 1s end-to-end) – and do all of that accurately in languages other than English. Our language support extends to over 100 languages today.
We also put a special emphasis on making the product truly stack agnostic. Our API is compatible with all existing tech stacks and telephony protocols, including SIP, VoIP, FreeSwitch and Asterisk. Telephony protocols are especially complex to integrate with, so we believe this aspect of the product can bring tremendous value to the market.
Hallucinations in AI models are a significant concern, especially in real-time transcription. Can you explain what hallucinations are in the context of STT and how Gladia addresses this problem?
Hallucination usually occurs when the model lacks knowledge or doesn’t have enough context on the subject. Although models can produce outputs tailored to a request, they can only reference information that existed at the time of their training, which may not be up-to-date. The model creates coherent responses by filling in gaps with information that sounds plausible but is wrong.
While hallucinations first became known in the context of LLMs, they occur with speech recognition models—like Whisper ASR, a leading model in the field developed by OpenAI—as well. Whisper’s hallucinations are similar to those of LLMs due to the shared architecture, so it’s an issue that concerns generative models, which are able to predict the words that follow based on the overall context. In a way, they ‘invent’ the output. This approach can be contrasted with more traditional, acoustic-based ASR architectures that match the input sound to output in a more mechanical way.
As a result, you may find words in a transcript that weren’t actually said, which is clearly problematic, especially in fields like medicine, where a mistake of this kind can have grave consequences.
There are several methods to manage and detect hallucinations. One common approach is to use a retrieval-augmented generation (RAG) system, which combines the model’s generative capabilities with a retrieval mechanism to cross-check facts. Another method involves employing a “chain of thought” approach, where the model is guided through a series of predefined steps or checkpoints to ensure that it stays on a logical path.
Another strategy for detecting hallucinations involves using systems that assess the truthfulness of the model’s output during training. There are benchmarks specifically designed to evaluate hallucinations, which involve comparing different candidate responses generated by the model and determining which one is most accurate.
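To make the candidate-comparison idea concrete, here is a minimal sketch (not Gladia's actual pipeline) that decodes the same audio chunk several times and flags it when the candidates disagree, treating low agreement as a possible hallucination signal. The `transcribe_candidate` callable is a hypothetical stand-in for whatever ASR decoder is in use.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List


def agreement_score(candidates: List[str]) -> float:
    """Mean pairwise similarity between candidate transcripts (0..1)."""
    if len(candidates) < 2:
        return 1.0
    pairs = list(combinations(candidates, 2))
    return sum(
        SequenceMatcher(None, a.lower(), b.lower()).ratio() for a, b in pairs
    ) / len(pairs)


def flag_possible_hallucination(
    audio_chunk: bytes,
    transcribe_candidate: Callable[[bytes, float], str],  # hypothetical ASR call
    n_candidates: int = 3,
    threshold: float = 0.8,
) -> dict:
    """Decode the same chunk several times (e.g. at different sampling
    temperatures) and flag it when the candidates diverge too much."""
    temperatures = [0.0, 0.4, 0.8][:n_candidates]
    candidates = [transcribe_candidate(audio_chunk, t) for t in temperatures]
    score = agreement_score(candidates)
    return {
        "candidates": candidates,
        "agreement": score,
        "suspect": score < threshold,  # low agreement -> possible hallucination
    }
```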
We at Gladia have experimented with a combination of techniques when building Whisper-Zero, our proprietary ASR that removes virtually all hallucinations. It has shown excellent results in asynchronous transcription, and we’re currently optimizing it for real-time to achieve the same 99.9% information fidelity.
STT technology must handle a wide range of complexities like accents, noise, and multi-language conversations. How does Gladia approach these challenges to ensure high accuracy?
Language detection in ASR is an extremely complex task. Each speaker has a unique vocal signature, which we call features. By analyzing the vocal spectrum, machine learning algorithms can perform classification, using Mel Frequency Cepstral Coefficients (MFCC) to extract the main frequency characteristics.
MFCC is a technique inspired by human auditory perception. It’s part of the “psychoacoustic” field, which focuses on how we perceive sound. It emphasizes lower frequencies and uses techniques like normalized Fourier decomposition to convert audio into a frequency spectrum.
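For readers who want to see what this looks like in code, here is a short illustrative sketch (using the open-source librosa library, not Gladia's internal tooling) that converts an audio file into the MFCC features described above.

```python
# pip install librosa
import librosa
import numpy as np


def extract_mfcc(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load an audio file and compute its MFCC matrix (n_mfcc x frames)."""
    # librosa resamples to 22.05 kHz by default; 16 kHz is typical for ASR.
    signal, sr = librosa.load(path, sr=16000)
    # Under the hood this applies a windowed Fourier transform, maps the
    # spectrum onto the Mel scale (emphasizing lower frequencies), takes the
    # log, and decorrelates the result with a discrete cosine transform.
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)


# Example: summarize a recording by the mean of each coefficient over time.
# features = extract_mfcc("meeting.wav").mean(axis=1)
```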
However, this approach has a limitation: it’s based purely on acoustics. So, if you speak English with a strong accent, the system may not understand the content but instead judge based on your prosody (rhythm, stress, intonation).
This is where Gladia’s innovative solution comes in. We have developed a hybrid approach that combines psycho-acoustic features with content understanding for dynamic language detection.
Our system doesn’t just listen to how you speak, but also understands what you are saying. This dual approach allows for efficient code-switching and prevents strong accents from being misrepresented or misunderstood.
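As a rough illustration of that dual approach (the production system is considerably more involved), the sketch below fuses two per-language probability distributions, one from an acoustic classifier and one from a text-based classifier run on a draft transcript, into a single decision. Both classifier outputs are hypothetical inputs here.

```python
from typing import Dict


def fuse_language_scores(
    acoustic: Dict[str, float],   # e.g. from an MFCC-based classifier
    textual: Dict[str, float],    # e.g. from a classifier run on a draft transcript
    acoustic_weight: float = 0.4,
) -> str:
    """Weighted fusion of acoustic and content-based language probabilities.

    A heavily accented English speaker may score ambiguously on acoustics
    alone; the textual signal pulls the decision toward what is actually
    being said."""
    languages = set(acoustic) | set(textual)
    fused = {
        lang: acoustic_weight * acoustic.get(lang, 0.0)
        + (1.0 - acoustic_weight) * textual.get(lang, 0.0)
        for lang in languages
    }
    return max(fused, key=fused.get)


# Example: prosody leans French, content says English -> "en" wins.
# fuse_language_scores({"en": 0.4, "fr": 0.6}, {"en": 0.9, "fr": 0.1})
```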
Code-switching—which is among our key differentiators—is a particularly important feature in handling multilingual conversations. Speakers may switch between languages mid-conversation (or even mid-sentence), and the ability of the model to transcribe accurately on the fly despite the switch is critical.
The Gladia API is unique in its ability to handle code-switching across this many language pairs with a high level of accuracy, and it performs well even in noisy environments, which are known to reduce transcription quality.
Real-time transcription requires ultra-low latency. How does your API achieve less than 300 milliseconds of latency while maintaining accuracy?
Keeping latency under 300 milliseconds while maintaining high accuracy requires a multifaceted approach that blends hardware expertise, algorithm optimization, and architectural design.
Real-time AI isn’t like traditional computing—it’s tightly linked to the power and efficiency of GPGPUs. I’ve been working in this space for nearly a decade, leading the AI division at OVHcloud (the largest cloud provider in the EU), and learned firsthand that it’s always about finding the right balance: how much hardware power you need, how much it costs, and how you tailor the algorithms to work seamlessly with that hardware.
Performance in real-time AI comes from effectively aligning our algorithms with the capabilities of the hardware, ensuring every operation maximizes throughput while minimizing delays.
But it’s not only the AI and hardware. The system’s architecture plays a huge role too, especially the network, which can really impact latency. Our CTO, who has deep expertise in low-latency network design from his time at Sigfox (an IoT pioneer), has optimized our network setup to shave off valuable milliseconds.
So, it’s really a combination of all these aspects—smart hardware choices, optimized algorithms, and network design—that lets us consistently achieve sub-300ms latency without compromising on accuracy.
Gladia goes beyond transcription with features like speaker diarization, sentiment analysis, and time-stamped transcripts. What are some innovative applications you’ve seen your clients develop using these tools?
ASR unlocks a wide range of applications for platforms across verticals, and it’s been amazing to see how many truly pioneering companies have emerged in the last two years, leveraging LLMs and our API to build cutting-edge, competitive products. Here are some examples:
- Smart note-taking: Many customers are building tools for professionals who need to quickly capture and organize information from work meetings, student lectures, or medical consultations. With speaker diarization, our API can identify who said what, making it easy to follow conversations and assign action items. Combined with time-stamped transcripts, users can jump straight to specific moments in a recording, saving time and ensuring nothing gets lost in translation.
- Sales enablement: In the sales world, understanding customer sentiment is everything. Teams are using our sentiment analysis feature to gain real-time insights into how prospects respond during calls or demos. Plus, time-stamped transcripts help teams revisit key parts of a conversation to refine their pitch or address client concerns more effectively. For this use case specifically, NER is also key to identifying names, company details, and other information that can be extracted from sales calls to feed the CRM automatically.
- Call center assistance: Companies in the contact center space are using our API to provide live assistance to agents, as well as to flag customer sentiment during calls. Speaker diarization ensures that things being said are attributed to the right person, while time-stamped transcripts enable supervisors to review critical moments or compliance issues quickly. This not only improves the customer experience – with a higher on-call resolution rate and quality monitoring – but also boosts agent productivity and satisfaction.
Can you discuss the role of custom vocabularies and entity recognition in improving transcription reliability for enterprise users?
Many industries depend on specialized terminology, brand names, and unique language nuances. Custom vocabulary integration allows the STT solution to adapt to these specific needs, which is crucial for capturing contextual nuances and delivering output that accurately reflects your business needs. For instance, it allows you to create a list of domain-specific words, such as brand names, in a specific language.
Why it’s useful: Adapting the transcription to the specific vertical allows you to minimize errors in transcripts, resulting in a better user experience. This feature is particularly critical in fields like medicine or finance.
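As a purely illustrative example of how such a list is typically supplied, the snippet below sends a transcription request with a custom vocabulary field. The endpoint, field names, and terms are hypothetical placeholders; the actual request format depends on the provider's API reference.

```python
# pip install requests
import requests

# Hypothetical endpoint and field names, shown only to illustrate the idea;
# consult your STT provider's documentation for the real request format.
API_URL = "https://api.example-stt.com/v2/transcription"
API_KEY = "YOUR_API_KEY"

payload = {
    "audio_url": "https://example.com/recordings/cardiology-consult.wav",
    "language": "en",
    # Domain-specific terms the decoder should favor when the audio is ambiguous.
    "custom_vocabulary": ["atrial fibrillation", "apixaban", "echocardiogram"],
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```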
Named entity recognition (NER) extracts and identifies key information from unstructured audio data, such as names of people, organizations, locations, and more. A common challenge with unstructured data is that this critical information isn’t readily accessible—it’s buried within the transcript.
To solve this, Gladia developed a structured Key Data Extraction (KDE) approach. By leveraging the generative capabilities of its Whisper-based architecture—similar to LLMs—Gladia’s KDE captures context to identify and extract relevant information directly.
This process can be further enhanced with features like custom vocabulary and NER, allowing businesses to populate CRMs with key data quickly and efficiently.
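To illustrate the kind of entity extraction this enables (a sketch using the open-source spaCy library rather than Gladia's own KDE engine), named entities can be pulled from a finished transcript and grouped into CRM-friendly fields:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_crm_fields(transcript: str) -> dict:
    """Group named entities found in a transcript into CRM-friendly buckets."""
    doc = nlp(transcript)
    fields = {"people": [], "organizations": [], "locations": [], "dates": []}
    label_map = {"PERSON": "people", "ORG": "organizations",
                 "GPE": "locations", "LOC": "locations", "DATE": "dates"}
    for ent in doc.ents:
        bucket = label_map.get(ent.label_)
        if bucket and ent.text not in fields[bucket]:
            fields[bucket].append(ent.text)
    return fields


# Example (output depends on the model, roughly):
# extract_crm_fields("Maria from Acme Corp asked for a follow-up next Tuesday in Berlin.")
# -> {'people': ['Maria'], 'organizations': ['Acme Corp'],
#     'locations': ['Berlin'], 'dates': ['next Tuesday']}
```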
In your opinion, how is real-time transcription transforming industries such as customer support, sales, and content creation?
Real-time transcription is reshaping these industries in profound ways, driving incredible productivity gains, coupled with tangible business advantages.
First, real-time transcription is a game-changer for support teams. Real-time assistance is essential to improving the resolution rate thanks to faster responses, smarter agents, and better outcomes (in terms of NSF, handle times, and so on). As ASR systems get better and better at handling non-English languages and performing real-time translation, contact centers can achieve a truly global CX at lower margins.
In sales, speed and spot-on insights are everything. Similar to what happens with call agents, real-time transcription equips sales teams with the right insights at the right time, enabling them to focus on what matters most in closing deals.
For creators, real-time transcription is probably less relevant today, but still full of potential, especially when it comes to live captioning and translation during media events. Most of our current media customers still prefer asynchronous transcription, as speed is less critical there, while accuracy is essential for applications like time-stamped video editing and subtitle generation.
Real-time AI transcription appears to be a growing trend. Where do you see this technology heading in the next 5-10 years?
I feel like this phenomenon, which we now call real-time AI, is going to be everywhere. Essentially, what we’re really referring to here is the seamless ability of machines to interact with people, the way we humans already interact with each other.
And if you look at any Hollywood movie (like Her) set in the future, you’ll never see anyone interacting with intelligent systems via a keyboard. For me, that serves as the ultimate proof that in the collective imagination of humanity, voice will always be the primary way we interact with the world around us.
Voice, as the primary vector for aggregating and sharing human knowledge, has been part of human culture and history for far longer than writing. Then, writing took over because it enabled us to preserve our knowledge more effectively than relying on community elders to be the guardians of our stories and wisdom.
GenAI systems, capable of understanding speech, generating responses, and storing our interactions, have brought something completely new to the space. It’s the best of both worlds and the best of humanity, really. It gives us the unique power of voice communication with the advantage of memory, which previously only written media could secure for us. That is why I believe it’s going to be everywhere – it’s our ultimate collective dream.