Generative AI Is Not a Death Sentence for Endangered Languages

According to UNESCO, as many as half of the world’s languages could be extinct by 2100. Many people say generative AI is contributing to this process.

The decline in language diversity didn’t start with AI, or even with the web. But AI can accelerate the demise of indigenous and low-resource languages.

Many of the world’s 7,000+ languages lack sufficient resources to train AI models, and many lack a written form. As a result, a few major languages dominate humanity’s stock of potential AI training data, while most stand to be left behind in the AI revolution, and may disappear entirely.

The simple reason is that most available AI training data is in English. English is the main driver of large language models (LLMs), and people who speak less-common languages are finding themselves underrepresented in AI technology.

Consider these statistics from the World Economic Forum:

  • Two-thirds of all websites are in English.
  • Much of the data that generative AI learns from is scraped from the web.
  • Fewer than 20% of the world’s population speaks English.

As AI becomes more embedded in our daily lives, we should all be thinking about language equity. AI has unprecedented potential to solve problems at scale, and its promise shouldn’t be limited to the English-speaking world. Yet today, the conveniences and tools AI creates to enhance people’s personal and professional lives flow mostly to people in wealthy, developed nations.

Speakers of low-resource languages are accustomed to a shortage of representation in technology, from not finding websites in their language to not having their dialect recognized by Siri. Much of the text available to train AI in lower-resourced languages is poor quality (often itself translated with questionable accuracy) and narrow in scope.

How can society ensure that lower-resourced languages don’t get left out of the AI equation? How can we ensure that language isn’t a barrier to the promise of AI?

In an effort toward language inclusivity, some major tech players have launched initiatives to train massive multilingual language models (MLMs). Microsoft Translator, for instance, has pledged to support “every language, everywhere.” And Meta has made a “No Language Left Behind” promise. These are laudable goals, but are they realistic?

Aspiring toward one model that handles every language in the world favors the privileged, because there are far greater volumes of data for the world’s major languages. When we start dealing with lower-resource languages and languages with non-Latin scripts, training AI models becomes more arduous, time-consuming, and expensive. Think of it as an unintentional tax on underrepresented languages.

Advances in Speech Technology

AI models are largely trained on text, which naturally favors languages with deeper stores of written content. Language diversity would be better supported by systems that don’t depend on text. Human interaction was once entirely speech-based, and many cultures retain that oral focus. To better serve a global audience, the AI industry must progress from text data to speech data.

Research is making huge strides in speech technology, but it still lags behind text-based work. Speech processing is progressing, yet direct speech-to-speech translation is far from mature. The reality is that the industry tends to move cautiously, adopting a technology only once it has advanced to a certain level.

TransPerfect’s newly released GlobalLink Live interpretation platform uses the more mature forms of speech technology, automatic speech recognition (ASR) and text-to-speech (TTS), precisely because direct speech-to-speech systems are not yet mature enough. That said, our research teams are preparing for the day when fully speech-to-speech pipelines are ready for prime time.
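
To see what a cascaded pipeline of this kind looks like in practice, here is a minimal sketch in Python that chains ASR, machine translation, and TTS, using open models from the Hugging Face hub (Whisper, NLLB, and Bark) purely as illustrative stand-ins. It is not TransPerfect’s production stack, and the model choices, language codes, and file name are assumptions.

    # Cascaded speech-to-speech translation: ASR -> MT -> TTS.
    # Open models are used purely as illustrative stand-ins; a production
    # system would swap in its own components and handle streaming and latency.
    from transformers import pipeline

    # 1. Automatic speech recognition: source-language audio -> text.
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

    # 2. Machine translation: source text -> target text.
    #    NLLB-200 covers roughly 200 languages, many of them lower-resource.
    mt = pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",
        src_lang="spa_Latn",  # Spanish in, as an example
        tgt_lang="eng_Latn",  # English out
    )

    # 3. Text-to-speech: target text -> target-language audio.
    tts = pipeline("text-to-speech", model="suno/bark-small")

    def interpret(audio_path: str) -> dict:
        """Turn source-language speech into synthesized target-language speech."""
        source_text = asr(audio_path)["text"]
        target_text = mt(source_text)[0]["translation_text"]
        return tts(target_text)  # {"audio": ndarray, "sampling_rate": int}

    speech = interpret("spanish_remarks.wav")  # hypothetical input file

A direct speech-to-speech model would collapse these three stages into one; that single-model leap is precisely what the industry is still waiting to mature.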

Speech-to-speech translation models offer huge promise for the preservation of oral languages. In 2022, Meta announced the first AI-powered speech-to-speech translation system for Hokkien, a primarily oral language spoken by about 46 million people in the Chinese diaspora. It’s part of Meta’s Universal Speech Translator project, which is developing new AI models that it hopes will enable real-time speech-to-speech translation across many languages. Meta opted to open-source its Hokkien translation models, evaluation datasets, and research papers so that others can reproduce and build on its work.

Learning with Less

The fact that we as a global community lack resources for certain languages is not a death sentence for those languages. This is where multilingual models do have an advantage: the languages learn from one another. All languages follow patterns, and thanks to knowledge transfer between languages, the need for training data is reduced.

Suppose you have a model that’s learning 90 languages and you want to add Inuit (a group of indigenous North American languages). Thanks to knowledge transfer, you need far less Inuit data than you would to train a model from scratch. We’re finding ways to learn with less: the amount of data needed to fine-tune engines keeps dropping.
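
To make “learning with less” concrete, here is a minimal sketch of fine-tuning a pretrained multilingual translation model on a small parallel corpus, using the Hugging Face transformers library and Meta’s open NLLB checkpoint as illustrative assumptions. The sentence pairs below are placeholders; Fon (a language we will meet again shortly) serves as the example target, while a language entirely absent from the model, such as an Inuit language, would first need a new language code added to the tokenizer.

    # "Learning with less": adapt a multilingual model to one more language
    # with a small fine-tuning set instead of training from scratch.
    # Model name, language codes, and sentence pairs are illustrative.
    from datasets import Dataset
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    model_name = "facebook/nllb-200-distilled-600M"
    tokenizer = AutoTokenizer.from_pretrained(
        model_name, src_lang="fra_Latn", tgt_lang="fon_Latn"  # French -> Fon
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Thousands of curated pairs rather than millions: cross-lingual transfer
    # from the languages the model already knows does the heavy lifting.
    pairs = [
        {"src": "Bonjour.", "tgt": "<Fon translation here>"},  # placeholder
        # ... a few thousand more curated French-Fon pairs ...
    ]

    def preprocess(batch):
        inputs = tokenizer(batch["src"], truncation=True, max_length=128)
        labels = tokenizer(text_target=batch["tgt"], truncation=True, max_length=128)
        inputs["labels"] = labels["input_ids"]
        return inputs

    train_set = Dataset.from_list(pairs).map(
        preprocess, batched=True, remove_columns=["src", "tgt"]
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(
            output_dir="nllb-fon-finetuned",
            num_train_epochs=3,
            per_device_train_batch_size=8,
            learning_rate=2e-5,
        ),
        train_dataset=train_set,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()

Parameter-efficient techniques such as LoRA shrink the data and compute bill further by updating only a small fraction of the model’s weights.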

I’m hopeful about a future with more inclusive AI. I don’t believe we’re doomed to see scores of languages disappear, nor do I think AI will remain the domain of the English-speaking world. Already, we’re seeing more awareness around the issue of language equity. From more diverse data collection to building more language-specific models, we’re making headway.

Consider Fon, a language spoken by about 4 million people in Benin and neighboring African countries. Not long ago, a popular AI model described Fon as a fictional language. A computer scientist named Bonaventure Dossou, whose mother speaks Fon, was used to this kind of exclusion. Dossou, who speaks French, grew up with no translation program to help him communicate with his mother. Today, he can communicate with her thanks to a Fon-French translator that he painstakingly built, and there is also a fledgling Fon Wikipedia.

In an effort to use technology to preserve languages, Turkish artist Refik Anadol has kicked off the creation of an open-source AI tool for Indigenous peoples. At the World Economic Forum, he asked: “How on Earth can we create an AI that doesn’t know the whole of humanity?”

We can’t, and we won’t.
