
The State of Multilingual LLMs: Moving Beyond English


According to Microsoft research, around 88% of the world’s languages, spoken by 1.2 billion people, lack access to Large Language Models (LLMs). This is because most LLMs are English-centric: they are built largely with English data and for English speakers. This English dominance in LLM development has created a digital language divide that risks excluding most people from the benefits of LLMs. Solving this problem requires an LLM that can be trained on, and perform tasks in, several languages. Enter multilingual LLMs!

What are Multilingual LLMs?

A multilingual LLM can understand and generate text in multiple languages. These models are trained on datasets spanning many languages and can tackle tasks in any of them from a user’s prompt.

The applications of multilingual LLMs are enormous: translating literature into local dialects, real-time multilingual communication, multilingual content creation, and more. They could help everyone access information and communicate with one another easily, regardless of the language they speak.

At the same time, multilingual LLMs must address challenges such as a lack of cultural nuance and context, training data limitations, and the potential loss of meaning in translation.

How do Multilingual LLMs Work?

Constructing a multilingual LLM involves carefully preparing a balanced corpus of text in various languages and choosing an appropriate architecture and training technique, typically a Transformer model, which is well suited to multilingual learning.

Source: Image by author

One technique is to use shared embeddings, which capture the semantic meaning of words across different languages. This lets the LLM learn the similarities and differences between languages, enabling it to understand each of them better.
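
To illustrate the idea of shared multilingual representations, here is a minimal sketch using the off-the-shelf XLM-RoBERTa encoder from Hugging Face Transformers (the model choice and mean-pooling are illustrative, not a description of any specific LLM’s internals): the same sentence in different languages passes through one shared vocabulary and embedding space, and the resulting vectors can be compared directly.

```python
# Minimal sketch: one shared subword vocabulary and one embedding table serve
# every language, so semantically similar sentences get nearby vectors.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

sentences = ["The cat sleeps.", "Le chat dort.", "Die Katze schläft."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    # Mean-pool token embeddings into one vector per sentence.
    hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    sentence_vecs = (hidden * mask).sum(1) / mask.sum(1)

# Cosine similarity between the English and French sentences.
sim = torch.nn.functional.cosine_similarity(sentence_vecs[0], sentence_vecs[1], dim=0)
print(f"EN-FR similarity: {sim.item():.3f}")
```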

This shared knowledge also empowers the LLM to adapt to a wide range of linguistic tasks, such as translating between languages or writing in different styles. Another technique is cross-lingual transfer learning, where the model is pre-trained on a large corpus of multilingual data before being fine-tuned on specific tasks.

This two-step process ensures the model has a robust foundation in multilingual language understanding, making it adaptable to numerous downstream applications.
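
Here is a minimal sketch of that two-step pattern, assuming the Hugging Face transformers and datasets libraries: a multilingual encoder is fine-tuned on the English portion of the XNLI dataset and then evaluated, with no further training, on French. The model, dataset slice sizes, and hyperparameters are illustrative choices, not a recipe from any of the papers discussed here.

```python
# Sketch of cross-lingual transfer: fine-tune a multilingual encoder on an
# English task, then evaluate it on another language it never saw task data for.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# XNLI: English split for fine-tuning, French split for zero-shot evaluation.
train_ds = load_dataset("xnli", "en", split="train[:2000]")
eval_ds = load_dataset("xnli", "fr", split="validation[:500]")

def preprocess(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(preprocess, batched=True)
eval_ds = eval_ds.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xnli-en",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
print(trainer.evaluate())   # loss on French examples the model never trained on
```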

Examples of Multilingual Large Language Models

Multilingual LLM comparison chart

Source: Ruder.io

Several notable examples of multilingual LLMs have emerged, each catering to specific linguistic needs and cultural contexts. Let’s explore just a few of them:

1. BLOOM

BLOOM is an open-access multilingual LLM that prioritizes diverse languages and accessibility. With 176 billion parameters, BLOOM can handle tasks in 46 natural languages and 13 programming languages, making it one of the largest and most diverse LLMs available.

BLOOM’s open-source nature allows researchers, developers, and language communities to benefit from its capabilities and contribute to its improvement.
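
The BLOOM checkpoints are published on Hugging Face, so trying the model is straightforward. The sketch below uses the small bloom-560m variant, since the full 176B model needs multi-GPU hardware; it shows the loading and generation pattern, not a production setup.

```python
# Sketch: generating text with a small BLOOM checkpoint via Transformers.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

prompt = "La traduction automatique est"   # BLOOM also covers French, Spanish, code, etc.
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```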

2. YAYI 2

YAYI 2 is an open-source LLM designed specifically for Asian languages, considering the region’s complexities and cultural nuances. It was pre-trained from scratch on a multilingual corpus of over 16 Asian languages containing 2.65 trillion filtered tokens.

This allows the model to deliver better results that meet the specific requirements of the languages and cultures of Asia.

3. PolyLM

PolyLM is an open-source ‘polyglot’ LLM that focuses on the challenges of low-resource languages by offering adaptation capabilities. It was trained on a dataset of about 640 billion tokens and is available in two model sizes: 1.7B and 13B parameters. PolyLM covers over 16 different languages.

It enables models trained on high-resource languages to be fine-tuned for low-resource languages with limited data. This flexibility makes LLMs more useful across a wider range of language situations and tasks.
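
One common way to do this kind of adaptation in practice is parameter-efficient fine-tuning, for example LoRA via the peft library. The sketch below is an assumption-laden illustration of that workflow, not PolyLM’s documented procedure: the base-model id is a placeholder, and only small adapter weights are trained on a limited in-language corpus.

```python
# Sketch: adapting a multilingual base model to a low-resource language with
# LoRA adapters (PEFT), so only a tiny fraction of the weights is trained.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "your-multilingual-base-model"   # placeholder: e.g. a PolyLM or BLOOM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Low-rank adapters; target_modules may need adjusting to the architecture.
lora = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM",
                  target_modules="all-linear")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of all weights

# From here, a standard causal-LM fine-tuning loop (e.g. transformers.Trainer)
# over a small corpus in the low-resource language completes the adaptation.
```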

4. XGLM

XGLM, with 7.5 billion parameters, is a multilingual LLM trained on a corpus covering a diverse set of over 20 languages and geared towards few-shot learning. It is part of a family of large-scale multilingual LLMs trained on an enormous dataset of text and code.

It aims to cover many languages comprehensively, which is why it focuses on inclusivity and linguistic diversity. XGLM demonstrates the potential of building models that cater to the needs of varied language communities.
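
Few-shot use typically means supplying a handful of in-context examples in the prompt. Here is a minimal sketch with the small 564M XGLM checkpoint from Hugging Face, standing in for the 7.5B model, which needs far more memory; the prompt format is illustrative.

```python
# Sketch: few-shot prompting with a small XGLM checkpoint.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M")

# A couple of in-context translation examples, then a new query.
prompt = (
    "English: good morning => Spanish: buenos días\n"
    "English: thank you => Spanish: gracias\n"
    "English: see you tomorrow => Spanish:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```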

5. mT5

mT5 (massively multilingual Text-to-Text Transfer Transformer) was developed by Google AI. Trained on a Common Crawl-based corpus, mT5 is a state-of-the-art multilingual LLM that can handle 101 languages, ranging from widely spoken Spanish and Chinese to lower-resourced languages like Basque and Quechua.

It also excels at multilingual tasks like translation, summarization, and question answering.
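
A loading sketch with the small mT5 checkpoint from Hugging Face is shown below. Note that the released mT5 checkpoints are pretrained only on a denoising objective, so in practice they are fine-tuned on task data before they produce useful translations or summaries; the snippet just shows the seq2seq usage pattern.

```python
# Sketch: loading mT5 and running text-to-text generation with Transformers.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# After task fine-tuning, inference follows the usual seq2seq pattern:
text = "summarize: Los modelos multilingües pueden procesar más de cien idiomas."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```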

Is a Universal LLM Possible?

The concept of a language-neutral LLM, capable of understanding and generating language without bias towards any particular language, is intriguing.

While a truly universal LLM is still far off, current multilingual LLMs have demonstrated significant success. Once fully developed, they could cater to the needs of under-represented languages and diverse communities.

For example, research shows that many multilingual LLMs can facilitate zero-shot cross-lingual transfer from a resource-rich language to a resource-poor language without task-specific training data.

Likewise, models like YAYI and BLOOM, which focus on specific languages and communities, have demonstrated the potential of language-centric approaches in driving progress and inclusivity.

To build a universal LLM or improve current multilingual LLMs, individuals and organizations must do the following:

  • Crowdsource native speakers for community engagement and curation of language datasets.
  • Support community efforts through open-source contributions and funding for multilingual research and development.

Challenges of Multilingual LLMs

While the concept of universal multilingual LLMs holds great promise, they also face several challenges that must be addressed before we can benefit from them:

1. Data Quantity

Multilingual models require a larger vocabulary than monolingual models to represent tokens in many languages, yet many languages lack large-scale datasets. This makes it difficult to train these models effectively.

2. Data Quality Concerns

Ensuring the accuracy and cultural appropriateness of multilingual LLM outputs across languages is a major concern. Models must be trained and fine-tuned with meticulous attention to linguistic and cultural nuances to avoid biases and inaccuracies.

3. Resource Limitations

Training and running multilingual models require substantial computational resources, such as powerful GPUs (e.g., the NVIDIA A100). The high cost poses challenges, particularly for low-resource languages and communities with limited access to computational infrastructure.

4. Model Architecture

Adapting model architectures to accommodate diverse linguistic structures and complexities is an ongoing challenge. Models must be able to handle languages with different word orders, morphological variations, and writing systems while maintaining high performance and efficiency.

5. Evaluation Complexities

Evaluating the performance of multilingual LLMs beyond English benchmarks is critical for measuring their true effectiveness. It requires considering cultural nuances, linguistic peculiarities, and domain-specific requirements.

Multilingual LLMs have the potential to break down language barriers, empower under-resourced languages, and facilitate effective communication across diverse communities.

Don’t miss out on the latest news and analysis in AI and ML – visit unite.ai today.
