We’re excited to introduce Falcon-Arabic, a 7B-parameter Language Model that sets a new benchmark for Arabic NLP. Built on the Falcon 3 architecture, Falcon-Arabic is a multilingual model that supports Arabic, English, and several other languages. It excels at general knowledge, Arabic grammar, mathematical reasoning, complex problem solving, and understanding the rich diversity of Arabic dialects. Falcon-Arabic supports a context length of 32,000 tokens, allowing it to handle long documents and enabling advanced applications like retrieval-augmented generation (RAG), in-depth content creation, and knowledge-intensive tasks.
Falcon-Arabic redefines the boundaries of what is possible for Arabic Language Models. It significantly outperforms other Arabic LLMs in its size category, and even models up to 4 times larger, across both Arabic-native models and those adapted from other languages. This makes Falcon-Arabic not only a state-of-the-art model in terms of performance, but also a uniquely efficient and accessible solution for developers and researchers working with the Arabic language.
🚀 Introducing Falcon-Arabic: Advancing LLMs for the Arabic-Speaking World
In recent years, Large Language Models (LLMs) have transformed Artificial Intelligence, powering tools for translation, content creation, virtual assistance, and more. Yet much of this progress has focused on highly represented languages like English, leaving languages such as Arabic underrepresented. Arabic presents unique challenges: it is morphologically rich, diglossic (spanning both Modern Standard Arabic (MSA) and diverse regional dialects), and used across a vast and culturally varied population. Developing robust Arabic LLMs is crucial to ensure Arabic-speaking communities are fully included in the AI revolution.
With this goal in mind, we are introducing Falcon-Arabic, a specialized adaptation of the Falcon 3 model family, developed by the Technology Innovation Institute (TII) in the UAE. The Falcon models have earned global recognition for their multilingual strength and open-source approach. Falcon-Arabic builds on this legacy, bringing advanced language understanding and generation to Arabic. By training the model to handle both Modern Standard Arabic and key dialects, Falcon-Arabic fills a critical gap in language technology, enabling more natural, intelligent, and inclusive Arabic AI across the Gulf, Middle East, and North Africa.
🦅 Falcon-Arabic Has Landed – Here’s the Training Recipe 🧪
Building Falcon-Arabic began with a strategic decision: rather than training a model from scratch, we chose to adapt a strong multilingual foundation. In the Arabic LLM landscape, three major approaches exist: training from scratch (e.g., Jais-native), adapting multilingual models (like Allam or Fanar), or using models that natively support Arabic alongside other languages (such as Qwen or LLaMA). Observing the Open Arabic LLM Leaderboard, it became clear that adapted and multilingual models consistently outperformed others in both efficiency and capability. To build on that momentum, we selected Falcon 3-7B, a model that strikes a practical balance between performance and resource efficiency within the Falcon 3 family developed by the Technology Innovation Institute (TII).
The core challenge was adapting Falcon 3-7B, which originally lacked Arabic support at the tokenizer and embedding level. We addressed this by extending the tokenizer’s vocabulary with 32,000 Arabic-specific tokens and applying a novel embedding initialization strategy based on textual similarity. This method mapped new Arabic tokens to semantically related embeddings from the existing vocabulary, allowing the model to inherit prior knowledge and accelerate learning, particularly around sentiment, abstract concepts, and reasoning patterns. This gave Falcon-Arabic a head start in understanding and generating high-quality Arabic text.
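To make the idea concrete, here is a minimal sketch of this kind of similarity-based initialization. The checkpoint name, the example tokens, and the similarity mapping are illustrative placeholders, not the exact procedure used for Falcon-Arabic:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base checkpoint to extend (name assumed for illustration).
model = AutoModelForCausalLM.from_pretrained("tiiuae/Falcon3-7B-Base")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon3-7B-Base")

# Hypothetical inputs: the new Arabic tokens to add, plus a mapping from each
# new token to textually similar words already covered by the old vocabulary
# (e.g., from a bilingual dictionary or an embedding-similarity lookup).
new_tokens = ["مرحبا", "كتاب"]
similar_words = {"مرحبا": ["hello", "hi"], "كتاب": ["book"]}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Gather the subword ids the similar words map to in the old
        # vocabulary and initialize the new row as their mean embedding.
        piece_ids = [i for w in similar_words[tok]
                     for i in tokenizer(w, add_special_tokens=False)["input_ids"]]
        emb[new_id] = emb[piece_ids].mean(dim=0)
```

Compared with random initialization, this places each new token near related concepts in embedding space from the start, which is what lets the model inherit prior knowledge rather than relearn it.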
With the tokenizer and embeddings in place, we began continuous pretraining on high-quality, 100% native Arabic datasets, avoiding machine-translated content to minimize cultural bias and preserve linguistic authenticity. Training followed a multi-stage curriculum: early stages focused on general knowledge and dialect-rich Arabic content to stabilize the model and reinforce linguistic capabilities, while later phases emphasized math, code, and reasoning. The result is a model that not only speaks Arabic fluently across dialects, but also retains Falcon’s multilingual and reasoning strengths, pushing the boundaries of Arabic-first AI.
Average Performance of Pretrained Models
📊 Falcon-Arabic: Raising the Bar in Arabic LLMs
We evaluated Falcon-Arabic on OALL v2, the leading benchmark suite for Arabic Language Models. It includes six multiple-choice tasks (Arabic MMLU in its native and translated versions, Arabic Exams, Alghafa, MadinahQA, and Aratrust) and one generative benchmark, Alrage. Falcon-Arabic outperforms all existing Arabic LLMs in its size range and even surpasses models up to 4× larger. It leads on key benchmarks like Arabic MMLU, Exams, MadinahQA, and Aratrust, setting a new standard for Arabic-first Language Models.
Comparison Table of Pretrained Models
The evaluation details (log probabilities, predictions, and LLM-as-judge metrics) of Falcon-Arabic-7B-Base are available at https://huggingface.co/datasets/tiiuae/Falcon-Arabic-7B-Base-details.
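If you want to inspect those details programmatically, something like the following should work with the 🤗 `datasets` library; the details repository stores one configuration per evaluated task, so listing the configurations first is the safest approach:

```python
from datasets import get_dataset_config_names, load_dataset

repo = "tiiuae/Falcon-Arabic-7B-Base-details"

# Each evaluated task is stored as its own configuration; list them first.
configs = get_dataset_config_names(repo)
print(configs)

# Load the per-sample details (predictions, log probabilities, judge scores)
# for one task; picking the first config here is just for illustration.
details = load_dataset(repo, configs[0])
print(details)
```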
🗣️ From Pretraining to Instruct: Aligning Falcon-Arabic for Conversations
After finalizing the base model training, we performed a post-training alignment phase to tune Falcon-Arabic’s responses to human preferences. This phase began with supervised fine-tuning (SFT) using a mix of high-quality public datasets and internally collected native Arabic instruction data, covering a wide range of tasks and conversational scenarios.
To further enhance alignment, we applied Direct Preference Optimization (DPO), a preference-based alignment method that tunes the model directly on human preference comparisons, teaching it to favor outputs rated as more helpful, safe, and relevant. This two-step process ensures that Falcon-Arabic Instruct not only understands Arabic well but responds in a way that aligns with real user expectations.
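For readers who want to reproduce this kind of pipeline on their own data, a minimal DPO sketch with the TRL library might look as follows; the checkpoint path, dataset file, and hyperparameters are illustrative assumptions, not our internal configuration:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from an SFT checkpoint (path is a placeholder).
model_id = "path/to/falcon-arabic-sft-checkpoint"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference data with "prompt", "chosen", and "rejected" columns, as
# expected by TRL's DPOTrainer (file name is a placeholder).
prefs = load_dataset("json", data_files="arabic_preferences.jsonl", split="train")

args = DPOConfig(
    output_dir="falcon-arabic-dpo",
    beta=0.1,  # strength of the implicit KL penalty toward the SFT reference
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

# With ref_model left unset, TRL clones the initial model as the frozen reference.
trainer = DPOTrainer(model=model, args=args, train_dataset=prefs,
                     processing_class=tokenizer)
trainer.train()
```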
Average Performance of Instruct Models
As shown in the results plots, Falcon-Arabic Instruct leads the pack, outperforming all other instruction-aligned Arabic LLMs in its size class, and even significantly larger models, across multiple benchmarks. The model demonstrates strong performance in both instruction following and open-ended dialogue, setting a new standard for Arabic conversational AI.
Performance of Instruct Models by Benchmark
Comparison Table of Chat Models
The evaluation details (log probabilities, predictions, and LLM-as-judge metrics) of Falcon-Arabic-7B-Instruct are available at https://huggingface.co/datasets/tiiuae/Falcon-Arabic-7B-Instruct-details.
🔓 Unlocking the Potential of Arabic AI
Falcon-Arabic sets a new benchmark for Arabic Language Models. With only 7B parameters, it delivers state-of-the-art performance, outperforming models of comparable size and even those several times larger across key benchmarks like Arabic MMLU, MadinahQA, and Aratrust. It combines fluency in Modern Standard Arabic, strong understanding of regional dialects, and robust reasoning and multilingual capabilities, making it ideal for a wide range of applications: from Arabic-first chatbots and educational tools to content generation, code assistance, and document understanding.
To give you a hands-on feel for what Falcon-Arabic can do, we built a simple demo that showcases its capabilities in machine translation, even though the model hasn’t been fine-tuned specifically for that task. The tool runs purely on Falcon-Arabic-7B-Instruct, and the results are surprisingly strong across various translation directions. You can try it yourself through the demo linked just below. In fact, we used the same setup to translate this blog post into Arabic for our Arabic-speaking audience. Check it out here 🚀. And if you’re curious to explore more, we also provide access to a live playground where you can interact with Falcon-Arabic Instruct and experience its performance across different tasks ✨.
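Trying the model yourself also takes only a few lines with 🤗 `transformers`; the sketch below assumes the instruct checkpoint is published as `tiiuae/Falcon-Arabic-7B-Instruct` on the Hugging Face Hub:

```python
from transformers import pipeline

# Chat-style generation with the instruct model (checkpoint id assumed).
chat = pipeline("text-generation",
                model="tiiuae/Falcon-Arabic-7B-Instruct",
                device_map="auto")

messages = [
    {"role": "user",
     "content": "ترجم الجملة التالية إلى العربية: "
                "Falcon-Arabic sets a new benchmark for Arabic language models."},
]
out = chat(messages, max_new_tokens=128)
# Recent transformers versions return the full chat; the last message is the reply.
print(out[0]["generated_text"][-1]["content"])
```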
⚠️ Limitations
Like all Large Language Models, Falcon-Arabic inherits some common limitations. These include occasional hallucinations (producing plausible but incorrect outputs), sensitivity to how prompts are phrased, and variable performance across very long contexts. While Falcon-Arabic is designed to reduce these issues, especially for Arabic tasks, users should still apply critical thinking when interpreting results, particularly in high-stakes or fact-sensitive use cases.
Citation
If you find this work helpful in your research or projects, please consider citing it.
@misc{falcon-arabic,
  title = {Falcon-Arabic: A Breakthrough in Arabic Language Models},
  author = {Falcon-LLM Team},
  month = {May},
  year = {2025},
  url = {https://falcon-lm.github.io/blog/falcon-arabic}
}