Discover more in our official blogpost, featuring an interactive experience
The journey of building world-class Arabic language models has been one of continuous learning and iteration. Today, we’re excited to announce Falcon-H1-Arabic, our most advanced Arabic language model family to date, representing a major breakthrough in both architecture and capabilities. This release embodies months of research, community feedback, and technical innovation, culminating in three powerful models that set new standards for Arabic natural language processing.
Building on Success: The Evolution from Falcon-Arabic
When we launched Falcon-Arabic a few months ago, the response from the community was both humbling and enlightening. Developers, researchers and students across the Arab world used the model for real use cases, pushing it to its limits and providing invaluable feedback. We learned where the model excelled and, more importantly, where it struggled. Long-context understanding, dialectal variations, mathematical reasoning, and domain-specific knowledge emerged as key areas requiring deeper attention.
We didn’t just want to make incremental improvements; we wanted to fundamentally rethink our approach. The result is Falcon-H1-Arabic, a model family that addresses every bit of feedback we received while introducing architectural innovations that were previously unexplored in Arabic language modeling.

Falcon-H1-Arabic 3B, 7B, and 34B models outperform all SOTA models of comparable size, and sometimes larger ones.
A First for Arabic NLP: Hybrid Mamba-Transformer Architecture
Falcon-H1-Arabic is built on the Falcon-H1 hybrid architecture, which integrates State Space Models (Mamba) and Transformer attention within each block. Both components run in parallel, and their representations are fused before the block’s output projection. This design provides the linear-time scalability of Mamba for very long sequences while preserving the precise long-range modeling capabilities of attention. For Arabic, with its rich morphology and flexible sentence structure, this approach significantly improves coherence and reasoning across extended text. We have deployed this architecture across three scales (3B, 7B, and 34B parameters), each balancing capability, efficiency, and deployability for use cases ranging from edge devices to enterprise applications.

Falcon-H1 architecture. Attention and SSM run in parallel within each block; their outputs are concatenated before the block’s output projection. The number of SSM/attention heads depends on the model size. More details in the Falcon-H1 technical report.
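To make the parallel design concrete, here is a minimal, illustrative PyTorch sketch of a block in which an attention branch and an SSM-style branch process the same normalized input and are concatenated before the output projection. The module names, dimensions, and the stand-in SSM branch are ours for illustration only; the actual Falcon-H1 block uses a Mamba-2 state-space module and differs in many details.

```python
import torch
import torch.nn as nn

class HybridParallelBlock(nn.Module):
    """Illustrative hybrid block: attention and an SSM-style branch run in
    parallel on the same input; their outputs are concatenated and projected
    back to the model dimension (simplified sketch, not the production code)."""

    def __init__(self, d_model: int, n_heads: int, d_ssm: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in for the Mamba-2 selective SSM branch; a real block would use
        # a state-space module here, not a plain MLP.
        self.ssm = nn.Sequential(
            nn.Linear(d_model, d_ssm), nn.SiLU(), nn.Linear(d_ssm, d_model)
        )
        self.out_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        ssm_out = self.ssm(h)
        fused = torch.cat([attn_out, ssm_out], dim=-1)  # fuse the two branches
        return x + self.out_proj(fused)                 # residual connection
```

The key point is that the two branches are complementary: the SSM path gives near-linear scaling with sequence length, while the attention path preserves precise token-to-token interactions.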
Breaking Context Boundaries
We have dramatically increased context capabilities from Falcon-Arabic’s 32K limit to 128K tokens for the 3B model and 256K tokens for both the 7B and 34B models. At 256K tokens (~200,000 words), these models can process several novels or hundreds of pages of technical documentation, enabling applications in legal analysis, medical records, academic research, and extended conversations that were previously impractical. Our post-training specifically addresses “lost in the middle” challenges to ensure the models effectively utilize their full context range rather than merely accepting long inputs.
| Parameters | Context Window | Architecture | Ideal Uses |
|---|---|---|---|
| 3B | 128K | Hybrid | Fast agents, high-QPS systems, lightweight analytics |
| 7B | 256K | Hybrid | Production assistants, reasoning, enterprise chat |
| 34B | 256K | Hybrid | Long-document analysis, research, high-stakes tasks |
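As a quick sanity check of the advertised context windows, the models can be loaded with the standard transformers API. The repository id below is an assumption for illustration (check the Falcon-H1-Arabic collection on Hugging Face for the exact names), and the config field holding the context length may differ.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for illustration; see the official collection for exact names.
model_id = "tiiuae/Falcon-H1-Arabic-7B-Instruct"

config = AutoConfig.from_pretrained(model_id)
print(config.max_position_embeddings)  # expect ~256K for the 7B and 34B models

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Simple Arabic chat-style prompt using the model's chat template.
messages = [{"role": "user", "content": "لخص هذا النص في ثلاث نقاط."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```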
Data Quality and Diversity: The Foundation of Excellence
We rebuilt our pre-training data pipeline from the ground up to better reflect the complexity of Arabic. This began with a multi-stage quality filtering process tailored to Arabic orthography, morphology, diacritics, and syntactic patterns. Instead of heuristic filtering, we used deep linguistic analysis to isolate coherent, well-structured text and remove noise commonly found in open-web corpora. The result is a significantly cleaner, more stylistically consistent Arabic dataset.
Dialect coverage was another key priority. Arabic is not monolithic; Modern Standard Arabic coexists with dialects such as Egyptian, Levantine, Gulf, and Maghrebi, each with distinct vocabularies and grammatical constructions. We expanded dialectal sources substantially so the models would understand and generate the full spectrum of real-world Arabic rather than leaning disproportionately toward formal MSA. To maintain global reasoning and domain diversity, we also preserved the multilingual capabilities of Falcon-H1 by training the Arabic models on a nearly equal mixture of Arabic, English, and multilingual content totaling around 300 billion tokens. This ensures strong performance in code, STEM, and cross-lingual reasoning. The following figure illustrates the distribution of the pre-training data across languages and categories. All values are expressed in billions of tokens.

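As a rough back-of-the-envelope view of that mixture, a nearly equal three-way split of roughly 300 billion tokens works out to about 100 billion tokens per bucket. The shares below are illustrative assumptions, not the real recipe; the figure above shows the actual distribution.

```python
# Illustrative token budget for a roughly balanced ~300B-token pre-training mix.
# The shares are assumptions; the actual per-category counts are in the figure above.
TOTAL_TOKENS_BILLIONS = 300
shares = {"arabic": 1 / 3, "english": 1 / 3, "multilingual": 1 / 3}

budgets = {name: round(share * TOTAL_TOKENS_BILLIONS) for name, share in shares.items()}
print(budgets)  # {'arabic': 100, 'english': 100, 'multilingual': 100}
```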
Post-Training: Refining Capabilities Without Compromising Competence
After pre-training, Falcon-H1-Arabic undergoes a focused post-training pipeline consisting of supervised fine-tuning (SFT) followed by direct preference optimization (DPO). During SFT, we expose the models to high-quality Arabic instructions, curated long-context examples, and structured reasoning tasks that teach them to follow directives, maintain coherence over extended sequences, and ground their responses in relevant information. This stage is crucial for ensuring that the models can actually use their large context windows, an ability that does not emerge automatically from the architecture alone.
We follow SFT with a targeted DPO phase to refine alignment, conversational quality, and preference consistency. DPO helps the models balance long-context reasoning with general linguistic competence, improving helpfulness and reducing common failure modes such as drifting, overuse of context, or neglecting earlier information. Throughout both stages, we rigorously monitor for catastrophic forgetting and maintain a controlled curriculum so gains in long-context behavior don’t come at the expense of core reasoning or factual accuracy. The result is a family of models that handles extended documents and dialogue with ease while preserving strong performance on everyday language tasks.
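For reference, the preference-optimization stage follows the standard DPO formulation: the policy is pushed to assign a higher reference-relative log-likelihood to the chosen response than to the rejected one. The sketch below is the textbook objective, not our exact training code, and the beta value is a common default rather than the one used here.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Textbook DPO objective over per-response log-probabilities.

    Each argument is a tensor of shape (batch,) holding the log-probability a
    model assigns to a full response; the frozen reference model keeps the
    policy from drifting too far from its SFT starting point.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Maximize the implicit reward gap between chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```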
Beyond benchmark-oriented optimization, our post-training process deliberately strengthens areas that traditional evaluations don’t fully capture, including conversational faithfulness, rhetorical organization, structured follow-ups, and discourse coherence. These enhancements significantly boost the model’s practical usefulness, making Falcon-H1-Arabic more dependable in real multi-turn dialogue, instruction execution, and long-context conversational flows.
Benchmark Performance: Setting New Standards
Numbers tell an important part of the story. On the Open Arabic LLM Leaderboard (OALL), a comprehensive benchmark evaluating Arabic language understanding across diverse tasks, Falcon-H1-Arabic achieves state-of-the-art results at every scale we tested. Note that our scores may vary slightly from those reported on the leaderboard, as we used vLLM as the backend instead of the leaderboard’s Accelerate-based implementation. These differences are typically under one point while offering significantly faster runtime.
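For context, this is roughly what vLLM-backed generation looks like. The snippet below is a generic greedy-decoding call with an assumed repository id, not the leaderboard harness itself.

```python
from vllm import LLM, SamplingParams

# Assumed repo id for illustration; max_model_len caps KV/state memory for scoring runs.
llm = LLM(model="tiiuae/Falcon-H1-Arabic-7B-Instruct", max_model_len=8192)

prompts = ["السؤال: ما هي عاصمة الإمارات؟\nالجواب:"]
params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy decoding for evaluation

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```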
Beyond OALL, we also report results on the 3LM benchmark for STEM-related tasks on both synthetic and native splits; ArabCulture for Arabic culture assessment; and AraDice for Arabic dialect coverage across Levantine and Egyptian varieties, as well as Arabic culture across six countries. The reported AraDice score is the average of all three scores.
Starting with the 3B model, the performance is outstanding. It reaches roughly 62% on OALL, outperforming all small-scale models, including Gemma-4B, Qwen3-4B, and Phi-4-mini, by roughly ten points. On 3LM, the first Arabic STEM benchmark, it scores around 82% on the native split and 73% on the synthetic split. It also achieves about 62% on the ArabCulture benchmark and around 50% across the AraDice dialect evaluation (Egyptian, Gulf, and Levantine). This makes Falcon-H1-Arabic-3B a high-quality, highly efficient model suitable for edge deployments, real-time applications, and agentic systems where latency and cost matter.
The 7B model continues this upward trajectory. With a score of 71.7% on OALL, it surpasses all models in the ~10B class, including Fanar-9B, Allam-7B*, and Qwen3-8B. On 3LM, it reaches about 92% on the native split and 85% on the synthetic one. AraDice scores rise into the mid-50s across all dialects, and ArabCulture results approach 80%. This model strikes an ideal balance between capability and deployability, making it the most practical choice for general-purpose Arabic NLP in production environments.
The 34B model represents our flagship system and establishes a new state of the art for Arabic language modeling. It reaches roughly 75% on OALL, outperforming not only models of comparable size but even much larger systems such as Llama-3.3-70B and AceGPT2-32B. Its 3LM scores reach about 96% on the native split and 94% on the synthetic one. On ArabCulture it scores near 80%, and on AraDice it reaches around 53% across dialects. The fact that a 34B hybrid model surpasses the performance of 70B-scale transformers demonstrates the effectiveness of the Falcon-H1 architecture, the quality of the data, and the strength of the post-training pipeline.
These benchmark results validate our approach but also highlight an important reality: the frontier of Arabic language modeling is advancing rapidly. Each percentage point on these benchmarks represents countless hours of engineering effort, careful dataset curation, and architectural refinement. The margins by which Falcon-H1-Arabic leads aren’t just statistical artifacts; they translate to meaningfully better user experiences in real-world applications.
Practical Applications: From Edge to Enterprise
Each model in the Falcon-H1-Arabic family is suited to different deployment scenarios. The 3B model is optimized for speed, cost-efficiency, and high-throughput systems, making it ideal for agentic workflows, on-device applications, low-latency chat, and environments with strict resource constraints. The 7B model serves as the general-purpose workhorse for most production applications, powering document understanding systems, chatbots, summarization pipelines, and content generation tools. The 34B model is designed for high-stakes domains where accuracy and long-range reasoning matter most, including legal analysis, medical summarization, academic research, and large-scale enterprise automation. Its extended context window makes it uniquely capable of analyzing hundreds of pages of text in a single pass while maintaining coherence.
Responsible AI and Limitations
Like all language models, Falcon-H1-Arabic may reflect biases from its training data and can produce hallucinated information. Model outputs should not be used as sole authorities for medical, legal, or financial decisions without professional verification. Long-context performance may degrade at extreme ranges. We recommend task-specific evaluation and appropriate guardrails before deployment in production or sensitive applications.
Acknowledgments
This work stands on the shoulders of many. We extend our gratitude to the Arabic NLP research community, whose open sharing of benchmarks, datasets, and methodologies enables progress across the field. Special thanks to our colleagues at TII: Ilyas Chahed, Younes Belkada, Dhia Eddine Rhaiem, Puneesh Khanna, Jingwei Zuo, Mikhail Lubinets, Slim Frikha, Maksim Velikanov, Kacper Piskorski, and Suhail Mohmad for their invaluable support during this project.
Citation
@misc{Falcon-H1-Arabic-2025,
title={Falcon-H1-Arabic: State-of-the-Art Arabic Language Models with Hybrid Mamba-Transformer Architecture},
author={Basma El Amel Boussaha and Mohammed Alyafeai and Ahmed Alzubaidi and Leen AlQadi and Shaikha Alsuwaidi and Omar Alkaabi and Hamza Alobeidli and Hakim Hacid},
url={https://huggingface.co/blog/tiiuae/falcon-h1-arabic},
month={December},
year={2025},
note={Available in 3B, 7B, and 34B parameter versions}
}
- NB: the scores of ALLaM-7B-Instruct-preview in our evaluation are higher than those reported on the OALL leaderboard, as we used the most recent release (7b-alpha-v2.33.0.30), while the leaderboard currently reflects results from the older version (7b-alpha-v1.27.2.25).
Falcon-H1-Arabic models are available for use at the links below. For questions, collaborations, or feedback, reach us at falcon.info@tii.ae or join our community: