Current status of Arabic LLM leaderboards
The growing availability of LLMs supporting Arabic, both as monolingual and multilingual models, prompted the community to create dedicated Arabic language leaderboards. Previously, Arabic-focused leaderboards were typically confined to narrow benchmarks introduced by specific authors, often as demos for their work. In these cases, the authors would set up leaderboards to showcase how models performed on a specific task or dataset. Alternatively, other leaderboards required users to run evaluations on their own computing resources and then submit a JSON file containing their results for display.
While these approaches helped spark initial interest in Arabic benchmarking, they also introduced several challenges:
- Resource Limitations: Many community members lack access to the substantial computational resources needed to evaluate all available open-source models in order to determine which one would be best for their downstream project or application. They are forced to rely only on the results shared by model makers in their documentation, which often does not allow for a direct comparison. This high cost in both time and compute can become a major barrier to participation in further developing Arabic LLMs, making a leaderboard a valuable shared resource.
- Integrity of Reported Results: Because some platforms required users to evaluate their models independently and then simply submit a file of scores, there was no robust mechanism to ensure those results were accurate or even produced through a genuine evaluation. This lack of centralized verification could potentially undermine the credibility and fairness of the leaderboard.
These limitations underscore the need for a more unified, accessible, and transparent benchmarking platform, one that not only enables but encourages real and reproducible experimentation for the entire Arabic NLP community. To address these issues, in May 2024, 2A2I, TII, and Hugging Face launched the first version of the Open Arabic LLM Leaderboard (OALL) [1], featuring 14 benchmarks across a wide range of tasks including reading comprehension, sentiment analysis, and question answering, among others.
In September 2024, a collaboration between SDAIA and the King Salman Global Academy for Arabic Language introduced the Balsam Index, which includes roughly 1,400 datasets with 50,000 questions covering 67 tasks, such as grammar correction, paraphrasing, cause-and-effect classification, and text comprehension.
Later that year, on December 5th, Inception and MBZUAI announced the AraGen Leaderboard, the first generative-tasks leaderboard for Arabic. It introduces the 3C3H evaluation metric, which uses dynamic evaluation cycles with private test sets, and provides a native-Arabic, culturally aware generative-tasks dataset, AraGen Bench, to assess LLMs across four primary tasks.
To close out the year, on December 19th, 2024, Scale's Safety, Evaluations, and Alignment Lab (SEAL) published an Arabic leaderboard as part of their multilingual leaderboards. The benchmark powering this leaderboard remains private at all times, like those for all the other languages in their family of leaderboards, and relies on human-preference evaluation, using a dataset of 1,000 Arabic prompts designed to test chatbot interaction capabilities across complex and culturally nuanced conversations.
Impact of the previous leaderboard
In less than 7 months since its launch, the first version of the Open Arabic LLM Leaderboard quickly became a major platform for the Arabic AI community, attracting over 46,000 visitors overall and more than 2,000 visits in the past month (January 2025). The Hugging Face space received over 100 likes and 8 citations on Google Scholar. The community submitted more than 700 models, ranging from 1B to over 70B parameters. The submitted models originate from more than 180 unique organizations, making it one of the most active LLM evaluation leaderboards. Since its launch, the leaderboard has sparked numerous engaging discussions across social media, Hugging Face, and Reddit, making it the most prominent Arabic leaderboard to date.
As depicted in Figure 1, among the ~700 models submitted to the initial version of the leaderboard, the majority are chat and fine-tuned models, comprising over 70%, whereas pretrained models constitute only 11%. In terms of model size, more than 50% of the models are smaller than 7B parameters.
Compared with leaderboards for other languages, and as shown in Figure 2, the Open Arabic LLM Leaderboard stands out as one of the most active, following closely behind the Korean, Polish, and Portuguese leaderboards, all within less than a year of its launch. Considering that Arabic is one of the most widely spoken languages globally, yet has relatively limited content available on the web, these figures carry even greater significance compared with other languages.
Why do we need a new leaderboard?
Recent discussions within the community, including critiques of the Open Arabic LLM Leaderboard (OALL) and similar initiatives, have highlighted key shortcomings in current benchmarking practices [2]. Many researchers, developers, and language enthusiasts have emphasized the need for more direct evaluations of Arabic-specific tasks, increased transparency in how benchmarks are created, and the inclusion of more diverse datasets that reflect the breadth of Arabic dialects, domains, and real-world applications. These insights have played a central role in shaping the updated leaderboard.
The Arabic language presents unique challenges and characteristics that require specialized evaluation beyond what general NLP tasks can capture. These include intricate grammar, rich and complex morphology, the diversity of spoken dialects, and culturally nuanced safety-related considerations. A leaderboard that addresses these aspects can provide a clearer picture of how well models perform in real-world Arabic language contexts.
In the first iteration of OALL, a large portion of datasets and tasks originated from non-Arabic-speaking contexts. When adapted to Arabic, these tasks often did not reflect real-world use cases or meet the practical needs of Arabic-speaking communities. Many tasks were direct translations from English, which frequently introduced linguistic and contextual mismatches. This approach neglected Arabic's unique morphological and syntactic complexities, making the tasks less effective in measuring true language understanding and modeling capabilities.
Moreover, some benchmarks from the first version of OALL became less effective over time as models achieved near-perfect scores, limiting their ability to distinguish incremental improvements. In response, the new leaderboard replaces these saturated benchmarks, introducing a more relevant and up-to-date suite of evaluation tasks.
To address these gaps, the new leaderboard incorporates tasks that are natively developed in Arabic. These tasks are designed to capture the language's distinctive features, such as its rich morphology, subtle syntax, and context-specific usage, elements that are often lost in translation-based benchmarks. This shift ensures that evaluations are more authentic and better aligned with the realities of Arabic language use.
Moreover, we identified a silent bug in one of the main tasks, AlGhafa, which inadvertently impacted model rankings. The issue stemmed from a mismatch in how answer choices were checked: rather than verifying their indices, the task evaluated responses based on the choice texts themselves. While this was not entirely incorrect, it affected small/weak models disproportionately. Some models experienced ranking drops of up to 20 points, while stronger models remained relatively unaffected. This issue compromised the consistency, fairness, and uniformity of the evaluations.
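As a toy illustration, and not the actual lighteval/AlGhafa implementation, the snippet below contrasts index-based checking with text-based checking of the selected choice; the function names, example item, and indices are hypothetical:

```python
# Toy illustration of the two checking schemes; not the actual leaderboard code.

def score_by_index(pred_index: int, gold_index: int) -> bool:
    """Correct if the model selects the gold choice's position."""
    return pred_index == gold_index

def score_by_text(pred_text: str, gold_text: str) -> bool:
    """Correct if the selected choice's text matches the gold text.
    This breaks down when choices repeat or differ only in formatting."""
    return pred_text.strip() == gold_text.strip()

choices = ["نعم", "لا", "نعم"]  # hypothetical item with a duplicated choice text
gold_index = 2                  # the intended answer is the third option
pred_index = 0                  # a weak model picks the first option

print(score_by_index(pred_index, gold_index))                   # False
print(score_by_text(choices[pred_index], choices[gold_index]))  # True: undeserved credit
```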
What's new in this version?
In reforming the leaderboard, we follow two guiding principles: remove saturated and machine-translated tasks, due to their inherently lower quality and possible cultural bias, and add newly available, high-quality native or human-curated benchmarks to extend the coverage of the evaluation.
From the first version of the Open Arabic LLM Leaderboard (OALL), we keep the following benchmark datasets:
- AlGhafa benchmark [3]: from the original benchmark released by TII, we keep only the native Arabic datasets, namely the human-curated versions of Facts-Balanced, SOCAL, XGLUE, Sentiment, Sentiment-Rating, and Sentiment-Rating-No-Neutral, the two Arabic tasks from Meta's Belebele [4] (Arabic-MSA and Arabic-Dialects), and finally the Arabic EXAMS benchmark [5].
We enrich the leaderboard by adding the following datasets, released in the past year:
- Native Arabic MMLU [6]: a native Arabic benchmark released by MBZUAI and inspired by the original English MMLU dataset; it consists of 40 tasks and almost 15,000 multiple-choice questions in Modern Standard Arabic (MSA), sourced from school exams.
- Human Translated MMLU (MMLU-HT) [7]: a human translation of the original English MMLU dataset containing 57 tasks, curated by Inception as part of the JAIS project and published under the MBZUAI HF organization.
- MedinaQA: released by MBZUAI in order to foster the adoption of more native Arabic benchmarks. This dataset focuses on general Arabic language and grammar aspects.
- AraTrust [8]: a dataset comprising 522 human-written multiple-choice questions covering different aspects of safety and truthfulness.
Finally, we introduce the ALRAGE benchmark: Arabic Language Retrieval Augmented Generation Evaluation.
It introduces a comprehensive framework for evaluating Large Language Models' retrieval-augmented generation capabilities in Arabic. Built upon a meticulously curated dataset sourced from 40 Arabic books spanning diverse topics, from Arts & Literature to Technology & Innovation, the benchmark was created using meta-llama/Meta-Llama-3.1-70B for synthetic generation and validated by native Arabic speakers through a community sprint with Argilla. The dataset structure includes questions, ground-truth answers, candidate contexts retrieved with the BAAI/bge-m3 embedding model, and target candidate indices, all designed to authentically simulate real-world RAG scenarios in Arabic.
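As a rough sketch of how such candidate contexts could be retrieved with the same embedding model, the snippet below ranks a few passages against a question using BAAI/bge-m3 via sentence-transformers; the passages, question, and record fields are illustrative assumptions, not the official dataset schema:

```python
# Hypothetical sketch of ALRAGE-style context retrieval with BAAI/bge-m3.
# Field names in `record` are illustrative, not the official dataset schema.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")  # multilingual dense embedding model

passages = [
    "مقتطف من كتاب في الأدب العربي ...",
    "مقتطف من كتاب في التقنية والابتكار ...",
    "مقتطف من كتاب في التاريخ ...",
]
question = "ما الفكرة الرئيسية التي يناقشها المقتطف الأدبي؟"

# Embed the question and passages, then rank passages by cosine similarity.
q_emb = model.encode(question, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)
scores = util.cos_sim(q_emb, p_emb)[0]
top = scores.topk(k=2)

record = {
    "question": question,
    "gold_answer": "الإجابة المرجعية المكتوبة والمُتحقَّق منها بشريًا ...",
    "candidate_contexts": [passages[i] for i in top.indices.tolist()],
    "target_candidate_index": 0,  # position of the context supporting the gold answer
}
print(record)
```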
The innovative aspect of ALRAGE lies in its evaluation methodology, which implements an LLM-as-judge metric within the lighteval framework. Using Qwen2.5-72B-Instruct as the judge model, the system evaluates generated responses through a structured Arabic prompt that compares the model's output against gold answers. The evaluation employs a nuanced 0-10 scoring rubric that assesses answer accuracy, relevance, and quality, with scores normalized to a 0-1 range for standardization. This technical implementation, manifested through a custom JudgeMetricWrapper class, provides a rigorous, reproducible method for evaluating Arabic language generation while maintaining sensitivity to Arabic linguistic nuances, effectively addressing the critical need for sophisticated evaluation metrics in Arabic NLP.
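The sketch below illustrates the general LLM-as-judge idea, a 0-10 rubric normalized to the 0-1 range. It is a simplified stand-in, not the actual JudgeMetricWrapper code; the prompt text, the `judge_llm` callable, and the score parsing are assumptions:

```python
# Simplified LLM-as-judge scoring sketch; not the leaderboard's JudgeMetricWrapper.
import re

JUDGE_PROMPT = (
    "أنت حكم لغوي. قارن إجابة النموذج بالإجابة المرجعية وقيّمها من 0 إلى 10 "
    "من حيث الدقة والصلة والجودة. أعد الدرجة فقط.\n"
    "السؤال: {question}\nالإجابة المرجعية: {gold}\nإجابة النموذج: {prediction}"
)

def judge_score(question: str, gold: str, prediction: str, judge_llm) -> float:
    """Ask a judge model (e.g. Qwen2.5-72B-Instruct behind the `judge_llm` callable)
    for a 0-10 score and normalize it to the 0-1 range reported on the leaderboard."""
    prompt = JUDGE_PROMPT.format(question=question, gold=gold, prediction=prediction)
    reply = judge_llm(prompt)                   # assumed callable returning the judge's text
    match = re.search(r"\d+(?:\.\d+)?", reply)  # take the first number in the reply
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 10.0) / 10.0    # clamp to [0, 10], then scale to [0, 1]
```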
Table 1 summarizes the datasets kept from the first version of the leaderboard as well as the new datasets introduced in this second version.
| Datasets kept from OALL v1 | Datasets added for OALL v2 |
|---|---|
| AlGhafa (6 tasks) | Native Arabic MMLU (40 tasks) |
| EXAMS | Human Translated MMLU (57 tasks) |
| Belebele (2 tasks) | MedinaQA |
| | AraTrust |
| | ALRAGE |
Besides adding and removing datasets, we fixed multiple issues related to the UI and its filters, and we also introduced chat templates. In terms of user submissions, the number of submissions is now limited to 5 per organization per week. This limitation is meant to moderate the load on the leaderboard and give various organizations the chance to have their models evaluated. Note that for the models submitted by the OALL team to v2, if a chat template is present in the model's config, it is used for the evaluation; otherwise, the chat template is disabled.
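For reference, here is a minimal sketch of that chat-template rule using the transformers tokenizer API; the model id is only an example:

```python
# Minimal sketch of the chat-template behaviour described above: use the template
# when the tokenizer config ships one, otherwise fall back to the raw prompt.
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"  # example submission; any model id works
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "ما هي عاصمة المغرب؟"

if tokenizer.chat_template is not None:
    # A chat template is present in the config: wrap the prompt as a user turn.
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
else:
    # No chat template found: evaluate on the plain prompt.
    text = prompt

print(text)
```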
Results from v1 and v2
To evaluate the impact of the second iteration of the Open Arabic LLM Leaderboard, we conducted a series of statistical comparisons between the two versions.
Figure 3 displays the performance scores across six benchmarks for Versions 1 and 2. Notably, ACVA and Toxigen display saturation effects at various model sizes. AlGhafa in Version 1 exhibits lower saturation, which we hypothesize is due to the inclusion of both native and translated Arabic benchmarks. In contrast, the models' performance on AraTrust, ALRAGE, and AlGhafa from v2 is more dispersed with respect to model size.
To examine the correlation between OALL and other Arabic LLM leaderboards, we compared the relative rankings of five open Arabic LLMs: google/gemma-2-27b-it, CohereForAI/aya-23-35B, CohereForAI/aya-expanse-32b, inceptionai/jais-adapted-70b-chat, and meta-llama/Llama-3.3-70B-Instruct, across three leaderboards: OALL v2, SEAL Arabic, and AraGen. As illustrated in Figure 4, a correlation between the leaderboards is notable, with the Llama-3.3-70B-Instruct model ranking first on both OALL v2 and AraGen, and third on SEAL. As a clarification, AraGen currently only features scores for inceptionai/jais-adapted-70b-chat, and the Arabic SEAL leaderboard only includes Jais Adapted 70B, presumably the pretrained model. As we could not fully resolve this discrepancy, we decided to evaluate inceptionai/jais-adapted-70b-chat on OALL v2 for this comparison.
To further explore the differences between the two versions of OALL, we present in Figure 5 the top models across two categories: pretrained and chat. For models submitted to OALL v1, Qwen2.5 establishes itself as a strong baseline for Arabic in all categories, particularly for pretrained models. In OALL v2, Qwen models also dominate the pretrained category, although the Qwen/Qwen2-72B model surpasses Qwen/Qwen2.5-72B as the best pretrained/continually pretrained model, and Llama-3.3-70B-Instruct emerges as the leader in all categories, surpassing calme-2.1-qwen2.5-72b in performance. Overall, some model rankings have shifted in v2, while others have remained consistent. We attribute these changes to two key factors: first, the robustness of models with respect to Arabic-native benchmarks, safety, and trustworthiness; and second, the evaluation of over 700 models in OALL v1 compared with 80 models in v2, including a number of recent models that may not be present in v1. We anticipate that the community will contribute to expanding the leaderboard following its release.

Finally, we analyzed the average scores on OALL v1 and v2 for two model families: AceGPT and Jais. As depicted in Figure 6, the trend is consistent across both versions: larger models tend to achieve higher average scores, except for inceptionai/jais-family-30b-8k, which surpasses the larger inceptionai/jais-adapted-70b model on OALL v2. Overall, the average scores in v2 are higher than in v1, apart from the 7B models in both families. We hypothesize that this discrepancy is due to the lower performance of smaller models on ALRAGE, since it is a generative task, which typically favors larger models.
Conclusion and future work
In this blog post, we introduced the second version of the Open Arabic LLM Leaderboard. We analyzed the existing Arabic leaderboards as well as the first version of OALL, noting issues such as the saturation of specific benchmarks, which were removed in this second iteration. We also removed machine-translated benchmarks and retained only the native Arabic and human-translated benchmarks. Finally, we introduced new benchmarks such as AraTrust, MedinaQA, Native Arabic MMLU, Human Translated MMLU (MMLU-HT), and ALRAGE. Our goal is to provide the community with an objective evaluation of Arabic LLMs, aiding in the understanding of the strengths and weaknesses of each submitted model.
Looking ahead, we hope to see the release of additional Arabic benchmarks, particularly in areas such as mathematics, reasoning, and hallucination, covering both general and domain-specific needs.
Acknowledgments
The authors would like to thank Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) for providing some of the new native benchmarks we are using in this version, including the new MMLU-HT dataset. We also extend our gratitude to TII for their generous sponsorship of the inference hardware needed for the evaluation backend. We also thank our friends at Hugging Face for their continuous support and for always being 🤗 whenever needed. Thanks to everyone working on evaluation and leaderboards for their languages and tasks. Lastly, we thank the community for their engagement and valuable feedback on the first version of OALL. We look forward to seeing many models on the leaderboard 🚀.
Citations
@misc{OALL2,
author = {El Filali, Ali and ALOUI, Manel and Husaain, Tarique and Alzubaidi, Ahmed and Boussaha, Basma El Amel and Cojocaru, Ruxandra and Fourrier, Clémentine and Habib, Nathan and Hacid, Hakim},
title = {The Open Arabic LLM Leaderboard 2},
year = {2025},
publisher = {OALL},
howpublished = {https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard}
}
References
- [1] Introducing the Open Arabic LLM Leaderboard (El Filali et al., 2024)
- [2] CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks (Qian et al., 2024)
- [3] AlGhafa Evaluation Benchmark for Arabic Language Models (Almazrouei et al., ArabicNLP 2023)
- [4] The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants (Bandarkar et al., ACL, 2023)
- [5] EXAMS: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering (Hardalov et al., EMNLP, 2023)
- [6] ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic (Koto et al., ACL, 2024)
- [7] Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models (Sengupta et al., 2023)
- [8] AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic (Alghamdi et al., 2024)
- [9] LightEval: A light-weight framework for LLM evaluation (Fourrier et al., 2023)
