The Open Arabic LLM Leaderboard (OALL) is designed to address the growing need for specialized benchmarks within the Arabic language processing domain. As the field of Natural Language Processing (NLP) progresses, the focus often remains heavily skewed towards English, leaving a significant gap in resources for other languages. The OALL aims to balance this by providing a platform specifically for evaluating and comparing the performance of Arabic Large Language Models (LLMs), thus promoting research and development in Arabic NLP.
This initiative is especially significant given that it directly serves over 380 million Arabic speakers worldwide. By enhancing the ability to accurately evaluate and improve Arabic LLMs, we hope the OALL will play a vital role in developing models and applications that are finely tuned to the nuances of the Arabic language, culture and heritage.
Benchmarks, Metrics & Technical setup
Benchmark Datasets
The Open Arabic LLM Leaderboard (OALL) utilizes an extensive and diverse collection of robust datasets to ensure comprehensive model evaluation.
- AlGhafa benchmark: created by the TII LLM team with the goal of evaluating models on a range of abilities including reading comprehension, sentiment analysis, and question answering. It was initially introduced with 11 native Arabic datasets and was later extended to include an additional 11 datasets that are translations of other widely adopted benchmarks within the English NLP community.
- ACVA and AceGPT benchmarks: feature 58 datasets from the paper “AceGPT, Localizing Large Language Models in Arabic”, as well as translated versions of the MMLU and EXAMS benchmarks, broadening the evaluation spectrum and covering a comprehensive range of linguistic tasks. These benchmarks are meticulously curated and feature various subsets that precisely capture the complexities and subtleties of the Arabic language (they can be inspected directly from the Hugging Face Hub, as sketched below).
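As an illustration, a benchmark subset can be pulled down with the `datasets` library. The repository id and subset name below are hypothetical placeholders; browse the OALL organization on the Hub for the actual dataset paths.

```python
# Illustrative only: the dataset repo id and subset name are placeholders;
# check the OALL organization on the Hugging Face Hub for the real paths.
from datasets import load_dataset

acva = load_dataset("OALL/ACVA", "Algeria", split="test")  # hypothetical id/subset
print(acva[0])  # one row: a statement/question plus its gold answer field
```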
Evaluation Metrics
Given the nature of the tasks, which include multiple-choice and yes/no questions, the leaderboard primarily uses normalized log-likelihood accuracy for all tasks. This metric was chosen for its ability to provide a clear and fair measurement of model performance across different types of questions.
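To make the metric concrete, here is a minimal sketch of normalized log-likelihood scoring for a single multiple-choice question. The model name, the character-length normalization, and the simple context/choice tokenization split are illustrative assumptions, not the leaderboard's exact harness code.

```python
# A minimal sketch of normalized log-likelihood accuracy for one
# multiple-choice question; simplifying assumptions are noted inline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: any causal LM on the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, given the context."""
    # Simplification: assumes tokenizing context and context+choice
    # splits cleanly at the context boundary.
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # The logits at position i - 1 predict token i of the input.
    for i in range(ctx_len, full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

def predict(context: str, choices: list[str]) -> int:
    """Index of the best choice, normalized by character length."""
    scores = [choice_logprob(context, c) / len(c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```

Accuracy is then simply the fraction of questions for which the predicted index matches the gold index; the normalization step keeps longer answer options from being unfairly penalized.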
Technical setup
The technical setup for the Open Arabic LLM Leaderboard (OALL) uses:
- front- and back-ends inspired by the demo-leaderboard, with the back-end running locally on the TII cluster
- the lighteval library to run the evaluations. Significant contributions have been made to integrate the Arabic benchmarks discussed above into lighteval, to support out-of-the-box evaluations of Arabic models for the community (see PR #44 and PR #95 on GitHub for more details).
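For a sense of what that integration looks like, below is a rough sketch of declaring an Arabic multiple-choice benchmark as a lighteval community task. The field names follow lighteval's early community-task API and may differ across releases, and the task name, dataset columns, and repository id are placeholders; the actual integration lives in the PRs linked above.

```python
# A rough sketch of a lighteval community task for an Arabic MCQ benchmark.
# Field names follow lighteval's early community-task API and may differ
# across versions; the task name, repo id, and column names are placeholders.
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def arabic_mcq_prompt(line, task_name: str = None):
    # Map one dataset row to lighteval's Doc format.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=line["choices"],
        gold_index=line["answer_index"],
    )

TASKS_TABLE = [
    LightevalTaskConfig(
        name="arabic_example_task",
        prompt_function="arabic_mcq_prompt",  # resolved by name in this module
        suite=["community"],
        hf_repo="OALL/example-arabic-benchmark",  # placeholder repo
        hf_subset="default",
        hf_avail_splits=["test"],
        evaluation_splits=["test"],
        metric=["loglikelihood_acc_norm"],
    )
]
```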
Future Directions
We have many ideas for expanding the scope of the Open Arabic LLM Leaderboard. Plans are in place to introduce additional leaderboards under various categories, such as one for evaluating Arabic LLMs in Retrieval Augmented Generation (RAG) scenarios and another as a chatbot arena that calculates the ELO scores of different Arabic chatbots based on user preferences.
Moreover, we aim to extend our benchmarks to cover more comprehensive tasks by developing the OpenDolphin benchmark, which will include about 50 datasets and will be an open replication of the work done by Nagoudi et al. in the paper titled “Dolphin: A Challenging and Diverse Benchmark for Arabic NLG”. For those interested in adding their benchmarks or collaborating on the OpenDolphin project, please contact us through the discussion tab or at this email address.
We’d love to welcome your contributions on these fronts! We encourage the community to contribute by submitting models, suggesting new benchmarks, or participating in discussions. We also encourage the community to make use of the top models of the current leaderboard to create new models through fine-tuning or any other techniques that might help your model climb the ranks to first place! You could be the next Arabic Open Models Hero!
We hope the OALL will encourage technological advancements and highlight the unique linguistic and cultural characteristics inherent to the Arabic language, and that our technical setup and the lessons learned from deploying a large-scale, language-specific leaderboard will be helpful for similar initiatives in other underrepresented languages. This focus will help bridge the gap in resources and research, traditionally dominated by English-centric models, enriching the global NLP landscape with more diverse and inclusive tools, which is crucial as AI technologies become increasingly integrated into everyday life around the globe.
Submit Your Model!
Model Submission Process
To ensure a smooth evaluation process, participants must adhere to specific guidelines when submitting models to the Open Arabic LLM Leaderboard:
- Ensure Model Precision Alignment: It is critical that the precision of the submitted models aligns with that of the original models. Discrepancies in precision may result in the model being evaluated but not properly displayed on the leaderboard.
- Pre-Submission Checks:
  - Load Model and Tokenizer: Confirm that your model and tokenizer can be successfully loaded using AutoClasses. Use the following commands:

    ```python
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    config = AutoConfig.from_pretrained("your model name", revision=revision)
    model = AutoModel.from_pretrained("your model name", revision=revision)
    tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
    ```

    If you encounter errors, follow the error messages to make sure your model has been correctly uploaded.
  - Model Visibility: Make sure that your model is set to public visibility. Additionally, note that if your model requires `use_remote_code=True`, this feature is not currently supported but is under development.
- Convert Model Weights to Safetensors: Convert your model weights to safetensors, a safer and faster format for loading and using weights. This conversion also enables the inclusion of the model’s parameter count in the Extended Viewer (see the combined pre-submission check sketched after this list).
- License and Model Card:
  - Open License: Verify that your model is openly licensed. This leaderboard promotes the accessibility of open LLMs to ensure widespread usability.
  - Complete Model Card: Populate your model card with detailed information. This data will be automatically extracted and displayed alongside your model on the leaderboard.
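To make these checks easy to repeat, here is a minimal pre-submission sanity check combining the visibility, loading, and safetensors steps. It uses standard huggingface_hub and transformers calls; the repository id is a placeholder, and the script is a convenience sketch, not part of the official submission pipeline.

```python
# A minimal pre-submission sanity check; the repo id is a placeholder,
# and AutoModelForCausalLM assumes your model is a causal LM.
from huggingface_hub import HfApi
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/your-model"  # placeholder

# 1. Visibility: the leaderboard requires a public repository.
info = HfApi().model_info(repo_id)
assert not info.private, "Set the model to public visibility before submitting."

# 2. Loading: the model and tokenizer must load with AutoClasses.
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# 3. Safetensors: re-save the weights in the safetensors format.
model.save_pretrained("local-checkpoint", safe_serialization=True)
```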
In Case of Model Failure
If your model appears in the ‘FAILED’ category, this indicates that execution was halted. Review the steps outlined above to troubleshoot and resolve any issues, and confirm that the loading snippet and the pre-submission check above run successfully on your model locally before resubmitting.
Acknowledgements
We extend our gratitude to all contributors, partners, and sponsors, particularly the Technology Innovation Institute and Hugging Face for their substantial support of this project. TII has generously provided the essential computational resources, in line with their commitment to supporting community-driven projects and advancing open science within the Arabic NLP field, while Hugging Face has assisted with the integration and customization of their new evaluation framework and leaderboard template.
We would also like to express our thanks to Upstage for their work on the Open Ko-LLM Leaderboard, which served as a valuable reference and source of inspiration for our own efforts. Their pioneering contributions have been instrumental in guiding our approach to developing a comprehensive and inclusive Arabic LLM leaderboard.
Citations and References
@misc{OALL,
author = {El Filali, Ali and Alobeidli, Hamza and Fourrier, Clémentine and Boussaha, Basma El Amel and Cojocaru, Ruxandra and Habib, Nathan and Hacid, Hakim},
title = {Open Arabic LLM Leaderboard},
year = {2024},
publisher = {OALL},
howpublished = "\url{https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard}"
}
@inproceedings{almazrouei-etal-2023-alghafa,
title = "{A}l{G}hafa Evaluation Benchmark for {A}rabic Language Models",
creator = "Almazrouei, Ebtesam and
Cojocaru, Ruxandra and
Baldo, Michele and
Malartic, Quentin and
Alobeidli, Hamza and
Mazzotta, Daniele and
Penedo, Guilherme and
Campesan, Giulia and
Farooq, Mugariya and
Alhammadi, Maitha and
Launay, Julien and
Noune, Badreddine",
editor = "Sawaf, Hassan and
El-Beltagy, Samhaa and
Zaghouani, Wajdi and
Magdy, Walid and
Abdelali, Ahmed and
Tomeh, Nadi and
Abu Farha, Ibrahim and
Habash, Nizar and
Khalifa, Salam and
Keleg, Amr and
Haddad, Hatem and
Zitouni, Imed and
Mrini, Khalil and
Almatham, Rawan",
booktitle = "Proceedings of ArabicNLP 2023",
month = dec,
12 months = "2023",
address = "Singapore (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.arabicnlp-1.21",
doi = "10.18653/v1/2023.arabicnlp-1.21",
pages = "244--275",
abstract = "Recent advances within the space of Arabic large language models have opened up a wealth of potential practical applications. From optimal training strategies, large scale data acquisition and constantly increasing NLP resources, the Arabic LLM landscape has improved in a really short span of time, despite being tormented by training data scarcity and limited evaluation resources in comparison with English. In step with contributing towards this ever-growing field, we introduce AlGhafa, a brand new multiple-choice evaluation benchmark for Arabic LLMs. For showcasing purposes, we train a brand new suite of models, including a 14 billion parameter model, the biggest monolingual Arabic decoder-only model up to now. We use a set of publicly available datasets, in addition to a newly introduced HandMade dataset consisting of 8 billion tokens. Finally, we explore the quantitative and qualitative toxicity of several Arabic models, comparing our models to existing public Arabic LLMs.",
}
@misc{huang2023acegpt,
title={AceGPT, Localizing Large Language Models in Arabic},
author={Huang Huang and Fei Yu and Jianqing Zhu and Xuening Sun and Hao Cheng and Dingjie Song and Zhihong Chen and Abdulmohsen Alharthi and Bang An and Ziche Liu and Zhiyi Zhang and Junying Chen and Jianquan Li and Benyou Wang and Lian Zhang and Ruoyu Sun and Xiang Wan and Haizhou Li and Jinchao Xu},
year={2023},
eprint={2309.12053},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{lighteval,
author = {Fourrier, Clémentine and Habib, Nathan and Wolf, Thomas and Tunstall, Lewis},
title = {LightEval: A lightweight framework for LLM evaluation},
year = {2023},
version = {0.3.0},
url = {https://github.com/huggingface/lighteval}
}
