The Open Arabic LLM Leaderboard (OALL) is designed to address the growing need for specialized benchmarks within the Arabic language processing domain. As the field of Natural Language Processing (NLP) progresses, the focus often remains heavily skewed towards English, leaving a significant gap in resources for other languages. The OALL aims to balance this by providing a platform specifically for evaluating and comparing the performance of Arabic Large Language Models (LLMs), thus promoting research and development in Arabic NLP.
This initiative is especially significant given that it directly serves over 380 million Arabic speakers worldwide. By enhancing the ability to accurately evaluate and improve Arabic LLMs, we hope the OALL will play a vital role in developing models and applications that are finely tuned to the nuances of the Arabic language, culture and heritage.
Benchmarks, Metrics & Technical setup
Benchmark Datasets
The Open Arabic LLM Leaderboard (OALL) utilizes an extensive and diverse collection of robust datasets to ensure comprehensive model evaluation.
- AlGhafa benchmark: created by the TII LLM team with the goal of evaluating models on a range of abilities including reading comprehension, sentiment analysis, and question answering. It was initially introduced with 11 native Arabic datasets and was later extended to include an additional 11 datasets that are translations of other widely adopted benchmarks within the English NLP community.
- ACVA and AceGPT benchmarks: feature 58 datasets from the paper “AceGPT, Localizing Large Language Models in Arabic”, as well as translated versions of the MMLU and EXAMS benchmarks, broadening the evaluation spectrum and covering a comprehensive range of linguistic tasks. These benchmarks are meticulously curated and feature various subsets that precisely capture the complexities and subtleties of the Arabic language (they can be inspected directly from the Hugging Face Hub, as sketched below).
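As an illustration, a benchmark subset can be pulled down with the `datasets` library. The repository id and subset name below are hypothetical placeholders; browse the OALL organization on the Hub for the actual dataset paths.

```python
# Illustrative only: the dataset repo id and subset name are placeholders;
# check the OALL organization on the Hugging Face Hub for the real paths.
from datasets import load_dataset

acva = load_dataset("OALL/ACVA", "Algeria", split="test")  # hypothetical id/subset
print(acva[0])  # one row: a statement/question plus its gold answer field
```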
Evaluation Metrics
Given the nature of the tasks, which include multiple-choice and yes/no questions, the leaderboard primarily uses normalized log-likelihood accuracy for all tasks. This metric was chosen for its ability to provide a clear and fair measurement of model performance across different types of questions.
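To make the metric concrete, here is a minimal sketch of normalized log-likelihood scoring for a single multiple-choice question. The model name, the character-length normalization, and the simple context/choice tokenization split are illustrative assumptions, not the leaderboard's exact harness code.

```python
# A minimal sketch of normalized log-likelihood accuracy for one
# multiple-choice question; simplifying assumptions are noted inline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: any causal LM on the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, given the context."""
    # Simplification: assumes tokenizing context and context+choice
    # splits cleanly at the context boundary.
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # The logits at position i - 1 predict token i of the input.
    for i in range(ctx_len, full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

def predict(context: str, choices: list[str]) -> int:
    """Index of the best choice, normalized by character length."""
    scores = [choice_logprob(context, c) / len(c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```

Accuracy is then simply the fraction of questions for which the predicted index matches the gold index; the normalization step keeps longer answer options from being unfairly penalized.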
Technical setup
The technical setup for the Open Arabic LLM Leaderboard (OALL) uses:
- front- and back-ends inspired by the demo-leaderboard, with the back-end running locally on the TII cluster
- the lighteval library to run the evaluations. Significant contributions have been made to integrate the Arabic benchmarks discussed above into lighteval, to support out-of-the-box evaluations of Arabic models for the community (see PR #44 and PR #95 on GitHub for more details).
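For a sense of what that integration looks like, below is a rough sketch of declaring an Arabic multiple-choice benchmark as a lighteval community task. The field names follow lighteval's early community-task API and may differ across releases, and the task name, dataset columns, and repository id are placeholders; the actual integration lives in the PRs linked above.

```python
# A rough sketch of a lighteval community task for an Arabic MCQ benchmark.
# Field names follow lighteval's early community-task API and may differ
# across versions; the task name, repo id, and column names are placeholders.
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def arabic_mcq_prompt(line, task_name: str = None):
    # Map one dataset row to lighteval's Doc format.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=line["choices"],
        gold_index=line["answer_index"],
    )

TASKS_TABLE = [
    LightevalTaskConfig(
        name="arabic_example_task",
        prompt_function="arabic_mcq_prompt",  # resolved by name in this module
        suite=["community"],
        hf_repo="OALL/example-arabic-benchmark",  # placeholder repo
        hf_subset="default",
        hf_avail_splits=["test"],
        evaluation_splits=["test"],
        metric=["loglikelihood_acc_norm"],
    )
]
```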
Future Directions
We have many ideas for expanding the scope of the Open Arabic LLM Leaderboard. Plans are in place to introduce additional leaderboards under various categories, such as one for evaluating Arabic LLMs in Retrieval Augmented Generation (RAG) scenarios and another as a chatbot arena that calculates the ELO scores of different Arabic chatbots based on user preferences.
Moreover, we aim to extend our benchmarks to cover more comprehensive tasks by developing the OpenDolphin benchmark, which will include about 50 datasets and will be an open replication of the work done by Nagoudi et al. in the paper titled “Dolphin: A Challenging and Diverse Benchmark for Arabic NLG”. For those interested in adding their benchmarks or collaborating on the OpenDolphin project, please contact us through the discussion tab or at this email address.
We’d love to welcome your contributions on these fronts! We encourage the community to contribute by submitting models, suggesting new benchmarks, or participating in discussions. We also encourage the community to make use of the top models of the current leaderboard to create new models through fine-tuning or any other techniques that might help your model climb the ranks to first place! You could be the next Arabic Open Models Hero!
We hope the OALL will encourage technological advancements and highlight the unique linguistic and cultural characteristics inherent to the Arabic language, and that our technical setup and the lessons learned from deploying a large-scale, language-specific leaderboard will be helpful for similar initiatives in other underrepresented languages. This focus will help bridge the gap in resources and research, traditionally dominated by English-centric models, enriching the global NLP landscape with more diverse and inclusive tools, which is crucial as AI technologies become increasingly integrated into everyday life around the globe.
Submit Your Model!
Model Submission Process
To ensure a smooth evaluation process, participants must adhere to specific guidelines when submitting models to the Open Arabic LLM Leaderboard:
- Ensure Model Precision Alignment: It is critical that the precision of the submitted models aligns with that of the original models. Discrepancies in precision may result in the model being evaluated but not properly displayed on the leaderboard.
- Pre-Submission Checks:
  - Load Model and Tokenizer: Confirm that your model and tokenizer can be successfully loaded using AutoClasses. Use the following commands:

    ```python
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    config = AutoConfig.from_pretrained("your model name", revision=revision)
    model = AutoModel.from_pretrained("your model name", revision=revision)
    tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
    ```

    If you encounter errors, follow the error messages to make sure your model has been correctly uploaded.
  - Model Visibility: Make sure that your model is set to public visibility. Additionally, note that if your model requires `use_remote_code=True`, this feature is not currently supported but is under development.
- Convert Model Weights to Safetensors: Convert your model weights to safetensors, a safer and faster format for loading and using weights. This conversion also enables the inclusion of the model’s parameter count in the Extended Viewer (see the combined pre-submission check sketched after this list).
- License and Model Card:
  - Open License: Verify that your model is openly licensed. This leaderboard promotes the accessibility of open LLMs to ensure widespread usability.
  - Complete Model Card: Populate your model card with detailed information. This data will be automatically extracted and displayed alongside your model on the leaderboard.
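To make these checks easy to repeat, here is a minimal pre-submission sanity check combining the visibility, loading, and safetensors steps. It uses standard huggingface_hub and transformers calls; the repository id is a placeholder, and the script is a convenience sketch, not part of the official submission pipeline.

```python
# A minimal pre-submission sanity check; the repo id is a placeholder,
# and AutoModelForCausalLM assumes your model is a causal LM.
from huggingface_hub import HfApi
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/your-model"  # placeholder

# 1. Visibility: the leaderboard requires a public repository.
info = HfApi().model_info(repo_id)
assert not info.private, "Set the model to public visibility before submitting."

# 2. Loading: the model and tokenizer must load with AutoClasses.
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# 3. Safetensors: re-save the weights in the safetensors format.
model.save_pretrained("local-checkpoint", safe_serialization=True)
```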
In Case of Model Failure
If your model appears in the ‘FAILED’ category, this indicates that execution was halted. Review the steps outlined above to troubleshoot and resolve any issues, and confirm that the loading snippet and the pre-submission check above run successfully on your model locally before resubmitting.
Acknowledgements
We extend our gratitude to all contributors, partners, and sponsors, particularly the Technology Innovation Institute and Hugging Face for their substantial support of this project. TII has generously provided the essential computational resources, in line with their commitment to supporting community-driven projects and advancing open science within the Arabic NLP field, while Hugging Face has assisted with the integration and customization of their new evaluation framework and leaderboard template.
We would also like to express our thanks to Upstage for their work on the Open Ko-LLM Leaderboard, which served as a valuable reference and source of inspiration for our own efforts. Their pioneering contributions have been instrumental in guiding our approach to developing a comprehensive and inclusive Arabic LLM leaderboard.
Citations and References
@misc{OALL,
author = {El Filali, Ali and Alobeidli, Hamza and Fourrier, Clémentine and Boussaha, Basma El Amel and Cojocaru, Ruxandra and Habib, Nathan and Hacid, Hakim},
title = {Open Arabic LLM Leaderboard},
year = {2024},
publisher = {OALL},
howpublished = "\url{https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard}"
}
@inproceedings{almazrouei-etal-2023-alghafa,
title = "{A}l{G}hafa Evaluation Benchmark for {A}rabic Language Models",
creator = "Almazrouei, Ebtesam and
Cojocaru, Ruxandra and
Baldo, Michele and
Malartic, Quentin and
Alobeidli, Hamza and
Mazzotta, Daniele and
Penedo, Guilherme and
Campesan, Giulia and
Farooq, Mugariya and
Alhammadi, Maitha and
Launay, Julien and
Noune, Badreddine",
editor = "Sawaf, Hassan and
El-Beltagy, Samhaa and
Zaghouani, Wajdi and
Magdy, Walid and
Abdelali, Ahmed and
Tomeh, Nadi and
Abu Farha, Ibrahim and
Habash, Nizar and
Khalifa, Salam and
Keleg, Amr and
Haddad, Hatem and
Zitouni, Imed and
Mrini, Khalil and
Almatham, Rawan",
booktitle = "Proceedings of ArabicNLP 2023",
month = dec,
12 months = "2023",
address = "Singapore (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.arabicnlp-1.21",
doi = "10.18653/v1/2023.arabicnlp-1.21",
pages = "244--275",
abstract = "Recent advances within the space of Arabic large language models have opened up a wealth of potential practical applications. From optimal training strategies, large scale data acquisition and constantly increasing NLP resources, the Arabic LLM landscape has improved in a really short span of time, despite being tormented by training data scarcity and limited evaluation resources in comparison with English. In step with contributing towards this ever-growing field, we introduce AlGhafa, a brand new multiple-choice evaluation benchmark for Arabic LLMs. For showcasing purposes, we train a brand new suite of models, including a 14 billion parameter model, the biggest monolingual Arabic decoder-only model up to now. We use a set of publicly available datasets, in addition to a newly introduced HandMade dataset consisting of 8 billion tokens. Finally, we explore the quantitative and qualitative toxicity of several Arabic models, comparing our models to existing public Arabic LLMs.",
}
@misc{huang2023acegpt,
title={AceGPT, Localizing Large Language Models in Arabic},
author={Huang Huang and Fei Yu and Jianqing Zhu and Xuening Sun and Hao Cheng and Dingjie Song and Zhihong Chen and Abdulmohsen Alharthi and Bang An and Ziche Liu and Zhiyi Zhang and Junying Chen and Jianquan Li and Benyou Wang and Lian Zhang and Ruoyu Sun and Xiang Wan and Haizhou Li and Jinchao Xu},
year={2023},
eprint={2309.12053},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{lighteval,
author = {Fourrier, Clémentine and Habib, Nathan and Wolf, Thomas and Tunstall, Lewis},
title = {LightEval: A lightweight framework for LLM evaluation},
year = {2023},
version = {0.3.0},
url = {https://github.com/huggingface/lighteval}
}
