This project addresses the critical need for advancement in Hebrew NLP. Because Hebrew is considered a low-resource language, existing LLM leaderboards often lack benchmarks that accurately reflect its unique characteristics. Today, we're excited to introduce a pioneering effort to change this: our new open LLM leaderboard, specifically designed to evaluate and enhance language models in Hebrew.
Hebrew is a morphologically rich language with a complex system of roots and patterns. Words are built from roots, with prefixes, suffixes, and infixes used to modify meaning, tense, or to form plurals, among other functions. This complexity means that a single root can give rise to many valid word forms, making traditional tokenization strategies, designed for morphologically simpler languages, ineffective. Consequently, existing language models may struggle to accurately process and understand the nuances of Hebrew, highlighting the need for benchmarks that cater to these unique linguistic properties.
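To make this concrete, here is a minimal sketch, assuming the `transformers` library is installed and using the GPT-2 tokenizer purely as an example of a tokenizer not tailored to Hebrew, of how several forms sharing a single root get fragmented into many subword pieces:

```python
# Sketch: how an English-centric subword tokenizer fragments Hebrew forms
# that all derive from the root k-t-v ("to write"). The gpt2 tokenizer is
# used only as an illustrative non-Hebrew-aware example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Several valid surface forms built from the same three-letter root:
forms = {
    "כתב": "he wrote",
    "כתבה": "she wrote / an article",
    "מכתב": "a letter",
    "כתיבה": "writing",
}

for form, gloss in forms.items():
    tokens = tokenizer.tokenize(form)
    # Each short Hebrew word is typically split into many byte-level pieces,
    # obscuring the shared root entirely.
    print(f"{form} ({gloss}): {len(tokens)} tokens -> {tokens}")
```

A tokenizer trained with Hebrew in mind would keep these forms far more compact and make the shared root easier for a model to exploit.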
LLM research in Hebrew therefore needs dedicated benchmarks that cater specifically to the nuances and linguistic properties of the language. Our leaderboard aims to fill this void by providing robust evaluation metrics on language-specific tasks and by promoting an open, community-driven improvement of generative language models in Hebrew.
We believe this initiative will serve as a platform for researchers and developers to share, compare, and improve Hebrew LLMs.
Leaderboard Metrics and Tasks
We have developed four key datasets, each designed to test language models on their understanding and generation of Hebrew, independently of their performance in other languages. These benchmarks use a few-shot prompt format to evaluate the models, ensuring that they can adapt and respond appropriately even with limited context; a generic illustration of this format is sketched below.
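As an illustration only (the leaderboard's actual templates are documented in its About tab, and the field names, wording, and labels here are hypothetical), a few-shot prompt for a sentiment-style task might be assembled along these lines:

```python
# Illustrative few-shot prompt assembly. The example texts, labels, and
# template wording are hypothetical; see the leaderboard's About tab for
# the real prompt formats.
few_shot_examples = [
    {"text": "איזה יום נהדר!", "label": "חיובי"},          # "What a great day!" -> positive
    {"text": "השירות היה איטי מאוד.", "label": "שלילי"},  # "The service was very slow." -> negative
]

def build_prompt(examples: list[dict], query: str) -> str:
    """Concatenate labeled examples, then the unlabeled query, into one prompt."""
    lines = []
    for ex in examples:
        lines.append(f"טקסט: {ex['text']}\nסנטימנט: {ex['label']}\n")
    # The final block leaves the label blank for the model to complete.
    lines.append(f"טקסט: {query}\nסנטימנט:")
    return "\n".join(lines)

print(build_prompt(few_shot_examples, "האוכל היה מצוין."))  # "The food was excellent."
```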
Below is a summary of each benchmark included in the leaderboard. For a more comprehensive breakdown of each dataset, its scoring system, and its prompt construction, please visit the About tab of our leaderboard.
- Hebrew Question Answering: This task evaluates a model's ability to understand and process information presented in Hebrew, focusing on comprehension and the accurate retrieval of answers based on context. It tests the model's grasp of Hebrew syntax and semantics through direct question-and-answer formats.
  - Source: HeQ dataset's test subset.
- Sentiment Accuracy: This benchmark tests the model's ability to detect and interpret sentiments in Hebrew text. It assesses the model's ability to accurately classify statements as positive, negative, or neutral based on linguistic cues (see the scoring sketch after this list).
- Winograd Schema Challenge: This task is designed to measure the model's understanding of pronoun resolution and contextual ambiguity in Hebrew. It tests the model's ability to use logical reasoning and general world knowledge to disambiguate pronouns correctly in complex sentences.
- Translation: This task assesses the model's proficiency in translating between English and Hebrew. It evaluates linguistic accuracy, fluency, and the ability to preserve meaning across languages, highlighting the model's capability in bilingual translation tasks.
Technical Setup
The leaderboard is inspired by the Open LLM Leaderboard and uses the Demo Leaderboard template. Submitted models are automatically deployed using Hugging Face's Inference Endpoints and evaluated through API requests managed by the lighteval library.
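For context, querying a deployed Inference Endpoint boils down to authenticated POST requests. The sketch below uses plain `requests` rather than lighteval's own machinery, and the endpoint URL is a placeholder:

```python
# Minimal sketch of querying a Hugging Face Inference Endpoint directly.
# In the leaderboard these requests are managed by lighteval; the URL
# below is a placeholder for an actual deployed endpoint.
import os
import requests

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder
headers = {
    "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
    "Content-Type": "application/json",
}
payload = {
    # "Question: What is the capital of Israel?\nAnswer:"
    "inputs": "שאלה: מהי בירת ישראל?\nתשובה:",
    "parameters": {"max_new_tokens": 32, "do_sample": False},
}

response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
response.raise_for_status()
print(response.json())  # e.g. [{"generated_text": "..."}]
```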
The implementation was straightforward: the main task was setting up the environment, and the rest of the code ran smoothly.
Engage with Us
We invite researchers, developers, and enthusiasts to take part in this initiative. Whether you are interested in submitting your model for evaluation or joining the discussion on improving Hebrew language technologies, your contribution is crucial. Visit the submission page on the leaderboard for guidelines on how to submit models for evaluation, or join the discussion page on the leaderboard's HF space.
This new leaderboard is not just a benchmarking tool; we hope it will encourage the Israeli tech community to recognize and address the gaps in language technology research for Hebrew. By providing detailed, specific evaluations, we aim to catalyze the development of models that are not only linguistically diverse but also culturally accurate, paving the way for innovations that honor the richness of the Hebrew language.
Join us on this exciting journey to reshape the landscape of language modeling!
Sponsorship
The leaderboard is proudly sponsored by DDR&D IMOD / The Israeli National Program for NLP in Hebrew and Arabic, in collaboration with DICTA: The Israel Center for Text Analysis, and Webiks, a testament to the commitment to advancing language technologies in Hebrew. We would like to extend our gratitude to Prof. Reut Tsarfaty from Bar-Ilan University for her scientific consultation and guidance.
