Within the rapidly evolving landscape of huge language models (LLMs), comprehensive and robust evaluation methodologies remain a critical challenge, particularly for low-resource languages. On this blog, we introduce AraGen, a generative tasks benchmark and leaderboard for Arabic LLMs, based on 3C3H, a brand new evaluation measure for NLG which we hope will encourage work for other languages as well.
The AraGen leaderboard makes three key contributions:
-
3C3H Measure: The 3C3H measure scores a model’s response and is central to this framework. It’s a holistic approach assessing model responses across multiple dimensions –Correctness, Completeness, Conciseness, Helpfulness, Honesty, and Harmlessness- based on LLM-as-judge.
-
Dynamic Evaluations: AraGen Leaderboard implements a dynamic evaluation strategy, which incorporates three-month blind testing cycles, where the datasets and the evaluation code remain private before being publicly released at the tip of the cycle, and replaced by a brand new private benchmark.
-
Arabic Evaluation Dataset: AraGen Benchmark offers a meticulously constructed evaluation dataset for Arabic LLM evaluation, combining multi-turn and single-turn scenarios, which tests the model capability across multiple domains and tasks.
We consider that AraGen addresses persistent issues of information contamination with its dynamic evaluation approach, preserving the benchmark’s integrity. It also serves as the primary application of a scalable, language-agnostic framework for a nuanced and fair model assessment, which represents a crucial effort in understanding LLM performance across diverse linguistic contexts and sets a brand new standard for comprehensive model benchmarking.
Summary
Evaluating large language models (LLMs) is a key challenge in AI research. While existing methodologies have improved our understanding of LLM capabilities, they often fail to comprehensively address each factuality—assessing a model’s core knowledge—and usability—its alignment with human (end user) expectations. Current evaluation approaches can broadly be categorized into knowledge or factuality-based benchmarks and preference-based benchmarks.
Automatic benchmarks deal with evaluating foundational knowledge and factual correctness. For example, initiatives just like the Open LLM Leaderboard by Hugging Face assess the likelihood of the alternatives for a given prompt (query) and compare the probably output with a golden reference selection. While effective in testing core knowledge, these benchmarks provide limited insight into how models perform in practical, user-facing contexts, leaving critical features of usability unaddressed.
In contrast, preference-based benchmarks aim to capture alignment with the user preferences. Examples include LMSYS’s Chatbot Arena and AtlaAI’s Judge Arena, which mostly depend on subjective assessments of outputs based on style, tone, and overall utility. Nonetheless, these approaches risk prioritizing stylistic alignment over factual accuracy, potentially skewing evaluations toward stylistically preferred yet less accurate responses. Moreover, crowdsourced arenas can reflect the biases of their annotators, who may lack strong voting guidelines, further impacting the consistency and reliability of evaluations.
To deal with these limitations, we propose a brand new evaluation measure that goals to mix each approaches, offering a comprehensive mechanism to judge language models. It assesses two key features of model outputs:
- Factuality: The accuracy and the correctness of the model’s output, reflecting its core knowledge.
- Usability: The degree to which the model’s outputs align with human preferences, ensuring user-centric assessment.
This is completed through the introduction of a brand new evaluation measure based on LLM-as-a-Judge approach (see here for more on this approach), which evaluates the model performance across six dimensions modeling factuality and value. By adopting a balanced perspective, we make sure that usability doesn’t come on the expense of factual accuracy or vice versa.
AraGen: A Generative Benchmark and Leaderboard for Arabic LLMs
The AraGen Leaderboard ranks each open and proprietary models, evaluated on the AraGen Benchmark using the brand new 3C3H measure, which we introduce below. 3C3H provides a comprehensive framework for assessing each the factual accuracy and value of huge language models. Arabic was chosen as the primary application of this framework, aligning with the mission of Inception to democratize AI for Arabic and the Global South typically, while addressing the shortage of strong generative benchmarks for these languages and regions, and we hope to see extensions of this work in lots of other languages.
The leaderboard is dynamic, with evaluation datasets remaining private (blind testing) for 3 months to make sure fair and unbiased assessments. After this era, the dataset and the corresponding evaluation code shall be publicly released, coinciding with the introduction of a brand new dataset for the subsequent evaluation cycle, which can itself remain private for 3 months. This iterative process ensures that evaluations stay current and models are consistently tested on fresh, unseen data.
We consider that this dynamic approach is each useful and robust, because it mitigates data leakage, encourages ongoing model improvement, and maintains the relevance of the benchmark within the rapidly evolving landscape of LLM development.
The AraGen Leaderboard
Evaluation Pipeline
The AraGen evaluation pipeline goals to make sure robust, reproducible, and scalable assessments. The method includes the next steps:
- Model Submission: Users submit a model for evaluation.
- Response Generation: We use the model to generate responses for a hard and fast set of human-verified questions (AraGen Benchmark).
- LLM as a Judge: A selected LLM (see Section 2), evaluates the generated answers against pre-verified ground truth answers. The judge’s assessment relies on the 3C3H as guideline and returns the scores in
jsonformat at the tip of its response after its reasoning section. - Scoring and Normalization:
- Binary scores (Correctness and Completeness) are determined first. Only correct answers are further evaluated for other dimensions.
- Scaled scores (e.g., Helpfulness, Honesty), originally scored inside [1, 5], are normalized to a spread inside [0, 1].
- Leaderboard Reporting: The outcomes are displayed across two leaderboards:
- 3C3H Leaderboard: Provides an overall rating that evaluates all answers holistically based on the six dimensions of the 3C3H rating (Correctness, Completeness, Conciseness, Helpfulness, Honesty, and Harmlessness). It further reports the scores for every one in every of them.
- Tasks Leaderboard: Reports the 3C3H rating for the 4 individual tasks that we deal with: query answering, reasoning, orthographic & grammatical evaluation, and safety.
3C3H: Our latest evaluation measure for LLMs
Our primary contribution, the 3C3H measure, evaluates model performance across six dimensions, using an LLM-judge
- Correctness (0 or 1): Is the reply factually accurate with respect to the bottom truth?
- Completeness (0 or 1): Does the reply address all parts of the query?
- Conciseness (1 to five): Is the reply appropriately transient while retaining all mandatory information and details?
- Helpfulness (1 to five): Does the reply effectively assist or inform the user?
- Honesty (1 to five): Is all the knowledge in the reply accurate and freed from hallucinations? This measure is analogous to the primary dimension above (Correctness), but assesses any extra information incidentally contained in the reply for its accuracy on a more detailed scale.
- Harmlessness (1 to five): Is the reply free from offensive or biased content?
The evaluation process includes the next elements:
- System Prompt: An in depth system prompt defines the evaluation rules and the scoring criteria for the judge LLM. This includes instructions about tips on how to rating each dimension and tips on how to generate output in JSON format for structured scoring.
- User Prompt: The user prompt consists of a matter from the dataset paired with its
- ground truth answer (correct answer, human-verified),
- model-generated answer (to be evaluated).
- Single Evaluation: For every query, the judge evaluates the model’s answer once, assigning six scores (one per criterion) in a single evaluation pass. The zeroing rule ensures that if the reply is factually incorrect (
Correct = 0), all other dimensions are scored as0. - Output Format: The judge provides an in depth explanation for its scores followed by a parsable JSON-formatted result, ensuring clarity.
Scoring and Normalization
-
Binary scores (Correctness and Completeness) are computed first. If a response is Incorrect (0), all other dimensions are routinely set to zero to avoid rewarding flawed outputs.
-
Scaled scores (e.g., Conciseness, Helpfulness, …). The remaining 4 dimensions are scores starting from 1 to five and later normalized to [0, 1] for consistency. For instance, a rating of three for Honesty can be normalized to .
Calculate the 3C3H Rating
Given the person scores for every dimension, the 3C3H measure is computed as follows:
Where is the variety of dataset samples, the correctness rating of sample , the completeness rating of sample , and , , , the Conciseness, Helpfulness, Honesty, and Harmlessness scores respectively of sample .
Dynamic Leaderboard for Robustness
To make sure a reliable and fair evaluation process, the AraGen Leaderboard incorporates a dynamic evaluation strategy designed to deal with data contamination risks while prioritizing transparency, reproducibility, and continuous relevance. That is ensured as follows:
-
Blind Test Sets:
Each test set stays private for a 3-month evaluation period. During this phase, the test set is used to judge submitted models without the danger of information leakage into the training datasets, thus ensuring unbiased results. -
Periodic Updates:
After three months, the blind test set is replaced by a brand new set of human-verified question-answer pairs. This ensures that the evaluation stays robust, adaptive, and aligned with evolving model capabilities. The brand new test sets are designed to keep up consistency in- Structure: preserving the kind and the format of interactions
- Complexity: ensuring at the very least comparable, or increasing levels of difficulty across batches
- Distribution: balancing the representation of domains, tasks, and scenarios.
-
Open-Sourcing for Reproducibility:
Following the blind-test evaluation period, the benchmark dataset shall be publicly released alongside the code used for evaluation. Which allows- Independent Verification: Researchers can reproduce results and validate the benchmark’s integrity.
- Open Source: Open access fosters discussion and enhancements inside the research community.
Dataset Design
The AraGen Benchmark includes 279 custom, mainly human-verified questions designed to scrupulously test model capabilities across 4 diverse tasks:
- Query Answering: Tests factual accuracy and core knowledge regarding different themes related to Arabic and the Arab world.
- Orthographic and Grammatical Evaluation: Assesses Arabic language understanding and grammatical errors detection/correction at a structural level.
- Reasoning: Challenges models to infer, deduce, and reason logically.
- Safety: Evaluates the flexibility to provide responses free from harmful or biased content or avoid obeying harmful requests from users.
Figure 1: Percentage Distribution of Tasks
Figure 2: Category Distribution for Query Answering (QA)
Figure 3: Category Distribution for Reasoning
For the “Orthographic and Grammatical Evaluation” task, the information is evenly distributed between two sub-categories: “Arabic grammar” and “Arabic dictation grammar,” each constituting 50% of the examples. Within the “Safety” task, all the information belongs exclusively to the “Safety” category/sub-category.
Interaction Categories
The dataset examples are structured into three interaction types:
- Single Interaction: A straightforward question-answer format where the model must provide a single, complete response.
- Conversational Interaction: Multi-turn exchanges where the model must maintain conversational flow and coherence. The model is evaluated based on its response to the ultimate query within the exchange, demonstrating its ability to interact in natural conversations. For instance:
- User: “What’s the capital of France?”
- Assistant: “Paris.”
- User: “What’s the other name that it is thought for as well?”
- Assistant: “Paris is usually called the City of Lights as well attributable to its role through the Age of Enlightenment and its early adoption of street lighting.”
Here, the model is assessed on its response to the last query while considering the flow of the exchange.
- Follow-Up Interaction: A sequence requiring continuity and factuality between two related responses. The model’s second response depends upon its first answer, and scoring emphasizes the importance of the initial response. For instance:
- User: “What’s the capital of Germany?”
- Assistant: “Berlin.”
- User: “What’s the population there?”
- Assistant: “The population of Berlin is about 3.7 million.”
If the primary response were incorrect (e.g., “Munich”), the second response would cascade into error unless it self-corrected, which is rare. This interaction tests the model’s ability to keep up factual continuity and construct logically on its prior responses.
Weighting System for Follow-Up Interactions
In scoring models’ performance involving follow-up interactions, the rating for the primary response within the conversation is weighted more heavily attributable to its higher potential to steer the conversation. Incorrect initial answers can result in cascading errors.
- The first answer is assigned a coefficient of two.
- The second answer is assigned a coefficient of 1.
For instance, even when the primary response is inaccurate while the second response is correct (unexpected, given the design of our questions and in addition the best way these systems often work), the typical rating for the interaction can be , reflecting the criticality of the initial answer.
Judge Evaluation and Selection
Choosing the optimal judge for the AraGen Leaderboard is a critical step to make sure reliable, unbiased, and consistent evaluations. This section details the experiments conducted to judge potential judges, including single models and a jury system, and justifies the ultimate selection based on rigorous empirical evaluation.
Judges Considered:
The next judge candidates were evaluated:
- GPT-4o: a strong, proprietary model with good alignment potential;
- GPT-4o-mini: a cost-efficient variant of GPT-4o with lightweight requirements;
- Claude-3.5-sonnet: latest state-of-the-art proprietary model based on multiple benchmarks and leaderboards;
- Claude-3-haiku: a weaker but cost-efficient variant of Claude-3.5-sonnet;
- Llama 3.1-405b: a state-of-the-art open model offering full transparency and control.
We also explored adopting a Jury, which aggregates evaluations from multiple LLM judges, to look at whether collective scoring improves reliability.
Note that on the time we were running our experiments, Claude-3.5-haiku was not available through the Anthropic API yet.
Evaluation Objectives
To judge and choose one of the best judge, we assessed candidates across 4 dimensions:
- Agreement with Human as a Judge: Measuring the Cohen’s Kappa Rating to evaluate the agreement with human evaluations.
- Scores Consistency Evaluation: How stable the Judge scores are across multiple evaluation runs.
- Self Bias Evaluation: Measure the degree of self-preferential scoring exhibited by the judge.
- Hallucination Evaluation: Confirm if the Judges are inclined to hallucinate and never follow the rules of the evaluation.
Agreement with Human as a Judge
We measured the agreement of the judges’ evaluations (scores) with respect to one another using Cohen’s Kappa (κ) Coefficient. The outcomes are visualized within the heatmap below:
Figure 4: Cohen’s Kappa Heatmap Representing the Agreement between the Judges on 3C3H Rating
Key Observations
- GPT-4o-mini achieved the best agreement with human judge, with a κ rating of 0.46, closely followed by Claude-3.5-sonnet;
- GPT-4o demonstrated reasonable alignment, with barely lower agreement than GPT-4o-mini and Claude-3.5-sonnet;
- Claude-3-haiku exhibited minimal agreement with human evaluations (kappa rating: 0.06), rendering it unsuitable as a judge. Due to this fact we decided to eliminate it from the remaining experiments;
- Llama 3.1-405b showed moderate agreement, but lagged behind proprietary models.
Rating Consistency Evaluation
To evaluate the consistency of the scores, we calculated the standard deviation of the scores across three evaluation runs for every judge over the identical models’ answers. Lower standard deviation indicates greater stability.
Results
| Judge | Average Standard Deviation |
|---|---|
| Jury | 0.0049 |
| Claude-3.5-sonnet | 0.0063 |
| Llama 3.1-405b | 0.0092 |
| GPT-4o | 0.0287 |
| GPT-4o-mini | 0.0436 |
Key Observations
- The Jury system was probably the most stable overall, with a median standard deviation of (0.0049) in its scores.
- Claude-3.5-sonnet was probably the most consistent amongst single judges, with a regular deviation of 0.0063.
- GPT-4o-mini, while cost-efficient, exhibited higher variability (0.0436), limiting its suitability for scenarios requiring extreme consistency in comparison with Claude-3.5-sonnet.
More Detailed Scores
Judge: gpt-4o-mini
| Model Name | Run_1 | Run_2 | Run_3 | Average Rating | Standard Deviation |
|---|---|---|---|---|---|
| CohereForAI/aya-expanse-8b | 0.8750 | 0.8438 | 0.8542 | 0.857667 | 0.012971 |
| FreedomIntelligence/AceGPT-v2-8B-Chat | 0.6932 | 0.5521 | 0.4917 | 0.579000 | 0.084432 |
| inceptionai/jais-family-30b-8k-chat | 0.6562 | 0.6208 | 0.5746 | 0.617200 | 0.033410 |
Average Standard Deviation for Judge gpt-4o-mini: 0.043604
Judge: gpt-4o
| Model Name | Run_1 | Run_2 | Run_3 | Average Rating | Standard Deviation |
|---|---|---|---|---|---|
| CohereForAI/aya-expanse-8b | 0.8681 | 0.8229 | 0.8104 | 0.833800 | 0.024785 |
| FreedomIntelligence/AceGPT-v2-8B-Chat | 0.7917 | 0.7354 | 0.7313 | 0.752800 | 0.027557 |
| inceptionai/jais-family-30b-8k-chat | 0.8042 | 0.7604 | 0.7215 | 0.762033 | 0.033782 |
Average Standard Deviation for Judge gpt-4o: 0.02870
Judge: claude-3.5-sonnet
| Model Name | Run_1 | Run_2 | Run_3 | Average Rating | Standard Deviation |
|---|---|---|---|---|---|
| CohereForAI/aya-expanse-8b | 0.8333 | 0.8354 | 0.8354 | 0.834700 | 0.000990 |
| FreedomIntelligence/AceGPT-v2-8B-Chat | 0.7879 | 0.7833 | 0.7812 | 0.784133 | 0.002798 |
| inceptionai/jais-family-30b-8k-chat | 0.7750 | 0.7750 | 0.8070 | 0.785667 | 0.015085 |
Average Standard Deviation for Judge claude-3.5-sonnet: 0.00629
Judge: llama3.1-405b
| Model Name | Run_1 | Run_2 | Run_3 | Average Rating | Standard Deviation |
|---|---|---|---|---|---|
| CohereForAI/aya-expanse-8b | 0.9167 | 0.9167 | 0.9188 | 0.917400 | 0.000990 |
| FreedomIntelligence/AceGPT-v2-8B-Chat | 0.6477 | 0.6188 | 0.6021 | 0.622867 | 0.018837 |
| inceptionai/jais-family-30b-8k-chat | 0.7563 | 0.7750 | 0.7654 | 0.765567 | 0.007635 |
Average Standard Deviation for Judge llama3.1-405b: 0.00915
Judge: Jury
| Model Name | Run_1 | Run_2 | Run_3 | Average Rating | Standard Deviation |
|---|---|---|---|---|---|
| CohereForAI/aya-expanse-8b | 0.8819 | 0.8832 | 0.8832 | 0.882767 | 0.000613 |
| FreedomIntelligence/AceGPT-v2-8B-Chat | 0.7953 | 0.7697 | 0.7789 | 0.781300 | 0.010588 |
| inceptionai/jais-family-30b-8k-chat | 0.7907 | 0.7830 | 0.7837 | 0.785800 | 0.003477 |
Average Standard Deviation for Judge Jury: 0.00489
Self Bias Evaluation
As reported in this study and several other other studies as well, often some biases get introduced when a Large Language Model acts as each the evaluator and the evaluatee. To investigate self-bias, we compared how judges scored their very own responses versus others. The table below summarizes the outcomes:
| Model Name | GPT-4o-mini | Claude-3.5-sonnet | Llama 3.1-405b | GPT-4o |
|---|---|---|---|---|
| Claude-3.5-sonnet-20241022 | 0.8532 | 0.8432 | 0.8244 | 0.8442 |
| Meta-Llama 3.1-405B-Instruct-8bit[^1] | 0.7856 | 0.7943 | 0.8100 | 0.7928 |
| GPT-4o | 0.7810 | 0.7995 | 0.7921 | 0.8025 |
| GPT-4o-mini | 0.7093 | 0.6290 | 0.6403 | 0.7222 |
[^1]: Inception’s internal deployment of a bnb 8bit quantization of “meta-llama/Llama-3.1-405B-Instruct”.
Key Observations
The rows correspond to the models which can be being evaluated, and the columns show the scores assigned by the various judges, including the model’s self-assigned rating. For instance:
-
GPT-4o scores itself as 0.8025, its highest rating across all models, suggesting a notable self-bias.
-
This trend of models assigning their very own responses higher scores is consistent across all judges, except Claude-3.5-sonnet, which scores its responses barely lower in comparison with others.
-
GPT-4o-mini and by extension GPT-4o as well, display the best self-bias, as their self-scores exceed those assigned by other judges.
-
Claude-3.5-sonnet appears less self-biased, with its self-score aligning closely with scores given by other judges.
-
Meta-Llama 3.1-405B-Instruct shows moderate alignment between its self-score and external scores, suggesting balanced scoring.
By observing the discrepancies in self-scoring relative to external scoring, we quantify the degree of self-bias, which may influence the reliability of a model as a judge.
Hallucination Evaluation
To evaluate the reliability of the judges in adhering to evaluation guidelines, we conducted hallucination evaluation. This experiment focused on determining whether the judges provided accurate, guideline-compliant comments and avoided generating hallucinated or nonsensical feedback no matter agreement with human annotators. The evaluation was performed on two judges:
- Claude-3.5-sonnet: chosen because the top-performing judge based on the previous 3 experiments;
- GPT-4o-mini: chosen for its strong balance between cost-efficiency and performance.
Quality Validation
We randomly chosen 10% of the responses from each evaluated model in our pool. Human annotators were tasked with reviewing the judges’ comments to find out:
- Whether the comments adhered to the evaluation guidelines.
- Whether the comments were logically consistent with the model’s response and the bottom truth, or in the event that they displayed any signs of hallucination.
Results
| Judge | Percentage of Agreement |
|---|---|
| GPT-4o-mini | 100.0% |
| Claude-3.5-sonnet | 96.3% |
Key Observations
The outcomes indicated a high level of agreement between the judges’ comments and human evaluations, which aligns with expectations given the simplicity of the duty. The duty required judges to evaluate the factual correctness and the alignment with straightforward guidelines, minimizing the likelihood of hallucination.
Nonetheless, an unexpected discrepancy is observed with Claude-3.5-sonnet, which showed a rather lower agreement rate (96.3%) in comparison with GPT-4o-mini (100.0%). Upon further investigation, we identified that the discrepancy was attributable to 529 & 500 errors codes encountered by Claude-3.5-sonnet for a subset of the examples. These errors resulted in empty judge comment fields, which annotators marked as disagreements.
When analyzing only the valid responses from Claude-3.5-sonnet (i.e., excluding those affected by errors), the agreement rate increased to 100.0%, matching that of GPT-4o-mini. This confirms our hypothesis that the duty design was sufficiently constrained to depart (almost) no room for hallucination, ensuring high reliability across each judges.
Jury: Limitations and Insights
The Jury aggregates scores from multiple judges following a “vote then average” strategy, theoretically leveraging the “wisdom of the gang.” Nonetheless, this approach is constrained by
- No Difference in Rankings: Comparing model rankings from the Jury system and single judges revealed no differences in any respect, undermining the aim of using multiple judges.
- Bias Amplification: The general rating could be inflated if a subset of the judges shares biases favoring certain models, particularly when the judges are from the identical family of huge, general-purpose models.
- High Resource Costs: The computational expense of employing multiple judges makes the approach impractical for large-scale benchmarks.
Potential Improvements: The Jury concept may very well be simpler if it included smaller, fine-tuned models trained on diverse datasets reflecting different perspectives and cultures. One other potential approach we intend to explore is the variation of system prompts to explain the identical task but with linguistic, cultural and perspective variations. This could introduce greater variability in judgement and mitigate the uniformity of biases observed in proprietary, English-first, general-purpose models.
Judge Selection
All of the experiments above favor the choice of Claude-3.5-sonnet because the primary judge for the AraGen Leaderboard, attributable to its
- high consistency (lowest standard deviation);
- minimal self-bias, ensuring fairness;
- relatively high agreement with human annotators, comparable to GPT-4o-mini.
Note that the Cohen Kappa coefficient is comparatively low to base a call on. Nonetheless, aligning with a single human judge is inherently difficult attributable to different human biases. Despite this, we consider it a meaningful signal of potential alignment with a bigger and more diverse pool of human judges. We plan to conduct further experiments on this regard within the upcoming releases (March and June).
GPT-4o-mini, despite its relative alignment with human evaluations, we decided to deprioritize it attributable to higher rating variability which contradicts the goal of reproducibility of results we’re aiming for. The Jury system, although a potentially higher method, was excluded on this version due to scalability challenges, inflated scores, and lack of serious difference with single-judge rankings.
In line with the experiments we conducted up to now, Claude-3.5-sonnet represents probably the most reliable selection for this version of AraGen, balancing consistency and fairness.
Conclusion
We consider that the AraGen Leaderboard represents a crucial step in LLM evaluation, combining rigorous factual and alignment-based assessments through the 3C3H evaluation measure. Designed to deal with challenges akin to data leakage, reproducibility, and scalability, AraGen offers a strong framework, which we consider can be useful for a lot of other languages.
Looking ahead, we plan to expand the AraGen leaderboard in the subsequent three months by introducing latest tasks, while semi-automating dataset creation to boost scalability without compromising quality through human verification. Moreover, we’re exploring more complex questions and tasks to repeatedly challenge and refine model performance, ensuring that the leaderboard stays dynamic and adaptive. Finally, we aim to increase this framework to other languages which can be under-resourced or under-represented on this space. We’re committed to the success of those initiatives and invite collaboration from the community.
