Why 3LM?
Arabic Large Language Models (LLMs) have made notable progress in recent years, yet existing benchmarks fall short when it comes to evaluating performance in high-value technical domains. Most evaluations to date have focused on general-purpose tasks such as summarization, sentiment analysis, or open-ended question answering. However, scientific reasoning and programming are essential for a broad range of real-world applications, from education to technical problem-solving.
To address this gap, we introduce 3LM (علم), a multi-component benchmark tailored to gauge Arabic LLMs on STEM (Science, Technology, Engineering, and Mathematics) subjects and code generation. 3LM is the first benchmark of its kind, designed specifically to test Arabic models on structured reasoning and formal logic, domains traditionally underrepresented in Arabic NLP.
What’s in the Benchmark?
3LM is made up of three datasets, each targeting a distinct evaluation axis: real-world multiple-choice STEM questions (MCQs), synthetic high-difficulty STEM questions, and translated code generation tasks.
1. Native STEM
The Native STEM benchmark consists of 865 MCQs extracted from authentic Arabic educational content, including textbooks, worksheets, and exam banks for grades 8 through 12. Questions span five core subjects: Physics, Chemistry, Biology, Mathematics, and Geography.
Each question is annotated with metadata including domain and difficulty (on a 1–10 scale). The data was sourced using a pipeline that combined OCR (including LaTeX math parsing via Pix2Tex), LLM-assisted question-answer extraction, and manual review. This dataset provides a practical testbed for evaluating factual and conceptual understanding in Arabic models using real educational materials.
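To make the OCR stage concrete, here is a minimal sketch of how plain text and formula extraction could be combined. Only Pix2Tex is named above; pytesseract and the formula_boxes input are illustrative assumptions, not the exact 3LM implementation:

```python
# Minimal sketch of the OCR stage: Arabic plain text via a generic OCR engine
# and LaTeX math via Pix2Tex. pytesseract and the formula_boxes input are
# illustrative assumptions; only Pix2Tex is named in the post.
from PIL import Image
import pytesseract                 # assumed generic OCR engine (Arabic model: "ara")
from pix2tex.cli import LatexOCR   # LaTeX OCR for formula image crops

latex_ocr = LatexOCR()

def ocr_page(page_path, formula_boxes):
    """Return Arabic plain text for a page plus LaTeX for known formula regions."""
    page = Image.open(page_path)
    text = pytesseract.image_to_string(page, lang="ara")            # Arabic plain text
    formulas = [latex_ocr(page.crop(box)) for box in formula_boxes]  # (left, top, right, bottom) crops
    return {"text": text, "formulas": formulas}
```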
2. Synthetic STEM
To introduce greater challenge and variety, we created a synthetic subset of 1,744 MCQs using the YourBench pipeline. This component draws from Arabic textbook text, which is chunked, summarized, and used as input to an LLM-driven question generation system. The result is a curated set of questions focused on mid-to-high-difficulty reasoning, including conceptual, analytical, and application-based problems.
Synthetic STEM provides a crucial counterbalance to native MCQs by probing deeper reasoning skills and minimizing answer bias. All generated questions underwent filtering based on clarity, structure, and content validity, followed by quality assurance via manual review.
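As a rough illustration of the chunk-and-generate flow (summarization omitted for brevity), here is a simplified, hypothetical stand-in. The real work is done by the YourBench pipeline; the model name, prompt wording, and JSON output contract below are assumptions for illustration only:

```python
# Simplified, hypothetical stand-in for the chunk -> generate flow (the real
# pipeline is YourBench). Model name, prompt wording, and the JSON output
# contract are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def chunk(text, size=2000):
    """Naive fixed-size chunking of an Arabic textbook passage."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def generate_mcqs(chunk_text, n=3):
    """Ask an LLM for n Arabic MCQs grounded in the given chunk, returned as JSON."""
    prompt = (
        f"اكتب {n} أسئلة اختيار من متعدد بالعربية بناءً على النص التالي، "
        "مع أربعة خيارات وإجابة صحيحة واحدة، بصيغة JSON.\n\n" + chunk_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content  # JSON string of candidate questions
```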
3. Arabic Code Benchmarks
The third component of 3LM targets code generation, a growing area of LLM evaluation. We translated and adapted the widely used HumanEval+ and MBPP+ benchmarks into Arabic, creating the first code datasets that test Arabic LLMs on natural-language programming prompts.
We used GPT-4o for prompt translation and validated the results with a back-translation pipeline, rejecting low-quality samples based on a ROUGE-L F1 threshold (< 0.8). Additional human filtering ensured prompt clarity and correctness. The code and test suites remain unchanged to preserve scoring fidelity. Evaluations use the EvalPlus framework for pass@1 and pass@1+ metrics.
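The back-translation filter can be sketched in a few lines with the rouge_score package. The 0.8 threshold is the one quoted above; the exact implementation details in 3LM may differ:

```python
# Sketch of the back-translation filter: keep a translated prompt only if the
# ROUGE-L F1 between the original English prompt and its round-trip
# back-translation is at least 0.8 (the threshold quoted above).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def keep_translation(original_en, backtranslated_en, threshold=0.8):
    """Accept the Arabic translation only if its English round-trip stays close."""
    f1 = scorer.score(original_en, backtranslated_en)["rougeL"].fmeasure
    return f1 >= threshold
```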
Constructing the Benchmark
Each dataset in 3LM went through a multi-stage development process to ensure data quality, fairness, and representativeness.

For Native STEM, we collected Arabic PDF sources and applied a dual OCR approach to recover both plain text and math formulas. Questions were extracted using LLM-based chunking and pattern recognition, followed by classification into MCQ format with randomized answer order. Final samples were reviewed by native Arabic speakers with STEM expertise to verify answer validity and readability.
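As a small, concrete example, the answer-order randomization step could look like the following sketch; the record fields are assumptions rather than the released schema:

```python
# Toy sketch of the answer-order randomization step: shuffle the extracted
# choices while tracking where the correct answer lands. Field names are
# assumptions, not the released schema.
import random

def shuffle_choices(question, choices, correct_idx, seed=None):
    """Return the MCQ with choices in random order and the updated answer index."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    return {
        "question": question,
        "choices": [choices[i] for i in order],
        "answer_idx": order.index(correct_idx),
    }
```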
For Synthetic STEM, the YourBench pipeline was adapted for Arabic input. After ingestion, source documents were summarized, chunked, and then fed to a code-controlled generator for MCQ creation. We filtered out image-dependent or ambiguous content and only retained questions within target difficulty ranges. The result is a clean, high-quality set of synthetic Arabic MCQs for STEM.
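A rough, hypothetical version of that post-generation filter is sketched below; the Arabic figure markers and the difficulty band are illustrative, not the exact rules used:

```python
# Hypothetical version of the post-generation filter: drop questions that
# depend on a figure and keep only those inside a target difficulty band.
# The markers and the band are illustrative assumptions.
IMAGE_MARKERS = ("الشكل", "الصورة", "الرسم")  # "the figure", "the image", "the diagram"

def keep_question(q, min_difficulty=4, max_difficulty=9):
    """Return True if the generated MCQ is self-contained and in range."""
    if any(marker in q["question"] for marker in IMAGE_MARKERS):
        return False  # image-dependent content
    return min_difficulty <= q["difficulty"] <= max_difficulty
```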
For the Code Benchmarks, our goal was to isolate language understanding while preserving code logic. Prompt translation was handled by GPT-4o with verification via reverse translation. Code and tests were left untouched to allow evaluation parity with the English versions. The result is a benchmark where Arabic prompts can be evaluated directly with the EvalPlus toolchain.
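Under those assumptions, plugging the Arabic prompts into the EvalPlus flow could look roughly like this; load_arabic_prompts() and generate() are placeholders, not part of EvalPlus or the 3LM release:

```python
# Rough sketch of how Arabic prompts plug into the EvalPlus flow: generate one
# completion per HumanEval-ar task, write samples.jsonl, then score it with the
# standard evalplus evaluation step. load_arabic_prompts() and generate() are
# placeholders, not part of EvalPlus or the 3LM release.
from evalplus.data import write_jsonl

def load_arabic_prompts():
    """Placeholder: map each HumanEval task_id to its Arabic prompt text."""
    raise NotImplementedError

def generate(prompt):
    """Placeholder: query the model under evaluation for a code completion."""
    raise NotImplementedError

samples = [
    {"task_id": task_id, "solution": generate(prompt)}
    for task_id, prompt in load_arabic_prompts().items()
]
write_jsonl("samples.jsonl", samples)  # scored afterwards with evalplus
```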
Key Results
We evaluated over 40 LLMs, including Arabic-first, multilingual, and general-purpose base and instruction-tuned models. Evaluation was performed using both multiple-choice accuracy and generative completion metrics.
In the MCQ setting, Qwen2.5-72B-Instruct achieved top performance on both the native (71.8%) and synthetic (67.0%) STEM subsets. For completion tasks, Gemma-3-27B showed the strongest results, with 43.2% accuracy on STEM answers.
In code generation, GPT-4o demonstrated best-in-class performance on both HumanEval-ar (83.5% pass@1+) and MBPP-ar (63.6% pass@1+). These results also show a strong correlation (~0.97) between Arabic and English pass@1 scores, suggesting that coding ability transfers across the two languages and that the translated prompts preserve task difficulty.
We also examined robustness under distractor perturbation, revealing that instruction-tuned models are significantly more stable than their base counterparts. Prompt engineering and zero-shot design were also shown to meaningfully affect Arabic STEM performance.
Evaluation Tooling
We built the benchmark to be easily reproducible using standard tools:
- lighteval handles both multiple-choice and open-ended question evaluation for the STEM datasets.
- evalplus powers robust pass@1 and pass@1+ code scoring using function-level testing.
All scripts, configs, and evaluation pipelines are available in our GitHub repository and can be adapted to evaluate any model compatible with HuggingFace Transformers or OpenAI APIs.
Access the Datasets
All three datasets are open source and hosted on HuggingFace Datasets.
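Loading them follows the standard HuggingFace Datasets pattern; the repository ids below are placeholders, so substitute the ones listed on the project page:

```python
# Hypothetical loading sketch; the repository ids below are placeholders, so
# substitute the ones listed on the project page.
from datasets import load_dataset

native = load_dataset("ORG/3LM-native-stem")        # placeholder repo id
synthetic = load_dataset("ORG/3LM-synthetic-stem")  # placeholder repo id
code_ar = load_dataset("ORG/humaneval-ar")          # placeholder repo id

print(native)  # inspect the available splits and fields
```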
Citation
If you use 3LM in your research, please cite us:
@article{boussaha2025threeLM,
title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
journal={arXiv preprint arXiv:2507.15850},
year={2025}
}



