The 🇨🇿 BenCzechMark is the first and most comprehensive evaluation suite for assessing the capabilities of Large Language Models (LLMs) in the Czech language. It aims to test how well LLMs can:
- Reason and perform complex tasks in Czech.
- Generate and verify grammatically and semantically correct Czech.
- Extract information and store knowledge by answering questions on Czech culture and Czech-related facts.
- Do what language models were originally trained for—estimate the probability of Czech texts.
To achieve this, we have sourced 50 tasks spanning 9 categories, with 90% of the tasks having native, non-translated content.
In this blog post, we introduce both the evaluation suite itself and the BenCzechMark leaderboard, featuring over 25 open-source models of various sizes!
📋 Tasks and Categories
The 🇨🇿 BenCzechMark (in its current version) is split into 9 categories to comprehensively assess LLM abilities. For each task,
- We manually design at least 5 prompts and record the best performance and variance across prompts.
- We distinguish between 4 types of tasks and associate each with a metric:
- Accuracy (Acc) measures multiple-choice (MC) tasks,
- Exact Match (EM) measures tasks with open short-answer generation,
- Area Under the Receiver Operating Characteristic Curve (AUROC, computed as the average of one-vs-all scores in the multi-class setting) measures performance on classification tasks, without the need for threshold calibration.
Out-of-the-box language models are often biased by the category distributions of their training data, the way prompts are structured, and the examples provided during inference. These biases can vary across models, making predictions inconsistent depending on the specific model and its influences. To ensure reliable decision-making on datasets with different class distributions, calibration is needed to adjust the model's predictions. However, by using threshold-free metrics like AUROC, which focus on ranking rather than decision thresholds, calibration can be avoided entirely. This approach enables fairer model comparisons by eliminating the need for calibration (see e.g., Zhao et al., 2021 for more details on calibration of LLMs).
- Word-level Perplexity (Ppl) is associated with language modeling tasks. It quantifies the likelihood with which the model would generate the text, normalized per number of words in the corpus.
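For illustration, here is a minimal Python sketch of how the two less common metrics above can be computed; the function names, array shapes, and the use of scikit-learn are our own assumptions, not the benchmark's actual implementation:

```python
import math

import numpy as np
from sklearn.metrics import roc_auc_score


def macro_ovr_auroc(y_true: np.ndarray, class_probs: np.ndarray) -> float:
    """Average of one-vs-all AUROCs over classes; threshold-free, so no calibration is needed."""
    aucs = [
        roc_auc_score((y_true == c).astype(int), class_probs[:, c])
        for c in range(class_probs.shape[1])
    ]
    return float(np.mean(aucs))


def word_level_perplexity(token_logprobs: list[float], num_words: int) -> float:
    """Perplexity normalized per number of words (not tokens), assuming natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / num_words)
```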
The translated portion of the dataset (10% of the total) was mostly translated via CUBBITT LINDAT Translation, except for CsFever, where the authors used DeepL for translation.
This is the complete list of categories, alongside the datasets and metrics used:
- Reading Comprehension tests whether the system can extract the answer to a question based on information provided in the context.
- Belebele – Acc – contains questions about manually translated web articles.
- SQAD3.2 – EM – is a well-established reading comprehension task in SQuAD format, sourced from Wikipedia.
- Factual Knowledge contains questions testing factual knowledge stored in the model.
- Umimeto (5 tasks focused on Biology/Chemistry/History/Informatics/Physics) – Acc – Elementary and high school questions from the respective subjects. Sourced from umimeto.org.
- TriviaQA – EM (translated using CUBBITT) – contains Q/A from trivia and quiz-league websites (U.S.-centric dataset).
- NaturalQuestions – EM (translated using CUBBITT) – contains Q/A from Google Search (U.S.-centric dataset). We include these to verify that the model did not forget any EN-centric knowledge when prompted in Czech (i.e., after possible domain transfer).
- Czech Language Understanding targets the fine-grained understanding of syntactic structure and nuanced meaning in the Czech language.
- CERMAT (Open/TF/MC) – EM/AUROC/Acc – focuses on understanding tasks sourced from 6th- and 9th-year primary school tests and state high school exams in Open/True-False/Multiple-choice formats.
- Grammar Error Detection – AUROC (True/False grammar error prediction task) – contains sentences from language learner essays.
- Agree – Acc – requires filling in missing grammar suffixes of past tense verbs.
- Language Modeling tests how likely the model is to generate specific samples of Czech text.
- Czech National Corpus – Ppl – includes 7 tasks that span spoken, dialectal, historical, and other varieties of the Czech language, sourced from ČNK.
- HellaSwag – Acc (translated using CUBBITT) – requires choosing the plausible continuation of a text from 4 options.
- Math Reasoning in Czech quantifies how well the model can process and solve Czech math assignments.
- Klokan QA – Acc – elementary/high school problems from the Czech math competition.
- CERMAT – EM/Acc – Math subsection of CERMAT Open/MC.
- Umimeto (Math) – Acc – Math subsection of Umimeto.
- Natural Language Inference tests whether the text entails the information stated in the associated text pair.
- Czech SNLI – AUROC (SNLI translated using CUBBITT + manual correction) – tests for entailment of the hypothesis in the premise text.
- CSFever – AUROC (Czech version of the FEVER dataset, using partial translation) – asks whether a claim is (at least partially) supported by the evidence.
- CTKFacts – AUROC – same format as CSFever, but manually sourced from Czech News Agency articles.
- Propaganda – AUROC – contains 13 tasks predicting various aspects of news articles, such as location, genre, and emotive theme.
- Named Entity Recognition determines whether the model recognizes different named entity types in the text.
- CNEC2.0 – EM – standard NER dataset in Czech.
- Court Decisions – EM – NER derived from decisions of Czech Supreme/Constitutional Courts.
- Sentiment Analysis quantifies how well the model estimates the sentiment of the text.
- Subjectivity – AUROC – asks whether a passage is subjective or objective.
- CzechSentiment (MALL/CSFD/FB) – AUROC – sentiment analysis of product reviews, movie reviews, and Facebook comments.
- Document Retrieval focuses on identifying the relevant documents.
- Historical IR – Acc – multiple-choice task for choosing passages relevant/irrelevant to a question.
⚔️ Model Duels and Average Rating
Since we use different metrics for the tasks, simply averaging would not work due to their differing scales. Instead, we have introduced a novel method to determine the final score: we let the models fight!
For each task and metric, we run a test of statistical significance at α=0.05: model A beats model B if the probability that their performances are equal is estimated to be lower than 0.05. We use the following tests, each with different statistical power:
- ACC and EM: one-tailed paired t-test,
- AUROC: Bayesian test inspired by Goutte et al., 2005,
- Ppl: bootstrapping.
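For concreteness, here is a minimal sketch of how a single Acc/EM duel could be decided with SciPy's one-tailed paired t-test; the per-example score vectors and the helper name are our illustrative assumptions (AUROC and Ppl duels use the Bayesian test and bootstrapping listed above):

```python
from scipy.stats import ttest_rel

ALPHA = 0.05  # significance level used for every duel


def model_a_wins(per_example_scores_a: list[float], per_example_scores_b: list[float]) -> bool:
    """One-tailed paired t-test: A wins the duel if its per-example scores
    (e.g. 0/1 correctness for Acc/EM) are significantly greater than B's."""
    result = ttest_rel(per_example_scores_a, per_example_scores_b, alternative="greater")
    return result.pvalue < ALPHA
```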
We then compute each model's duel win score (DWS) – the proportion of duels won against all other models on that task. Finally, we calculate aggregate scores as follows:
- Category DWS: average of task DWSs within the category,
- Average DWS: average across category DWSs.
This yields an easy-to-understand model score: the macro-averaged model win rate!
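Putting the aggregation together, here is a small sketch (variable names and data layout are our own) of how a task-level duel win score could be rolled up into the final macro-averaged win rate:

```python
import numpy as np


def duel_win_score(duel_results: list[bool]) -> float:
    """Task-level DWS: fraction of duels the model won against every other model."""
    return float(np.mean(duel_results))


def average_dws(task_dws_by_category: dict[str, list[float]]) -> float:
    """Category DWS = mean of task DWSs in the category; average DWS = mean over categories."""
    category_dws = [float(np.mean(task_scores)) for task_scores in task_dws_by_category.values()]
    return float(np.mean(category_dws))
```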
👑 BenCzechMark Leaderboard – Llama-405B Takes the Crown
To identify the top-performing open-source model in our suite, we evaluated 26 open-weight models using the following parameters:
- Maximum input length: 2048 tokens
- Few-shot examples: 3
- Truncation: Smart truncation (truncates few-shot examples first, then the task description)
- Log-probability aggregation: Average-pooling (helps mitigate long-document bias)
- Chat templates: Not used
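To make the log-probability aggregation setting concrete, here is a minimal sketch of average pooling when scoring one multiple-choice continuation; the function name and inputs are our own illustration, not the harness's actual code:

```python
def average_pooled_logprob(continuation_token_logprobs: list[float]) -> float:
    """Summed log-probability favors short continuations and penalizes long documents,
    so the sum is divided by the number of tokens (average pooling)."""
    return sum(continuation_token_logprobs) / len(continuation_token_logprobs)


# The candidate answer with the highest average-pooled log-probability is taken as the prediction.
```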
The results can be explored in our Space. While Llama-405B emerged as the clear overall winner, it did not dominate every category. Interestingly, some models excelled in specific areas, for instance:
- Qwen-72B shone in Math and Information Retrieval but lagged behind similarly sized models in other categories.
- The Aya-23-35B model excels in Sentiment and Language Modeling, but similarly lags behind in the other categories.
- Gemma-2 9B delivers excellent results in Czech reading comprehension, outperforming much larger models.
🇨🇿 Think Your Model Can Excel in Czech? Submit It!
One of our major goals with BenCzechMark is to empower researchers to assess their models' capabilities in Czech and to encourage the community to train and discover models that excel in the Czech language.
If you know of a model that stands out, we would love for you to submit it to our leaderboard, making the competition even more exciting!
To help you get started, we have prepared a simple 3-step guide, which you can find in the BenCzechMark space under the Submission tab.
🌟 Acknowledgements
We would like to extend our thanks to all contributors from BUT FIT, FI MUNI, CIIRC CTU, and Hugging Face for their invaluable work in bringing BenCzechMark to life.
We are also grateful to the organizations that provided source data for some of the tasks, namely Umímeto, CERMAT, and ČNK.
📚 Citation and references
@article{fajcik2024benczechmark,
  title = {{B}en{C}zech{M}ark: A Czech-centric Multitask and Multimetric Benchmark for Language Models with Duel Scoring Mechanism},
  author = {Martin Fajcik and Martin Docekal and Jan Dolezal and Karel Ondrej and Karel Benes and Jan Kapsa and Michal Hradis and Zuzana Neverilova and Ales Horak and Michal Stefanik and Adam Jirkovsky and David Adamczyk and Jan Hula and Jan Sedivy and Hynek Kydlicek},
  year = {2024},
  url = {https://huggingface.co/spaces/CZLC/BenCzechMark},
  institution = {Brno University of Technology, Masaryk University, Czech Technical University in Prague, Hugging Face},
}
