LLMs are increasingly capable in English, but it remains quite hard to know how well they perform in other widely spoken languages, each of which presents its own set of linguistic challenges. Today, we're excited to fill this gap for Japanese!
We would like to announce the Open Japanese LLM Leaderboard, composed of more than 20 datasets covering classical to modern NLP tasks, designed to shed light on the underlying mechanisms of Japanese LLMs. The Open Japanese LLM Leaderboard was built by LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs), in partnership with Hugging Face.
The Japanese language presents its own specific challenges. Morphologically rich and in constant evolution due to historical and cultural interactions with the rest of the world, its writing system relies on a combination of three separate sets of characters: kanji (漢字), ideographic characters of Chinese origin, and two phonetic syllabaries, hiragana (平仮名 / ひらがな) and katakana (片仮名 / カタカナ), the latter often used for foreign words. Modern Japanese is arguably one of the hardest languages to process, as it blends Sino-Japanese and native Japanese vocabulary, Latin script (romaji / ローマ字), loanwords from Dutch, Portuguese, French, English, and German, plus Arabic and traditional Chinese numerals. In addition, the Japanese digital world has brought an evolution of emoticons: emoticons written in Unicode :), kaomoji using the Cyrillic alphabet (っ °Д °;)っ, and Greek letters _φ(°-°=). Not to mention, of course, the classic emojis, which originated in Japan with the rise in popularity of mobile phones in the 1990s.
The intricate writing system of Japanese hides an additional layer of complexity: the lack of spaces between words. Like Chinese and Thai, Japanese does not use whitespace between linguistic units, making the detection of word boundaries extremely difficult during tokenization. Over time, the vibrant Japanese ecosystem (from prestigious university laboratories and AI startups to the R&D centers of industry giants) has incorporated the specificities of Japanese NLP to develop modern, robust Japanese LLMs, but the field has lacked a centralized and open system to compare these models.
We therefore introduce the Open Japanese LLM Leaderboard, a collaboration between Hugging Face and LLM-jp, to foster transparency in research and encourage an open-source model development philosophy. We strongly believe this initiative will serve as a platform for Japanese and international researchers to collaborate, evaluate, and enhance Japanese LLMs.
Introduction to the Leaderboard Tasks
The Open Japanese LLM Leaderboard evaluates Japanese LLMs using a specialized evaluation suite, llm-jp-eval, covering a range of 16 tasks, from classical ones (such as Natural Language Inference, Machine Translation, Summarization, and Question Answering) to more modern ones (such as Code Generation, Mathematical Reasoning, or Human Examination). Tasks are run in a 4-shot setting: each prompt includes four solved examples before the target question, as sketched below.
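As a rough illustration of the 4-shot setting, here is a minimal sketch of how such a prompt might be assembled. The example pairs and the prompt template are invented for illustration and may differ from llm-jp-eval's actual templates.

```python
# Illustrative 4-shot prompt assembly for an NLI-style task.
# Example pairs and template are invented; llm-jp-eval's actual templates may differ.
few_shot_examples = [
    ("太郎は走った。 / 太郎は移動した。", "含意"),          # "Taro ran. / Taro moved." -> entailment
    ("花子は眠っている。 / 花子は走っている。", "矛盾"),    # "Hanako is sleeping. / Hanako is running." -> contradiction
    ("山田は本を読んだ。 / 山田は小説を読んだ。", "中立"),  # "Yamada read a book. / Yamada read a novel." -> neutral
    ("犬が吠えた。 / 動物が鳴いた。", "含意"),              # "A dog barked. / An animal made a sound." -> entailment
]

def build_4shot_prompt(question: str) -> str:
    """Prepend the four solved examples to the target question."""
    parts = [f"問題: {q}\n答え: {a}" for q, a in few_shot_examples]
    parts.append(f"問題: {question}\n答え:")
    return "\n\n".join(parts)

print(build_4shot_prompt("猫がいる。 / 動物がいる。"))  # "There is a cat. / There is an animal."
```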
Datasets have been compiled by the evaluation team of LLM-jp, either built from scratch with linguists, experts, and human annotators, or translated automatically into Japanese and adjusted to Japanese specificities, with some requiring long-context reasoning. For a better understanding of the leaderboard, we will detail samples from 8 datasets (in Japanese, followed by the English translation in light gray). For more details about all the available tasks, please refer to the "About" tab of the leaderboard and the official links for each dataset.
Jamp
Jamp (Controlled Japanese Temporal Inference Dataset for Evaluating Generalization Capability of Language Models) is a Japanese temporal inference benchmark for NLI. The dataset contains English and Japanese sentence pairs covering various temporal inference patterns, annotated with gold labels such as entailment, neutral, or contradiction.
JEMHopQA
JEMHopQA (Japanese Explainable Multi-hop Question Answering) is a Japanese multi-hop QA dataset for evaluating internal reasoning: the task takes a question as input and generates both an answer and its derivations.
jcommonsenseqa
jcommonsenseqa is a Japanese version of CommonsenseQA, a multiple-choice question answering dataset. The aim of this dataset is to evaluate commonsense reasoning ability.
chABSA
chABSA was developed as an Aspect-Based Sentiment Analysis dataset. chABSA is based on the financial reports of Japanese listed companies in the 2016 fiscal year, annotated with triples of entity, attribute, and sentiment. More specifically, reports from 230 of the 2,260 companies listed in Japan (roughly 10% of all listed companies) were annotated according to the taxonomy of the Japanese financial regulator, the Financial Services Agency (FSA). A hypothetical record illustrating this annotation scheme follows below.
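To make the annotation scheme concrete, here is a hypothetical chABSA-style record; the sentence and labels are invented for illustration, not taken from the dataset.

```python
# Hypothetical chABSA-style record (not an actual entry from the dataset),
# illustrating the (entity, attribute, sentiment) annotation scheme.
sample = {
    "sentence": "当社の売上高は増加したが、営業利益は減少した。",
    # "Our net sales increased, but operating income declined."
    "annotations": [
        {"entity": "当社", "attribute": "売上高", "sentiment": "positive"},    # net sales: up
        {"entity": "当社", "attribute": "営業利益", "sentiment": "negative"},  # operating income: down
    ],
}
```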
mbpp-ja
The mbpp-ja dataset is a programming dataset: it is a Japanese version of the Mostly Basic Python Problems (MBPP) dataset, translated from English into Japanese by LLM-jp using the translation tool DeepL.
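Assuming the dataset is published on the Hugging Face Hub under an id like llm-jp/mbpp-ja (check the Hub for the exact id), a minimal loading sketch with the datasets library might look like this:

```python
# Minimal sketch of loading the dataset with the Hugging Face `datasets` library.
# "llm-jp/mbpp-ja" is an assumed dataset id; check the Hub for the exact one.
from datasets import load_dataset

ds = load_dataset("llm-jp/mbpp-ja")
print(ds)  # shows the available splits and fields

first_split = list(ds.keys())[0]
print(ds[first_split][0])  # a record: Japanese task description, reference code, test cases
```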
mawps
Based on the MAWPS dataset (A Math Word Problem Repository), the Japanese mawps dataset is a mathematical evaluation dataset. This version evaluates the ability to solve novel tasks by reasoning step by step, a procedure otherwise known as Chain-of-Thought (CoT) reasoning, and was adjusted by converting names of people, units, and places to fit the Japanese context. The level of mathematical reasoning is rather simple: addition, subtraction, multi-step arithmetic, and single equations or pairs of equations. An invented example in this spirit follows below.
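Here is an invented word problem in the spirit of the dataset, answered with step-by-step reasoning; it is not an actual mawps entry.

```python
# Invented word problem in the spirit of the Japanese mawps dataset,
# answered with step-by-step (Chain-of-Thought) reasoning.
question = "太郎さんはりんごを5個持っています。花子さんから3個もらい、2個食べました。残りは何個ですか。"
# "Taro has 5 apples. He receives 3 from Hanako and eats 2. How many are left?"

cot_answer = (
    "まず、5 + 3 = 8。"  # First, 5 + 3 = 8.
    "次に、8 - 2 = 6。"  # Then, 8 - 2 = 6.
    "答え: 6"            # Answer: 6
)
print(question)
print(cot_answer)
```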
JMMLU
JMMLU is a knowledge dataset using four-choice question answering. It consists of Japanese translations of questions from a portion of the MMLU dataset, which evaluates knowledge at the level of high-school exams. Covering 57 subjects such as astronomy, chemistry, sociology, and international law, questions and answers were translated into Japanese and adjusted to the unique Japanese cultural context, with subjects like Japanese civics, Japanese geography, and Japanese idioms.
XL-Sum
XL-Sum is a summarization dataset based on the paper "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" that leverages BBC News articles in Japanese. Each entry is separated into three parts: the title, the text (the full-length article), and the summary. Topics include global issues, politics, technology, sports, and culture.
Technical Setup
The leaderboard is inspired by the Open LLM Leaderboard. Submitted models are automatically deployed using Hugging Face's Inference Endpoints, evaluated through the llm-jp-eval library (version 1.14.1) with the memory-efficient inference and serving engine vLLM (version v0.6.3), and computed in the backend by mdx, a high-performance computing platform for research in Japan.
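For a flavor of the serving side, here is a minimal vLLM generation sketch; the leaderboard's actual pipeline (Inference Endpoints plus llm-jp-eval on mdx) is considerably more involved, and the model id here is only an example.

```python
# Minimal vLLM generation sketch; the leaderboard's actual pipeline
# (Inference Endpoints + llm-jp-eval) is more involved than this.
from vllm import LLM, SamplingParams

# Model id used as an example; any Hub model supported by vLLM works here.
llm = LLM(model="llm-jp/llm-jp-3-13b-instruct")
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["日本の首都はどこですか？"], params)  # "What is the capital of Japan?"
print(outputs[0].outputs[0].text)
```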
Observations
According to the Japanese LLM guide Awesome Japanese LLM (available in Japanese, English, and French), Meta's Llama open-source architecture appears to be the favorite of many Japanese AI labs. However, other architectures have also been successfully leveraged by the Japanese open-source community, such as Mistral from France's Mistral AI and Qwen from China's Alibaba. These are the architectures that led to the best scores on the Open Japanese LLM Leaderboard.
On general language processing tasks, we observe that Japanese LLMs based on open-source architectures are closing the gap with closed-source LLMs: for example, the Japanese LLM llm-jp-3-13b-instruct, developed by LLM-jp and funded by university grants, reaches performance similar to closed-source models. Domain-specific datasets, such as chABSA (finance), Wikipedia Annotated Corpus (linguistic annotations), code generation (mbpp-ja), and summarization (XL-Sum), remain a challenge for most LLMs. Interestingly, models originating from Japanese companies or labs achieve higher scores on the specific JCommonsenseMorality dataset, which evaluates a model's ability to make choices according to Japanese values when faced with ethical dilemmas.
Future directions
The Open Japanese LLM Leaderboard will follow the development of the evaluation tool llm-jp-eval to reflect the constant evolution of Japanese LLMs. The following are just a few examples of future directions in llm-jp-eval that we would like to support; feel free to contact us to lend a hand or suggest directions!
- New datasets: more Japanese evaluations
The evaluation team of llm-jp-eval is working on this area, currently adding JHumanEval (a Japanese version of HumanEval) and MMLU (Measuring Massive Multitask Language Understanding).
- New evaluation system: Chain-of-Thought evaluation
We would like to compare the performance of LLMs when using Chain-of-Thought prompts versus basic prompts, to gain a finer understanding of model behaviors.
- New metric support: Out-of-Choice rate
For some evaluation tasks that already have a clear list of labels (such as Natural Language Inference), we would like to add a complementary metric testing how often the model predicts out-of-choice tokens. As the choices are provided in the prompt, this will allow us to evaluate how well each LLM is able to follow specific instructions; a minimal sketch of such a metric follows below.
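As a sketch of what such a metric could look like (an illustration, not llm-jp-eval's implementation):

```python
# Sketch of a possible out-of-choice rate: the fraction of model predictions
# that fall outside the task's allowed label set (illustrative only).
def out_of_choice_rate(predictions: list[str], choices: set[str]) -> float:
    """Return the fraction of predictions not matching any allowed label."""
    invalid = sum(1 for p in predictions if p.strip() not in choices)
    return invalid / len(predictions)

nli_labels = {"entailment", "neutral", "contradiction"}
preds = ["entailment", "maybe", "contradiction", "neutral", "I think entailment"]
print(out_of_choice_rate(preds, nli_labels))  # 0.4 -> 40% of answers were out-of-choice
```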
Acknowledgements
Built by the research consortium LLM-jp, the Open Japanese LLM Leaderboard is proudly sponsored by the National Institute of Informatics in Tokyo, Japan, in collaboration with the mdx program, a high-performance computing platform.
We would like to extend our gratitude to Prof. Yusuke Miyao and Namgi Han from the University of Tokyo for their scientific consultation and guidance, as well as to Clémentine Fourrier and Toshihiro Hayashi of Hugging Face, who assisted with the integration and customization of the new evaluation framework and leaderboard template.