Leading the Korean LLM Evaluation Ecosystem




In the fast-evolving landscape of Large Language Models (LLMs), building an “ecosystem” has never been more vital. This trend is evident in several major developments, such as Hugging Face’s efforts to democratize NLP and Upstage’s work on building a Generative AI ecosystem.

Inspired by these industry milestones, in September of 2023, we at Upstage initiated the Open Ko-LLM Leaderboard. Our goal was to quickly develop and introduce an evaluation ecosystem for Korean LLM data, aligning with the worldwide movement towards open and collaborative AI development.

Our vision for the Open Ko-LLM Leaderboard is to cultivate a vibrant Korean LLM evaluation ecosystem, fostering transparency by enabling researchers to share their results and uncover hidden talents within the LLM field. In essence, we’re striving to expand the playing field for Korean LLMs.
To that end, we have developed an open platform where individuals can register their Korean LLM and compete against other models; a minimal sketch of registering a model follows this paragraph.
Moreover, we aimed to create a leaderboard that captures the unique characteristics and culture of the Korean language. To achieve this goal, we made sure that our translated benchmark datasets, such as Ko-MMLU, reflect the distinctive attributes of Korean.
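Since the leaderboard integrates with the Hugging Face model ecosystem, registering a model typically starts by publishing it to the Hugging Face Hub. The sketch below shows only that step; the checkpoint path and repository id are hypothetical, and the leaderboard’s own submission form is not shown.

    # Minimal sketch, assuming a causal LM trained with transformers.
    # The local path and the repo id "my-org/my-ko-llm" are hypothetical examples.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("./my-ko-llm-checkpoint")
    tokenizer = AutoTokenizer.from_pretrained("./my-ko-llm-checkpoint")

    # Publish the weights and tokenizer so an evaluation harness can load
    # the model directly from the Hub.
    model.push_to_hub("my-org/my-ko-llm")
    tokenizer.push_to_hub("my-org/my-ko-llm")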



Leaderboard design decisions: creating a new private test set for fairness

The Open Ko-LLM Leaderboard is characterized by its unique approach to benchmarking, particularly:

  • its adoption of Korean language datasets, as opposed to the prevalent use of English-based benchmarks.
  • the non-disclosure of test sets, contrasting with the open test sets of most leaderboards: we decided to build entirely new datasets dedicated to Open Ko-LLM and keep them private, to prevent test set contamination and ensure a more equitable comparison framework.

While acknowledging the potential for broader impact and utility to the research community through open benchmarks, the choice to maintain a closed test set environment was made with the intention of fostering a more controlled and fair comparative evaluation.
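As an aside, one common way to see why open test sets are vulnerable is to screen for long n-gram overlaps between a training corpus and benchmark items: models trained on contaminated data can memorize answers rather than reason. The sketch below only illustrates that kind of screening; it is not the leaderboard’s actual procedure, and all names in it are ours.

    # Illustrative n-gram overlap contamination check (not the Open Ko-LLM
    # Leaderboard's procedure); the 13-gram window is a conventional choice.
    def ngrams(text: str, n: int = 13) -> set:
        tokens = text.split()
        return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}

    def is_contaminated(train_doc: str, test_item: str, n: int = 13) -> bool:
        # Flag a test item if it shares any long n-gram with a training document.
        return bool(ngrams(train_doc, n) & ngrams(test_item, n))

    # Usage: flagged = [item for item in test_items
    #                   if any(is_contaminated(doc, item) for doc in train_docs)]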



Evaluation Tasks

The Open Ko-LLM Leaderboard adopts the following five evaluation tasks (a minimal scoring sketch follows the list):

  • Ko-ARC (AI2 Reasoning Challenge): Ko-ARC is a multiple-choice test designed to evaluate scientific thinking and understanding. It measures the reasoning ability required to solve scientific problems, evaluating complex reasoning, problem-solving skills, and the understanding of scientific knowledge. The evaluation metric focuses on accuracy rates, reflecting how often the model selects the correct answer from a set of options, thereby gauging its ability to navigate and apply scientific principles effectively.
  • Ko-HellaSwag: Ko-HellaSwag evaluates situational comprehension and prediction ability, either in a generative format or as a multiple-choice setup. It tests the capability to predict the most likely next scenario given a situation, serving as an indicator of the model’s understanding of and reasoning about situations. The key metric is accuracy, assessing the quality of the predictions when approached as a multiple-choice task.
  • Ko-MMLU (Massive Multitask Language Understanding): Ko-MMLU assesses language comprehension across a wide range of topics and fields in a multiple-choice format. This broad test demonstrates how well a model functions across various domains, showcasing its versatility and depth in language understanding. Overall accuracy across tasks and domain-specific performance are key metrics, highlighting strengths and weaknesses in different areas of knowledge.
  • Ko-Truthful QA: Ko-Truthful QA is a multiple-choice benchmark designed to evaluate the model’s truthfulness and factual accuracy. Unlike a generative format where the model freely generates responses, in this multiple-choice setting the model is tasked with selecting the most accurate and truthful answer from a set of options. This approach emphasizes the model’s ability to discern truthfulness and accuracy within a constrained selection framework. The primary metric for Ko-Truthful QA focuses on the accuracy of the model’s selections, assessing its consistency with known facts and its ability to identify the most truthful response among the provided choices.
  • Ko-CommonGEN V2: A benchmark newly created for the Open Ko-LLM Leaderboard, it assesses whether LLMs can generate outputs that align with Korean common sense given certain conditions, testing the model’s capability to produce contextually and culturally relevant outputs in the Korean language.
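The multiple-choice tasks above are all scored by accuracy. As a rough illustration of how such scoring is typically done with a Hugging Face causal LM (comparing per-option log-likelihoods, as common evaluation harnesses do), here is a minimal sketch; the model id and data are placeholders, and this is not the leaderboard’s exact evaluation code.

    # Multiple-choice scoring sketch: pick the option whose tokens get the highest
    # log-likelihood from the model, then measure accuracy over the evaluation set.
    # Assumes a causal LM loadable with transformers; "my-org/my-ko-llm" is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "my-org/my-ko-llm"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    @torch.no_grad()
    def option_loglik(question: str, option: str) -> float:
        # Log-likelihood of the option tokens, conditioned on the question.
        q_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
        logits = model(full_ids).logits
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
        targets = full_ids[0, 1:]
        return sum(log_probs[t, targets[t]].item()
                   for t in range(q_len - 1, full_ids.shape[1] - 1))

    def predict(question: str, options: list[str]) -> int:
        scores = [option_loglik(question, o) for o in options]
        return max(range(len(options)), key=scores.__getitem__)

    # accuracy = sum(predict(q, opts) == gold for q, opts, gold in eval_set) / len(eval_set)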



A leaderboard in action: the barometer of Ko-LLM

The Open Ko-LLM Leaderboard has exceeded expectations, with over 1,000 models submitted. In comparison, the original English Open LLM Leaderboard now hosts over 4,000 models. The Ko-LLM leaderboard has reached a quarter of that number within just five months of its launch. We are grateful for this widespread participation, which shows the vibrant interest in Korean LLM development.

Of particular note is the diverse competition, encompassing individual researchers, corporations, and academic institutions such as KT, Lotte Information & Communication, Yanolja, MegaStudy Maum AI, 42Maru, the Electronics and Telecommunications Research Institute (ETRI), KAIST, and Korea University.
One standout submission is KT’s Mi:dm 7B model, which not only topped the rankings among models with 7B parameters or fewer but also became accessible for public use, marking a significant milestone.

We also observed that, more generally, two kinds of models show strong performance on the leaderboard:

  • models that underwent cross-lingual transfer or fine-tuning in Korean (like Upstage’s SOLAR)
  • models fine-tuned from LLaMa2, Yi, and Mistral, emphasizing the importance of leveraging solid foundation models for fine-tuning.

Managing such a large leaderboard has not come without its own challenges. The Open Ko-LLM Leaderboard aims to align closely with the Open LLM Leaderboard’s philosophy, especially in integrating with the Hugging Face model ecosystem. This strategy ensures that the leaderboard is accessible, making it easier for participants to take part, a crucial factor in its operation. However, there are limitations due to the infrastructure, which relies on 16 A100 80GB GPUs. This setup faces challenges, particularly when running models larger than 30 billion parameters, as they require too much compute. This leads to prolonged pending states for many submissions. Addressing these infrastructure challenges is essential for future enhancements of the Open Ko-LLM Leaderboard.
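For a sense of scale (our own back-of-the-envelope figures, not official sizing data): fp16 weights take roughly 2 bytes per parameter, so a 30B-parameter model already needs about 56 GiB before activations and KV cache, nearly filling a single 80 GB A100; anything much larger must be sharded across several GPUs, which shrinks the number of submissions that can be evaluated in parallel.

    # Rough estimate of fp16 weight memory per model size (our own approximation,
    # ignoring activations, KV cache, and framework overhead).
    def fp16_weight_gib(params_billion: float) -> float:
        return params_billion * 1e9 * 2 / 1024**3

    for size in (7, 13, 30, 70):
        print(f"{size}B params -> ~{fp16_weight_gib(size):.0f} GiB of fp16 weights")
    # 7B ~13 GiB, 13B ~24 GiB, 30B ~56 GiB, 70B ~130 GiB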



Our vision and next steps

We recognize several limitations in current leaderboard models when considered in real-world contexts:

  • Outdated Data: Datasets like SQuAD and KLUE become outdated over time. Data evolves and transforms constantly, but existing leaderboards remain fixed in a particular timeframe, making them less reflective of the current moment, as vast numbers of new data points are generated every day.
  • Failure to Reflect the Real World: In B2B and B2C services, data is continuously accumulated from users or industries, and edge cases or outliers constantly arise. True competitive advantage lies in responding well to these challenges, yet current leaderboard systems lack the means to measure this capability. Real-world data is perpetually generated, changing, and evolving.
  • Questionable Meaningfulness of Competition: Many models are specifically tuned to perform well on the test sets, potentially leading to another kind of overfitting within the test set. Thus, the current leaderboard system operates in a leaderboard-centric manner rather than being real-world-centric.

We therefore plan to further develop the leaderboard so that it addresses these issues and becomes a trusted resource widely recognized by many. By incorporating a variety of benchmarks that have a strong correlation with real-world use cases, we aim to make the leaderboard not only more relevant but also genuinely helpful to businesses. We aspire to bridge the gap between academic research and practical application, and will continuously update and enhance the leaderboard, through feedback from both the research community and industry practitioners, to ensure that the benchmarks remain rigorous, comprehensive, and up-to-date. Through these efforts, we hope to contribute to the advancement of the field by providing a platform that accurately measures and drives the progress of large language models in solving practical and impactful problems.

If you develop datasets and would like to collaborate with us on this, we will be delighted to speak with you; you can contact us at chanjun.park@upstage.ai or contact@upstage.ai!

As a side note, we believe that evaluations in an actual online environment, as opposed to benchmark-based evaluations, are highly meaningful. Even within benchmark-based evaluations, there is a need for benchmarks to be updated monthly or to assess domain-specific attributes more specifically; we would like to encourage such initiatives.



Many thanks to our partners

The journey of the Open Ko-LLM Leaderboard began with a collaboration agreement to develop a Korean-style leaderboard, in partnership with Upstage and the National Information Society Agency (NIA), a key national institution in Korea. This partnership marked the starting signal, and within just a month, we were able to launch the leaderboard.
To validate common sense reasoning, we collaborated with Professor Heuiseok Lim’s research team at Korea University to incorporate KoCommonGen V2 as an additional task for the leaderboard.
Building a robust infrastructure was crucial for success. To that end, we are grateful to Korea Telecom (KT) for their generous support of GPU resources and to Hugging Face for their continued support. It is encouraging that the Open Ko-LLM Leaderboard has established a direct line of communication with Hugging Face, a global leader in natural language processing, and we are in continuous discussion to push new initiatives forward.
Furthermore, the Open Ko-LLM Leaderboard boasts a prestigious consortium of credible partners: the National Information Society Agency (NIA), Upstage, KT, and Korea University. The participation of these institutions, especially the inclusion of a national agency, lends significant authority and trustworthiness to the endeavor, underscoring its potential as a cornerstone in the academic and practical exploration of language models.


