What's going on with the Open LLM Leaderboard?




Recently an interesting discussion arose on Twitter following the release of Falcon 🦅 and its addition to the Open LLM Leaderboard, a public leaderboard comparing open access large language models.

The discussion centered around one of the four evaluations displayed on the leaderboard: a benchmark for measuring Massive Multitask Language Understanding (shortname: MMLU).

The community was surprised that the MMLU evaluation numbers of the current top model on the leaderboard, the LLaMA model 🦙, were significantly lower than the numbers in the published LLaMA paper.

So we decided to dive into this rabbit hole to understand what was happening and how to fix it 🕳🐇

In our quest, we discussed with both the great @javier-m, who collaborated on the evaluations of LLaMA, and the amazing @slippylolo from the Falcon team. This being said, all the errors in what follows should of course be attributed to us rather than to them!

Along this journey with us you'll learn a lot about the ways you can evaluate a model on a single evaluation and whether or not to believe the numbers you see online and in papers.

Ready? Then buckle up, we’re taking off 🚀.



What is the Open LLM Leaderboard?

First, note that the Open LLM Leaderboard is actually just a wrapper running the open-source benchmarking library Eleuther AI LM Evaluation Harness, created by the EleutherAI non-profit AI research lab famous for creating The Pile and training GPT-J, GPT-NeoX-20B, and Pythia. A team with serious credentials in the AI space!

This wrapper runs evaluations using the Eleuther AI Harness on the spare cycles of Hugging Face's compute cluster, and stores the results in a dataset on the hub that are then displayed on the leaderboard's online space.

For the LLaMA models, the MMLU numbers obtained with the Eleuther AI LM Evaluation Harness significantly differ from the MMLU numbers reported in the LLaMA paper.

Why is that the case?



1001 flavors of MMLU

Well, it turns out that the LLaMA team adapted another code implementation available online: the evaluation code proposed by the original UC Berkeley team which developed the MMLU benchmark, available at https://github.com/hendrycks/test, and that we'll call here the "Original implementation".

When diving further, we found yet another interesting implementation for evaluating on the very same MMLU dataset: the evaluation code provided in Stanford CRFM's very comprehensive evaluation benchmark, Holistic Evaluation of Language Models, that we'll call here the HELM implementation.

Both the EleutherAI Harness and Stanford HELM benchmarks are interesting because they gather many evaluations in a single codebase (including MMLU), and thus give a wide view of a model's performance. This is the reason the Open LLM Leaderboard wraps such "holistic" benchmarks instead of using individual code bases for each evaluation.

To settle the case, we decided to run these three possible implementations of the same MMLU evaluation on a set of models and rank them according to these results:

(Note that the Harness implementation has been recently updated – more on this at the end of our post)

The results are surprising:

[Figure: MMLU scores of the evaluated models under the three implementations – both the absolute numbers and the model rankings differ]

You can find the full evaluation numbers at the end of the post.

These different implementations of the same benchmark give widely different numbers and even change the ranking order of the models on the leaderboard!

Let's try to understand where this discrepancy comes from 🕵️ But first, let's briefly understand how we can automatically evaluate behaviors in modern LLMs.



How we automatically evaluate a model in today's LLM world

MMLU is a multiple choice question test, so a rather simple benchmark (compared to open-ended questions), but as we'll see, this still leaves a lot of room for implementation details and differences. The benchmark consists of questions with four possible answers covering 57 general knowledge domains grouped in coarse-grained categories: "Humanities", "Social Sciences", "STEM", etc.

For each question, only one of the provided answers is the correct one. Here is an example:

Question: Glucose is transported into the muscle cell:


Choices:
A. via protein transporters called GLUT4.
B. only within the presence of insulin.
C. via hexokinase.
D. via monocarbylic acid transporters.


Correct answer: A

Note: you can very easily explore more of this dataset in the dataset viewer on the hub.

Large language models are simple models in the AI model zoo. They take a string of text as input (called a "prompt"), which is cut into tokens (words, sub-words or characters, depending on how the model is built) and fed to the model. From this input, they generate a distribution of probabilities for the next token, over all the tokens they know (the so-called "vocabulary" of the model): you can therefore get how "probable" any token is as a continuation of the input prompt.

We can use these probabilities to choose a token, for instance the most probable one (or we can introduce some slight noise with sampling to avoid having "too mechanical" answers). Adding our chosen token to the prompt and feeding it back to the model allows us to generate another token, and so on, until whole sentences are created as continuations of the input prompt:

[Figure: iterative, token-by-token generation of a continuation from an input prompt]

This is how ChatGPT or Hugging Chat generate answers.
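If you want to see this mechanism in action, here is a minimal sketch using the transformers library (with "gpt2" as a purely illustrative small model) that inspects the next-token distribution and generates a greedy continuation:

```python
# A minimal sketch: inspect the next-token distribution and generate a continuation.
# "gpt2" is only a small illustrative model choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Glucose is transported into the muscle cell"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocabulary_size)

# Probability of every token in the vocabulary being the next token after the prompt
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob.item():.3f}")

# Greedy decoding: repeatedly append the most probable token to the prompt
generated = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(generated[0]))
```

Setting `do_sample=True` instead would introduce the slight noise mentioned above.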

In summary, we have two main ways to get information out of a model in order to evaluate it:

  1. get the probabilities that some specific groups of tokens are continuations of the prompt – and compare these probabilities for our predefined choices;
  2. get a text generation from the model (by repeatedly choosing tokens as we've seen) – and compare these text generations to the texts of various predefined choices.

Armed with this information, let's dive into our three implementations of MMLU to find out what input is sent to the models, what is expected as output, and how these outputs are compared.



MMLU comes in all shapes and sizes: the prompts

Let's compare an example of the prompt sent to the models by each implementation for the same MMLU dataset example:

Original implementation (Ollmer PR):

The following are multiple choice questions (with answers) about us foreign policy.
How did the 2008 financial crisis affect America's international reputation?
A. It damaged support for the US model of political economy and capitalism
B. It created anger at the US for exaggerating the crisis
C. It increased support for American global leadership under President Obama
D. It reduced global use of the US dollar
Answer:

HELM (commit cab5d89):

The following are multiple choice questions (with answers) about us foreign policy.

Question: How did the 2008 financial crisis affect America's international reputation?
A. It damaged support for the US model of political economy and capitalism
B. It created anger at the US for exaggerating the crisis
C. It increased support for American global leadership under President Obama
D. It reduced global use of the US dollar
Answer:

AI Harness (commit e47e01b):

Question: How did the 2008 financial crisis affect America's international reputation?
Choices:
A. It damaged support for the US model of political economy and capitalism
B. It created anger at the US for exaggerating the crisis
C. It increased support for American global leadership under President Obama
D. It reduced global use of the US dollar
Answer:

The differences between them can seem small. Did you spot them all? Here they are (a small sketch building the three variants follows the list):

  • First sentence, instruction, and topic: Few differences. HELM adds an extra space, and the Eleuther LM Harness does not include the topic line
  • Question: HELM and the LM Harness add a "Question:" prefix
  • Choices: the Eleuther LM Harness prepends them with the keyword "Choices"
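
To make these formatting differences concrete, here is a small Python sketch that rebuilds the three prompt variants for one question (zero-shot for brevity; the `example` dict and its field names are our own, and the exact whitespace details are only approximate):

```python
# Sketch of the three MMLU prompt formats discussed above (zero-shot, for brevity).
# The `example` dict and its keys are our own illustrative structure.
example = {
    "subject": "us foreign policy",
    "question": "How did the 2008 financial crisis affect America's international reputation?",
    "choices": [
        "It damaged support for the US model of political economy and capitalism",
        "It created anger at the US for exaggerating the crisis",
        "It increased support for American global leadership under President Obama",
        "It reduced global use of the US dollar",
    ],
}
letters = ["A", "B", "C", "D"]
choices_block = "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, example["choices"]))

# Original implementation: instruction + question + lettered choices + "Answer:"
original_prompt = (
    f"The following are multiple choice questions (with answers) about {example['subject']}.\n"
    f"{example['question']}\n{choices_block}\nAnswer:"
)

# HELM: adds a "Question:" prefix (and extra spacing after the instruction)
helm_prompt = (
    f"The following are multiple choice questions (with answers) about {example['subject']}.\n\n"
    f"Question: {example['question']}\n{choices_block}\nAnswer:"
)

# EleutherAI Harness (Jan 2023): "Question:" prefix and a "Choices:" keyword, no topic line
harness_prompt = (
    f"Question: {example['question']}\nChoices:\n{choices_block}\nAnswer:"
)
```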



Now how do we evaluate the model from these prompts?

Let's start with how the original MMLU implementation extracts the predictions of the model. In the original implementation we compare the probabilities predicted by the model for the four answers only:

[Figure: the probabilities of the four answer letters are compared against each other]

This can be helpful for the model in some cases, for instance, as you can see here:

[Figure: example where the correct answer gets the highest probability among the four letters]

In this case, the model got a +1 score for ranking the correct answer highest among the four options. But if we take a look at the full vocabulary, it would rather have generated a word outside of our four options: the word "Zygote" (this is more of an example than a real use case 🙂)
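
Here is a rough sketch of this scoring scheme (a simplified illustration, not the original codebase's exact code): we look at the model's next-token distribution after "Answer:" and only rank the four letter tokens against each other, ignoring everything else in the vocabulary. As before, "gpt2" is only an illustrative model.

```python
# Sketch of "Original implementation"-style scoring: compare the probabilities of the
# four answer letters as the next token, and nothing else.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "Glucose is transported into the muscle cell:\n"
    "A. via protein transporters called GLUT4.\n"
    "B. only in the presence of insulin.\n"
    "C. via hexokinase.\n"
    "D. via monocarbylic acid transporters.\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Token ids of " A", " B", " C", " D" (with a leading space, since they follow "Answer:")
letter_token_ids = [tokenizer.encode(f" {letter}")[-1] for letter in "ABCD"]
letter_probs = torch.softmax(next_token_logits, dim=-1)[letter_token_ids]

# Even if some other token (say "Zygote") is more probable over the full vocabulary,
# this scheme only compares the four letters against each other.
prediction = "ABCD"[letter_probs.argmax().item()]
score = int(prediction == "A")  # the gold answer for this question is A
```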

How can we make sure the model makes as few of these types of errors as possible?

We can use a "few shots" approach, in which we provide the model with one or several examples in the prompt, along with their expected answers. Here is how it looks:

[Figure: a few-shot prompt, with a solved example question and its answer prepended to the test question]

Here, the model has one example of the expected behavior and is thus less likely to predict answers outside of the expected range of answers.

Since this improves performance, MMLU is usually evaluated in 5 shots (prepending 5 examples to each prompt) in all our evaluations: the original implementation, EleutherAI LM Harness and HELM. (Note: across benchmarks, even though the same 5 examples are used, their order of introduction to the model can vary, which is also a possible source of difference that we will not investigate here. You also obviously have to pay attention to avoid leaking some answers in the few shot examples you use…)
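
As a side note, here is a small sketch of how such a k-shot prompt can be assembled; the helper functions and field names are ours and purely illustrative:

```python
# Sketch: assembling a k-shot prompt by prepending solved examples (k=5 for MMLU).
# The example records and their field names are illustrative, not real data.
def format_example(example: dict, include_answer: bool) -> str:
    choices = "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", example["choices"]))
    text = f"{example['question']}\n{choices}\nAnswer:"
    if include_answer:
        # For the few-shot examples we also reveal the expected answer
        text += f" {example['answer']}\n\n"
    return text

def build_few_shot_prompt(dev_examples: list, test_example: dict, subject: str) -> str:
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "".join(format_example(ex, include_answer=True) for ex in dev_examples)
    return header + shots + format_example(test_example, include_answer=False)
```

Note that the order of `dev_examples` matters for the reasons mentioned above, and the test question must of course never appear among the shots.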

HELM: Let's now turn to the HELM implementation. While the few-shot prompt is generally similar, the way the model is evaluated is quite different from the original implementation we've just seen: we use the next-token output probabilities from the model to select a text generation, and we compare it to the text of the expected answer, as displayed here:

[Figure: generation-based evaluation – the generated text is compared to the text of the expected answer]

In this case, if our "Zygote" token was instead the highest-probability one (as we've seen above), the model answer ("Zygote") would be wrong and the model would not score any points for this question:

[Figure: example where the generated answer ("Zygote") does not match the expected answer, so no point is scored]
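
Here is a rough sketch of this generation-based scoring (the exact matching rules used by HELM may differ; this only illustrates the principle, with "gpt2" again as a purely illustrative model):

```python
# Sketch in the spirit of the HELM approach: let the model generate text after "Answer:"
# and check whether the generated text starts with the expected letter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generation_score(prompt: str, expected_letter: str) -> int:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    # Keep only the newly generated token(s), not the prompt
    generated_text = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:]).strip()
    # If the free-form generation is anything else (e.g. "Zygote"), no point is scored
    return int(generated_text.startswith(expected_letter))
```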

Harness: Now we finally turn to the EleutherAI Harness implementation as of January 2023, which was used to compute the first numbers for the leaderboard. As we will see, here we have yet another way to compute a score for the model on the very same evaluation dataset (note that this implementation has been recently updated – more on this at the end).

In this case, we are using the probabilities again, but this time the probabilities of the full answer sequence, with the letter followed by the text of the answer, for instance "C. The second pharyngeal arch". To compute the probability for a full answer we get the probability for each token (as we saw above) and gather them. For numerical stability we gather them by summing the logarithms of the probabilities, and we can decide (or not) to compute a normalization in which we divide the sum by the number of tokens, to avoid giving too much advantage to longer answers (more on this later). Here is how it looks:

[Figure: comparing the summed log-probabilities of the full answer sequences]
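
Here is a rough sketch of this full-answer log-likelihood scoring (a simplified illustration rather than the Harness's actual code, with "gpt2" once more as a purely illustrative model):

```python
# Sketch in the spirit of the Jan 2023 Harness approach: for each candidate full answer
# appended to the prompt, sum the log probabilities of its tokens, optionally divide by
# the answer length, and pick the highest-scoring candidate.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_loglikelihood(prompt: str, answer: str, normalize: bool = False) -> float:
    prompt_length = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log_probs[i] is the log-distribution over the token at position i + 1
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    answer_token_ids = full_ids[0, prompt_length:]
    positions = range(prompt_length - 1, full_ids.shape[1] - 1)
    total = sum(log_probs[pos, tok].item() for pos, tok in zip(positions, answer_token_ids))
    # Length normalization avoids giving too much of a handicap to longer answers
    return total / len(answer_token_ids) if normalize else total

def pick_full_answer(prompt: str, full_answers: list) -> int:
    scores = [answer_loglikelihood(prompt, " " + answer) for answer in full_answers]
    return max(range(len(scores)), key=scores.__getitem__)
```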

Here is a table summarizing the answers provided to and generated by the model, to recap what we've seen so far:

Original implementation – we compare the probabilities of the following letter answers:

A
B
C
D

HELM – the model is expected to generate the following letter answer as text:

A

AI Harness (as of Jan 2023) – we compare the probabilities of the following full answers:

A. It damaged support for the US model of political economy and capitalism
B. It created anger at the US for exaggerating the crisis
C. It increased support for American global leadership under President Obama
D. It reduced global use of the US dollar

We've covered them all!

Now let's compare the model scores for these three possible ways to evaluate the models:

Model | MMLU (HELM) | MMLU (Harness) | MMLU (Original)
llama-65b | 0.637 | 0.488 | 0.636
tiiuae/falcon-40b | 0.571 | 0.527 | 0.558
llama-30b | 0.583 | 0.457 | 0.584
EleutherAI/gpt-neox-20b | 0.256 | 0.333 | 0.262
llama-13b | 0.471 | 0.377 | 0.47
llama-7b | 0.339 | 0.342 | 0.351
tiiuae/falcon-7b | 0.278 | 0.35 | 0.254
togethercomputer/RedPajama-INCITE-7B-Base | 0.275 | 0.34 | 0.269

We can see that for the same dataset, both absolute scores and model rankings (see the first figure) are very sensitive to the evaluation method we decide to use.

Say you have trained yourself a perfect reproduction of the LLaMA 65B model and evaluated it with the harness (score 0.488, see above). You are now comparing it to the published number (evaluated with the original MMLU implementation, so with a score of 0.637). With such a 30% difference in score you're probably thinking: "Oh gosh, I have completely messed up my training 😱". But nothing could be further from the truth; these are just numbers which are not at all comparable, even if they're both labelled as "MMLU score" (and evaluated on the very same MMLU dataset).

Now, is there a "best way" to evaluate a model among all the ones we have seen? It's a tricky question. Different models may fare differently when evaluated one way or another, as we see above when the rankings change. To keep this as fair as possible, one may be tempted to select an implementation where the average score for all tested models is the highest, so that we "unlock" as many capabilities as possible from the models. In our case, that would mean using the log-likelihood option of the original implementation. But as we saw above, using the log-likelihood also gives some hints to the model indirectly by restricting the scope of possible answers, and thus maybe helps the less powerful models too much. Also, log-likelihood is easy to access for open-source models but is not always exposed for closed-source API models.

And you, reader, what do you think? This blog post is already long, so it's time to open the discussion and invite your comments. Please come discuss this topic in the following discussion thread of the Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/82



Conclusion

A key takeaway lesson from our journey is that evaluations are strongly tied to their implementations – down to minute details such as prompts and tokenization. The mere indication of "MMLU results" gives you little to no information about how you can compare these numbers to others you evaluated with another library.

This is why open, standardized, and reproducible benchmarks such as the EleutherAI Eval Harness or Stanford HELM are invaluable to the community. Without them, comparing results across models and papers would be impossible, stifling research on improving LLMs.

Post scriptum: In the case of the Open LLM Leaderboard, we have decided to keep using community-maintained evaluation libraries. Thankfully, during the writing of this blog post, the amazing community around the EleutherAI Harness, and in particular ollmer, have done tremendous work updating the evaluation of MMLU in the harness to make it similar to the original implementation and match these numbers.

We are currently re-running the full leaderboard with the updated version of the EleutherAI Eval Harness, so expect to see scores coming from the Eleuther Harness v2 in the next few weeks! (Running all the models again will take some time, stay tuned :hugs:)



Acknowledgements:

We are very grateful to Xavier Martinet, Aurélien Rodriguez and Sharan Narang from the LLaMA team for helpful suggestions on this blog post as well as for having answered all our questions.



Reproducibility hashes:

Here are the commit hashes of the various code implementations used in this blog post.


