Open LLM Leaderboard: DROP deep dive




Recently, three new benchmarks were added to the Open LLM Leaderboard: Winogrande, GSM8k and DROP, using the original implementations reproduced in the EleutherAI Harness. A cursory look at the scores for DROP revealed something strange was going on, with the overwhelming majority of models scoring less than 10 out of 100 on their f1-score! We did a deep dive to understand what was happening, come along to see what we found out!



Initial observations

DROP (Discrete Reasoning Over Paragraphs) is an evaluation where models must extract relevant information from English-text paragraphs before executing discrete reasoning steps on them (for instance, sorting or counting items to arrive at the correct answer, see the table below for examples). The metrics used are a custom f1 score and exact match.

Examples of reasoning and paragraphs from the original article.

We added it to the Open LLM Leaderboard three weeks ago, and observed that the f1-scores of pretrained models followed an unexpected trend: when we plotted DROP scores against the leaderboard original average (of ARC, HellaSwag, TruthfulQA and MMLU), which is a reasonable proxy for overall model performance, we expected DROP scores to be correlated with it (with better models having better performance). However, this was only the case for a small number of models, and all the others had a very low DROP f1-score, below 10.

Two trends can be observed in the DROP scores: some follow the average (on the diagonal), others are stuck around 5 (vertical line on the right of the graph).



Normalization interrogations

During our first deeper dive into these surprising behaviors, we observed that the normalization step was possibly not working as intended: in some cases, this normalization ignored the correct numerical answers when they were directly followed by a whitespace character other than a space (a line return, for example).
Let's look at an example, with the generation being 10\n\nPassage: The 2011 census recorded a population of 1,001,360, and the gold answer being 10.

Normalization happens in several steps, each applied to both the generation and the gold:

  1. Split on separators |, -, or space
    The beginning sequence of the generation, 10\n\nPassage:, contains no such separator, and is therefore considered a single entity after this step.
  2. Punctuation removal
    The first token then becomes 10\n\nPassage (the : is removed)
  3. Homogenization of numbers
    Every string that can be cast to float is considered a number and cast to float, then re-converted to string. 10\n\nPassage stays the same, as it cannot be cast to float, whereas the gold 10 becomes 10.0.
  4. Other steps
    A lot of other normalization steps follow (removing articles, removing other whitespaces, etc.) and our original example becomes 10 passage 2011.0 census recorded population of 1001360.0.

However, the overall score is not computed on the string, but on the bag of words (BOW) extracted from the string, here {'recorded', 'population', 'passage', 'census', '2011.0', '1001360.0', '10'}, which is compared with the BOW of the gold, also normalized in the above manner, {10.0}. As you can see, they do not intersect, even though the model predicted the correct output! (The sketch below walks through this behavior end to end.)
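To make the walkthrough above concrete, here is a minimal sketch of the normalization and bag-of-words comparison. It approximates the behavior described above rather than reproducing the actual harness code; normalize_to_bow and _as_float_string are helper names introduced for this example.

```python
import re
import string

ARTICLES = {"a", "an", "the"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def _as_float_string(token: str) -> str | None:
    """Return the float-normalized form of a token (e.g. '10' -> '10.0'), or None."""
    try:
        return str(float(token.replace(",", "")))
    except ValueError:
        return None

def normalize_to_bow(answer: str) -> set[str]:
    """Rough approximation of the DROP normalization pipeline described above."""
    bow = []
    # 1. Split on separators |, -, and space. Crucially, "\n" is NOT a separator,
    #    so "10\n\nPassage:" survives this step as a single token.
    for token in re.split(r"[ \-|]", answer.lower()):
        # 2. Punctuation removal ("10\n\npassage:" -> "10\n\npassage").
        if _as_float_string(token) is None:
            token = token.translate(PUNCT_TABLE)
        # 3. Number homogenization: anything castable to float becomes e.g. "10.0".
        #    "10\n\npassage" cannot be cast, so its leading 10 is never homogenized.
        token = _as_float_string(token) or token
        # 4. Other steps (article removal, removal of remaining whitespace, ...),
        #    which only now split "10\n\npassage" into "10" and "passage".
        bow.extend(t for t in token.split() if t not in ARTICLES)
    return set(bow)

generation = "10\n\nPassage: The 2011 census recorded a population of 1,001,360"
gold = "10"

pred_bow = normalize_to_bow(generation)  # {'10', 'passage', '2011.0', 'census', ...}
gold_bow = normalize_to_bow(gold)        # {'10.0'}
print(pred_bow & gold_bow)               # set() -> bag-of-words f1 of 0 despite a correct answer
```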

In summary, if a number is followed by any kind of whitespace other than a simple space, it will not go through the number normalization, and hence can never match the gold if the gold is also a number! This first issue was likely to mess up the scores quite a bit, but it clearly was not the only factor causing DROP scores to be so low. We decided to investigate a bit more.



Diving into the results

Extending our investigations, our friends at Zeno joined us and undertook a much more thorough exploration of the results, looking at 5 models which were representative of the problems we noticed in DROP scores: falcon-180B and mistral-7B were underperforming compared to what we were expecting, Yi-34B and tigerbot-70B had a very good performance on DROP correlated with their average scores, and facebook/xglm-7.5B fell in the middle.

You can give analyzing the results a try in the Zeno project here if you want to!

The Zeno team found two even more concerning features:

  1. Not a single model got a correct result on floating point answers
  2. High-quality models which generate long answers also have a lower f1-score

At this point, we believed that both failure cases were actually caused by the same root factor: using . as a stopword token (to end the generations):

  1. Floating point answers are systematically interrupted before their generation is complete
  2. Higher quality models, which try to match the few-shot prompt format, will generate Answer\n\nPlausible prompt for the next question., and only stop during the plausible prompt continuation after the actual answer, on the first ., therefore generating too many words and getting a bad f1 score.

We hypothesized that both these problems could be fixed by using \n instead of . as the end of generation stop word.
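Here is a small, self-contained illustration of this hypothesis. The generations and the truncate_at_stop helper are made up for the example; they only mimic how a stop sequence cuts a model's output short.

```python
def truncate_at_stop(generation: str, stop: str) -> str:
    """Keep only the text before the first occurrence of the stop sequence,
    mimicking how an end-of-generation stop sequence cuts the output short."""
    return generation.split(stop, 1)[0]

# Hypothetical generations, one per failure mode.
float_answer = "12.25 yards\n\nQuestion: How many field goals were scored?"
verbose_answer = "Answer: 4 touchdowns\n\nPassage: In the third quarter, the Bears drove 80 yards."

# With "." as the stop sequence, the float answer loses its decimals and the
# verbose answer only stops deep inside its invented next prompt.
print(truncate_at_stop(float_answer, "."))    # '12'
print(truncate_at_stop(verbose_answer, "."))  # 'Answer: 4 touchdowns\n\nPassage: In the third quarter, the Bears drove 80 yards'

# With "\n" as the stop sequence, both stop right after the actual answer.
print(truncate_at_stop(float_answer, "\n"))   # '12.25 yards'
print(truncate_at_stop(verbose_answer, "\n")) # 'Answer: 4 touchdowns'
```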



Changing the end of generation token

So we gave it a try! We investigated using \n as the end of generation token on the available results. We split the generated answer on the first \n it contained, if one was present, and recomputed the scores, as sketched below.
Note that this is only an approximation of the correct result, as it won't fix answers that were cut too early on . (for example floating point answers) – but it also won't give an unfair advantage to any model, as all of them were affected by this problem.
However, it was the best we could do without re-running the models (as we wanted to keep the community posted as soon as possible).
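A sketch of that recomputation might look as follows, reusing normalize_to_bow from the earlier sketch. The bow_f1 helper is a simplified, single-span version of the DROP f1, and the JSON-lines layout (with generation and gold fields) is an assumption for illustration, not the leaderboard's actual storage format.

```python
import json

def bow_f1(pred_bow: set[str], gold_bow: set[str]) -> float:
    """Simplified single-span bag-of-words f1 between a normalized prediction and gold."""
    overlap = len(pred_bow & gold_bow)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_bow)
    recall = overlap / len(gold_bow)
    return 2 * precision * recall / (precision + recall)

def rescore(predictions_path: str) -> float:
    """Re-split each stored generation on its first newline, then recompute the mean f1.

    Assumes a hypothetical JSON-lines file with 'generation' and 'gold' fields;
    the real result files may be laid out differently.
    """
    scores = []
    with open(predictions_path) as f:
        for line in f:
            example = json.loads(line)
            truncated = example["generation"].split("\n", 1)[0]  # keep text before the first "\n"
            scores.append(
                bow_f1(normalize_to_bow(truncated), normalize_to_bow(example["gold"]))
            )
    return sum(scores) / len(scores)
```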

The results we got were the following – splitting on \n correlates very well with other scores and therefore with overall performance.

We can see in orange that the scores computed on the new strings correlate much better with the average performance.



So what’s next?

A quick calculation showed that re-running the full evaluation of all models would be quite costly (the full update took 8 years of GPU time, and a lot of it was taken by DROP), so we estimated how much it would cost to only re-run the failing examples.

In 10% of the cases, the gold answer is a floating point number (for example 12.25) and the model prediction starts with the correct beginning (for our example, 12) but is cut off on a . – these predictions would likely have been correct if the generation had continued. We would definitely need to re-run them!
Our estimation does not count generated sentences that finish with a number which was possibly interrupted (40% of the other generations), nor any prediction messed up by its normalization.

To get correct results, we would thus need to re-run more than 50% of the examples, a huge amount of GPU time! We need to be certain that the implementation we run is correct this time.
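As a rough back-of-the-envelope check of that figure (assuming the 40% applies to the examples whose gold answer is not a float, which is our reading of the numbers, not an exact accounting):

```python
# Back-of-the-envelope reading of the percentages above (an assumed combination,
# not an exact accounting of the leaderboard data).
float_gold_cut_on_dot = 0.10              # gold is a float and the prediction was cut on "."
possibly_interrupted = 0.40 * (1 - 0.10)  # other generations ending in a possibly interrupted number
print(f"{float_gold_cut_on_dot + possibly_interrupted:.0%}")
# ~46% before counting predictions mangled by normalization,
# which is what pushes the total past 50% of examples to re-run.
```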

After discussing it with the incredible EleutherAI team (both on GitHub and internally), who guided us through the code and helped our investigations, it became very clear that the LM Eval Harness implementation follows the "official DROP" code very strictly: a new version of this benchmark's evaluation thus needs to be developed!
We have therefore taken the decision to remove DROP from the Open LLM Leaderboard until a new version arises.

One takeaway of this investigation is the value of having the many eyes of the community collaboratively investigate a benchmark in order to detect errors that were previously missed. Here again the power of open source, community, and developing in the open shines, in that it allows us to transparently investigate the root cause of an issue on a benchmark which has been out there for a couple of years.

We hope that interested members of the community will join forces with the academics working on DROP evaluation to fix both its scoring and its normalization. We would love for it to become usable again, as the dataset itself is really quite interesting and cool. We encourage you to give feedback on how we should evaluate DROP in this issue.

Thanks to the many community members who pointed out issues with DROP scores, and many thanks to the EleutherAI Harness and Zeno teams for their great help on this issue.




