A company that wants to use a large language model (LLM) to summarize sales reports or triage customer inquiries can choose from hundreds of distinct LLMs with dozens of model variations, each offering slightly different performance.
To narrow down the choice, companies often rely on LLM ranking platforms, which gather user feedback on model interactions to rank the latest LLMs based on how they perform on certain tasks.
But MIT researchers found that a handful of user interactions can skew the results, leading someone to mistakenly believe one LLM is the best choice for a particular use case. Their study reveals that removing a tiny fraction of crowdsourced data can change which models are top-ranked.
They developed a fast method to audit ranking platforms and determine whether they are prone to this problem. The evaluation technique identifies the individual votes most responsible for skewing the results, so users can inspect those influential votes.
The researchers say this work underscores the need for more rigorous methods of evaluating model rankings. While they did not focus on mitigation in this study, they offer suggestions that could improve the robustness of these platforms, such as gathering more detailed feedback to build the rankings.
The study also offers a word of warning to users who may rely on rankings when making decisions about LLMs that could have far-reaching and costly impacts on a business or organization.
“We were surprised that these ranking platforms were so sensitive to this problem. If it turns out the top-ranked LLM depends on only two or three pieces of user feedback out of tens of thousands, then one can’t assume the top-ranked LLM is going to consistently outperform all the other LLMs when it’s deployed,” says Tamara Broderick, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS); a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society; an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author of the study.
She is joined on the paper by lead authors and EECS graduate students Jenny Huang and Yunyi Shen, as well as Dennis Wei, a senior research scientist at IBM Research. The study will be presented at the International Conference on Learning Representations.
Dropping data
While there are many types of LLM ranking platforms, the most popular versions ask users to submit a prompt to two models and pick which LLM gives the better response.
The platforms aggregate the results of these matchups to produce rankings that show which LLM performed best on certain tasks, such as coding or visual understanding.
By selecting a top-performing LLM, a user likely expects that model’s top ranking to generalize, meaning it should outperform other models on the user’s similar, but not identical, application with a set of new data.
The MIT researchers previously studied generalization in areas like statistics and economics. That work revealed certain cases where dropping a small percentage of data can change a model’s results, indicating that those studies’ conclusions might not hold beyond their narrow setting.
The researchers wanted to see if the same analysis could be applied to LLM ranking platforms.
“At the end of the day, a user wants to know whether they are choosing the best LLM. If only a few prompts are driving this ranking, that suggests the ranking might not be the end-all-be-all,” Broderick says.
But it would be impossible to test the data-dropping phenomenon manually. For instance, one ranking they evaluated had more than 57,000 votes. Testing a data drop of 0.1 percent means removing each subset of 57 votes out of the 57,000 (there are more than 10^194 such subsets) and then recalculating the ranking.
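To get a sense of the scale (this is just a restatement of the combinatorics above, not a calculation from the paper), the number of possible 57-vote subsets can be computed directly:

```python
from math import comb

total_votes = 57_000   # roughly the number of crowdsourced votes in one ranking
drop_size = 57         # a 0.1 percent drop

n_subsets = comb(total_votes, drop_size)
# The exact count has 195 digits, i.e., more than 10^194 possible subsets, so
# brute-force re-ranking after every possible drop is out of the question.
print(f"subsets to check by brute force: about 10^{len(str(n_subsets)) - 1}")
```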
Instead, the researchers developed an efficient approximation method, based on their prior work, and adapted it to fit LLM ranking systems.
“While we have theory to prove the approximation works under certain assumptions, the user doesn’t have to trust that. Our method tells the user the problematic data points at the end, so they can just drop those data points, re-run the evaluation, and check whether they get a change in the rankings,” she says.
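To make that drop-and-recheck loop concrete, here is a minimal sketch (not the researchers’ code): it assumes votes arrive as (winner, loser) pairs, turns them into a ranking with a simple Bradley-Terry fit, one common way of scoring pairwise matchups, and compares the top model before and after removing a handful of flagged votes. The model names, vote counts, and flagged indices are invented for illustration; in practice, the flagged votes would come from the approximation method described above.

```python
from collections import defaultdict

def bradley_terry_ranking(votes, iters=2000):
    """Rank models from (winner, loser) votes using the standard
    minorization-maximization updates for the Bradley-Terry model."""
    models = {m for pair in votes for m in pair}
    wins = defaultdict(float)        # total wins per model
    matchups = defaultdict(float)    # number of comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1.0
        matchups[frozenset((winner, loser))] += 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                count / (strength[i] + strength[j])
                for pair, count in matchups.items() if i in pair
                for j in pair if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return sorted(models, key=strength.get, reverse=True)

# Hypothetical votes: model_a edges out model_b head-to-head, and both have
# identical records against model_c.
votes = (
    [("model_a", "model_b")] * 40 + [("model_b", "model_a")] * 39 +
    [("model_a", "model_c")] * 30 + [("model_c", "model_a")] * 20 +
    [("model_b", "model_c")] * 30 + [("model_c", "model_b")] * 20
)

# Indices of votes flagged as most influential (chosen by hand here; in the
# paper's workflow they would come from the approximation method).
flagged = {0, 1}

full_ranking = bradley_terry_ranking(votes)
reduced_ranking = bradley_terry_ranking(
    [v for idx, v in enumerate(votes) if idx not in flagged]
)

print("top model, all votes:            ", full_ranking[0])
print("top model, flagged votes dropped:", reduced_ranking[0])
```

Running this sketch prints model_a as the top model on the full set of votes but model_b once the two flagged votes are dropped, mirroring how removing a tiny number of votes can flip a leaderboard.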
Surprisingly sensitive
When the researchers applied their technique to popular ranking platforms, they were surprised to see how few data points they needed to drop to cause significant changes in the top LLMs. In one instance, removing just two votes out of more than 57,000, which is 0.0035 percent, changed which model was top-ranked.
A different ranking platform, which uses expert annotators and higher-quality prompts, was more robust. Here, removing 83 out of 2,575 evaluations (about 3 percent) flipped the top models.
Their examination revealed that many influential votes could have been the result of user error. In some cases, it appeared there was a clear answer as to which LLM performed better, but the user chose the other model instead, Broderick says.
“We can never know what was in the user’s mind at the time, but perhaps they mis-clicked or weren’t paying attention, or they honestly didn’t know which one was better. The big takeaway here is that you don’t want noise, user error, or some outlier determining which is the top-ranked LLM,” she adds.
The researchers suggest that gathering additional feedback from users, such as confidence levels in each vote, would provide richer information that could help mitigate this problem. Ranking platforms could also use human moderators to review crowdsourced responses.
For their part, the researchers want to continue exploring generalization in other contexts, while also developing better approximation methods that can capture more examples of non-robustness.
“Broderick and her students’ work shows how one can get valid estimates of the influence of specific data on downstream processes, despite the intractability of exhaustive calculations given the scale of modern machine-learning models and datasets,” says Jessica Hullman, the Ginni Rometty Professor of Computer Science at Northwestern University, who was not involved with this work. “The new work provides a glimpse into the strong data dependencies in routinely applied, but also very fragile, methods for aggregating human preferences and using them to update a model. Seeing how few preferences could really change the behavior of a fine-tuned model could encourage more thoughtful methods for collecting these data.”
This research is funded, in part, by the Office of Naval Research, the MIT-IBM Watson AI Lab, the National Science Foundation, Amazon, and a CSAIL seed award.
