Very Large Language Models and The best way to Evaluate Them

Large language models can now be evaluated on zero-shot classification tasks with Evaluation on the Hub!

Zero-shot evaluation is a preferred way for researchers to measure the performance of enormous language models, as they’ve been shown to learn capabilities during training without explicitly being shown labeled examples. The Inverse Scaling Prize is an example of a recent community effort to conduct large-scale zero-shot evaluation across model sizes and families to find tasks on which larger models may perform worse than their smaller counterparts.

Enabling zero-shot evaluation of language models on the Hub

Evaluation on the Hub helps you evaluate any model on the Hub without writing code, and is powered by AutoTrain. Now, any causal language model on the Hub may be evaluated in a zero-shot fashion. Zero-shot evaluation measures the likelihood of a trained model producing a given set of tokens and doesn’t require any labelled training data, which allows researchers to skip expensive labelling efforts.

We’ve upgraded the AutoTrain infrastructure for this project so that giant models may be evaluated totally free 🤯! It’s expensive and time-consuming for users to determine write custom code to guage big models on GPUs. For instance, a language model with 66 billion parameters may take 35 minutes simply to load and compile, making evaluation of enormous models accessible only to those with expensive infrastructure and extensive technical experience. With these changes, evaluating a model with 66-billion parameters on a zero-shot classification task with 2000 sentence-length examples takes 3.5 hours and may be done by anyone locally. Evaluation on the Hub currently supports evaluating models as much as 66 billion parameters, and support for larger models is to return.

The zero-shot text classification task takes in a dataset containing a set of prompts and possible completions. Under the hood, the completions are concatenated with the prompt and the log-probabilities for every token are summed, then normalized and compared with the right completion to report accuracy of the duty.

On this blog post, we’ll use the zero-shot text classification task to guage various OPT models on WinoBias, a coreference task measuring gender bias related to occupations. WinoBias measures whether a model is more prone to pick a stereotypical pronoun to fill in a sentence mentioning an occupation, and observe that the outcomes suggest an inverse scaling trend with respect to model size.

Case study: Zero-shot evaluation on the WinoBias task

The WinoBias dataset has been formatted as a zero-shot task where classification options are the completions. Each completion differs by the pronoun, and the goal corresponds to the anti-stereotypical completion for the occupation (e.g. “developer” is stereotypically a male-dominated occupation, so “she” can be the anti-stereotypical pronoun). See here for an example:

Next, we will select this newly-uploaded dataset within the Evaluation on the Hub interface using the text_zero_shot_classification task, select the models we’d like to guage, and submit our evaluation jobs! When the job has been accomplished, you’ll be notified by email that the autoevaluator bot has opened a brand new pull request with the outcomes on the model’s Hub repository.

Plotting the outcomes from the WinoBias task, we discover that smaller models usually tend to select the anti-stereotypical pronoun for a sentence, while larger models usually tend to learn stereotypical associations between gender and occupation in text. This corroborates results from other benchmarks (e.g. BIG-Bench) which show that larger, more capable models usually tend to be biased with regard to gender, race, ethnicity, and nationality, and prior work which shows that larger models usually tend to generate toxic text.

Enabling higher research tools for everybody

Open science has made great strides with community-driven development of tools just like the Language Model Evaluation Harness by EleutherAI and the BIG-bench project, which make it straightforward for researchers to know the behaviour of state-of-the-art models.

Evaluation on the Hub is a low-code tool which makes it easy to match the zero-shot performance of a set of models along an axis resembling FLOPS or model size, and to match the performance of a set of models trained on a particular corpora against a unique set of models. The zero-shot text classification task is incredibly flexible—any dataset that may be permuted right into a Winograd schema where examples to be compared only differ by a number of words may be used with this task and evaluated on many models directly. Our goal is to make it easy to upload a brand new dataset for evaluation and enable researchers to simply benchmark many models on it.

An example research query which may be addressed with tools like that is the inverse scaling problem: while larger models are generally more capable at nearly all of language tasks, there are tasks where larger models perform worse. The Inverse Scaling Prize is a contest which challenges researchers to construct tasks where larger models perform worse than their smaller counterparts. We encourage you to try zero-shot evaluation on models of all sizes together with your own tasks! Should you find an interesting trend along model sizes, consider submitting your findings to round 2 of the Inverse Scaling Prize.

Send us feedback!

At Hugging Face, we’re excited to proceed democratizing access to state-of-the-art machine learning models, and that features developing tools to make it easy for everybody to guage and probe their behavior. We’ve previously written about how necessary it’s to standardize model evaluation methods to be consistent and reproducible, and to make tools for evaluation accessible to everyone. Future plans for Evaluation on the Hub include supporting zero-shot evaluation for language tasks which could not lend themselves to the format of concatenating completions to prompts, and adding support for even larger models.

Probably the most useful things you possibly can contribute as a part of the community is to send us feedback! We’d love to listen to from you on top priorities for model evaluation. Tell us your feedback and have requests by posting on the Evaluation on the Hub Community tab, or the forums!

Source link

Very Large Language Models and The best way to Evaluate Them

Enabling zero-shot evaluation of language models on the Hub

Case study: Zero-shot evaluation on the WinoBias task

Enabling higher research tools for everybody

Send us feedback!

What are your thoughts on this topic?
Let us know in the comments below.

Share this article

Recent posts

Image Classification with AutoTrain

Claude Opus 4.6 Anthropic

Japanese Stable Diffusion

the Digital Object Identifier to Datasets and Models

Optimization story: Bloom inference

Very Large Language Models and The best way to Evaluate Them

Enabling zero-shot evaluation of language models on the Hub

Case study: Zero-shot evaluation on the WinoBias task

Enabling higher research tools for everybody

Send us feedback!

What are your thoughts on this topic? Let us know in the comments below.

Share this article

Recent posts

What are your thoughts on this topic?
Let us know in the comments below.