Contact: hyen@cs.princeton.edu
Paper: https://arxiv.org/abs/2410.02694
Website: https://princeton-nlp.github.io/HELMET
Code & Data: https://github.com/princeton-nlp/HELMET
Since we first released HELMET last October, there has been more development on long-context language models than ever before, and we’re thrilled to see HELMET adopted by the community, such as by Microsoft’s Phi-4 and AI21’s Jamba 1.6.
After the initial release, we have added more models to our evaluation suite and conducted additional analyses. We’re excited to share our recent results and to present HELMET at ICLR 2025!
In this blog post, we describe the development of HELMET, our key findings, and how practitioners can use HELMET to differentiate between LCLMs in future research and applications.
Finally, we conclude with a quickstart guide for using HELMET with HuggingFace.
Evaluating long-context language models is difficult but vital
From summarizing numerous legal documents to learning new tasks on the fly, long-context language models (LCLMs) have immense potential to change the way we use and interact with language models.
Language models have long been limited by their context windows, typically around 2K to 8K tokens (e.g., ChatGPT, Llama-2/3).
Recently, model developers have been continuously increasing the context windows of their models, with recent models like GPT-4o, Claude-3, and Gemini-1.5 supporting context windows of up to hundreds of thousands of tokens.

However, with longer context windows, previous natural language benchmarks (e.g., SCROLLS) are no longer suitable for evaluating LCLMs.
Consequently, perplexity and synthetic tasks (e.g., needle-in-a-haystack) have emerged as the most popular ways to evaluate new LCLMs, but they often do not reflect real-world performance.
Model developers may evaluate on other arbitrary datasets, which complicates model comparisons.
Moreover, existing benchmarks for LCLMs may show confusing and counterintuitive results, making it difficult to understand the strengths and weaknesses of different models (Figure 1).
In this work, we propose HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark for evaluating LCLMs that improves upon existing benchmarks in several ways: diversity, controllability, and reliability.
We evaluate 59 recent LCLMs and find that it is crucial to evaluate models across diverse applications to understand their capabilities, and that frontier LCLMs are still limited on complex tasks.
Existing evaluations rely too heavily on synthetic tasks
With the development of LCLMs across both industry and the open-source community, it is crucial to have a reliable method for evaluating and comparing these models. However, current models are often evaluated on different benchmarks (Table 1).

A common practice for evaluating long-context language models is to use perplexity or synthetic tasks, such as needle-in-a-haystack (NIAH).
However, recent works have shown that perplexity does not correlate well with downstream performance (Fang et al., 2024).
In Figure 2, we show that simple synthetic tasks like NIAH do not correlate with real-world performance, while more complex synthetic tasks achieve higher correlation with real-world tasks.

Among the existing benchmarks with realistic applications, such as ZeroSCROLLS (Shaham et al., 2023), LongBench (Bai et al., 2024), and InfiniteBench (Zhang et al., 2024), there are still crucial limitations:
- Insufficient coverage of downstream tasks: often focused on specific domains
- Inadequate lengths for testing frontier LCLMs: older QA datasets are often limited to <32K tokens (e.g., QASPER, QuALITY)
- Unreliable metrics: N-gram matching metrics like ROUGE are noisy—they don’t correlate with human judgments (Goyal et al., 2023) and don’t distinguish between models
- Incompatibility with base models: require instruction tuning, which means they cannot be used for base model development
Thus, we propose HELMET to address these limitations and provide a comprehensive evaluation of LCLMs.
Crafting diverse, controllable, and reliable evaluation for LCLMs
We design HELMET with the following desiderata:
- Diverse coverage of downstream tasks
- Controllable length and complexity
- Reliable evaluation for base and instruction-tuned models
Table 2 shows an overview of the benchmark.
In our experiments, we evaluate at input lengths from 8K to 128K tokens, but HELMET can be easily extended to even longer context lengths.

Key improvements over existing benchmarks
Diverse coverage: HELMET includes a diverse set of tasks, such as retrieval-augmented generation with real retrieval passages, generation with citations, and summarization. We carefully select datasets with naturally long contexts that reflect real-world applications. These datasets are complemented with reliable evaluation settings, such as model-based evaluations and human studies.
Controllable length and difficulty: A crucial dimension to consider when evaluating LCLMs is the input length, as longer inputs can provide more information while challenging the model’s ability to process noisy contexts. In our tasks, we can control the input length by changing the number of retrieved passages (RAG, Cite, Re-rank), the number of demonstrations (ICL), or the length of the input document (LongQA, Summ). Although LongQA and Summ cannot be easily extended to longer contexts, we intentionally selected datasets with natural documents far longer than 100K tokens, so that they can still be used to evaluate frontier LCLMs. (See the config sketch below for how these knobs are exposed.)
Reliable evaluation: Many existing benchmarks still use n-gram-based metrics, such as ROUGE, despite their poor correlation with human judgments (Goyal et al., 2023). We instead employ model-based evaluations that show better distinguishability between models and across input lengths (Figure 3). Furthermore, our human studies show that our metrics have high agreement with human judgments.

Robust prompting: Existing long-context benchmarks often require models to follow instructions, yet much model development revolves around base models, which would otherwise have to rely on synthetic tasks or perplexity for evaluation. Thus, we support base models on a subset of our tasks via in-context learning examples. This substantially improves the performance of base models and is more reflective of real-world applications.
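To make these knobs concrete, here is a minimal config sketch assembled from the keys that appear in the quickstart configs later in this post; the specific values, and the use of use_chat_template to switch between base and instruction-tuned models, are illustrative assumptions, so consult the configs/ directory in the repo for ready-made files.
# Illustrative sketch only; see configs/ in the repo for complete, tested configs
input_max_length: 65536                 # evaluated context length (we sweep 8K to 128K)
datasets: kilt_nq                       # which HELMET task/dataset to run
shots: 2                                # number of in-context demonstrations
generation_max_length: 20
max_test_samples: 100
use_chat_template: false                # assumption: false for base models, true for instruction-tuned models
stop_new_line: true
model_name_or_path: meta-llama/Llama-3.1-8B   # example model; any local path or HF repo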
LCLMs still have a long way to go on real-world tasks
Our experiments and analyses cover a comprehensive set of 59 LCLMs. To our knowledge, this is the most thorough and controlled comparison of long-context models on diverse applications. These models include both leading proprietary and open-source models, and we also consider models with different architectures (e.g., full-attention transformers, hybrid architectures) and positional extrapolation techniques. In this section, we highlight a few key findings from our experiments.
Diverse evaluation is required for assessing long-context abilities
Long-context benchmarks are often constructed with specific applications in mind, such as summarization or question answering, which limits our understanding of LCLMs in a broader context. We examine model performance over a wide range of real tasks and find that the different categories do not always correlate with one another (Figure 4).

While some tasks correlate moderately with one another (e.g., RAG and MS-MARCO) due to their shared retrieval-based nature, others show little correlation (e.g., Summ and Cite). Notably, ICL has the lowest correlation with the other tasks, which suggests that it is a unique task requiring different capabilities from the model. Therefore, model developers should evaluate across these distinct axes to draw a more holistic picture of a model’s capabilities.
Models degrade with increasing lengths and task complexity
We present the results of frontier proprietary models as well as a few open-source models on HELMET.
Additional results can be found in the paper and on the website.

First, we observe that open-source models lag behind closed-source models on complex tasks. Although the gap appears small on simpler tasks, such as Recall, it widens on more complex ones, such as Cite.
Moreover, performance degradation with increasing length is category-dependent. Even the most advanced models, such as GPT-4o and Gemini, experience a significant decrease in performance on tasks like re-ranking. This variation in performance cannot be observed by simply looking at synthetic task performance.
Finally, there is no clear winner across all categories, which calls for evaluation across different axes. Additional analyses, such as the performance of different positional extrapolation methods and the lost-in-the-middle phenomenon, can be found in the paper.
Using HELMET for future developments
How to run HELMET
Using HELMET is straightforward! Simply clone our GitHub repository, and everything is ready to go after setting up the environment.
We provide many different ways to load models, which can be configured in the config file:
- using HuggingFace’s transformers library
- using HuggingFace’s TGI to launch a model endpoint on your machine
- using HuggingFace’s Inference Endpoints to launch a remote model endpoint
- using vllm to launch a model endpoint on your machine (note: you can also launch a vllm endpoint on Intel Gaudi accelerators)
- using model providers’ APIs
Option 1. Using HuggingFace’s transformers library
Just use the config yamls in our repo and run the evaluations with:
python eval.py --config configs/rag.yaml --model_name_or_path <model_name>
Behind the scenes, HuggingFace’s transformers library is used, and both local and remote models are automatically supported.
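For example, to evaluate Llama-3.1-8B-Instruct (the model used in the TGI example below) on the RAG config, the command looks like this; the model ID is only an example and can be any local path or HuggingFace repo:
python eval.py --config configs/rag.yaml --model_name_or_path meta-llama/Llama-3.1-8B-Instruct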
Option 2. Using HuggingFace’s TGI
First, follow the instructions on the TGI GitHub to launch a model endpoint. Then, in your config file, specify the endpoint URL. For example, you can have a config.yaml like the one below:
input_max_length: 131072
datasets: kilt_nq
generation_max_length: 20
test_files: data/kilt/nq-dev-multikilt_1000_k1000_dep6.jsonl
demo_files: data/kilt/nq-train-multikilt_1000_k3_dep6.jsonl
use_chat_template: true
max_test_samples: 100
shots: 2
stop_new_line: true
model_name_or_path: tgi:meta-llama/Llama-3.1-8B-Instruct # must add "tgi:" prefix
use_tgi_serving: true # add this line in your config
Then use the command below to run the benchmark:
export LLM_ENDPOINT=<endpoint_url>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT
Option 3. Using HuggingFace’s Inference Endpoints
First, set up an endpoint by following the instructions here. Get the endpoint URL and your API key. Then use the same config yaml shown in Option 2 above, and run the command below:
export LLM_ENDPOINT=<endpoint_url>
export API_KEY=<api_key>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT --api_key $API_KEY
Option 4. Using vllm
You can launch a model endpoint with vllm on your system, including on Intel Gaudi2 and Gaudi3 accelerators. See the instructions here on how to run HELMET with vllm on Intel Gaudi accelerators.
You can use the same example config.yaml as in Option 2, except for the two changed lines below:
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct # no prefix needed
use_vllm_serving: true # use vllm instead of tgi
Then use the command below to run the benchmark:
export LLM_ENDPOINT=<endpoint_url>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT
Option 5. Using Model Providers’ APIs
We support APIs from OpenAI, Anthropic, Google, and TogetherAI.
Please refer to the instructions in our repo.
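As a rough sketch of what an API-based run looks like, you point model_name_or_path at the provider’s model ID and supply your key; the model name below and the way the key is passed are illustrative assumptions, so follow the repo instructions for the exact format.
model_name_or_path: gpt-4o-2024-05-13   # hypothetical provider model ID
use_chat_template: true
Then run the evaluation, passing your key the same way as in Option 3:
export API_KEY=<provider_api_key>
python eval.py --config configs/config.yaml --api_key $API_KEY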
Faster development
We recommend using the Recall and RAG tasks for fast iterations during model development.
These tasks strike a good balance between fast evaluation and correlation with other realistic tasks.
You can easily run these evaluations with just:
python eval.py --config configs/rag.yaml --model_name_or_path <model_name>
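To also cover the Recall tasks in the same quick pass, the pattern is identical; the recall config filename below is an assumption, so check the configs/ directory for the exact name:
python eval.py --config configs/rag.yaml --model_name_or_path <model_name>
python eval.py --config configs/recall.yaml --model_name_or_path <model_name>   # config name assumed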
Quick comparison with existing models
It is often expensive to run all the baselines when evaluating LCLMs, especially at long contexts, given the computational and memory costs.
For example, running HELMET at all lengths on a 70B model requires a node with 8 × 80GB GPUs for hundreds of GPU hours, which can be costly.
By evaluating on HELMET, researchers can directly compare their models to existing ones simply by referencing our results, which cover 59 models of different sizes and architectures.
You can find the leaderboard on our website.
Looking ahead
HELMET is a step towards a more comprehensive evaluation of long-context language models, but there are still many more exciting applications of LCLMs.
For example, we recently released LongProc, a benchmark for evaluating LCLMs on long-form generation and procedure following, which are critical for developing reasoning models that generate tens of thousands of tokens in their reasoning steps.
Although summarization tasks have long outputs (up to 1K tokens), LongProc focuses on even longer outputs, up to 8K tokens.
Like HELMET, LongProc is also designed with reliable evaluation settings and diverse tasks.
We are working on integrating LongProc into HELMET’s evaluation suite, and we hope this will provide an even more comprehensive evaluation of LCLMs on long-form tasks.
Acknowledgements
We thank Mengzhou Xia, Howard Chen, Xi Ye, Yinghui He, Lucy He, Alexander Wettig, Sadhika Malladi, Adithya Bhaskar, Joie Zhang, and other members of the Princeton Language and Intelligence (PLI) group for their helpful feedback.
This work is gratefully supported by the Microsoft Accelerate Foundation Models Research (AFMR) program for Azure OpenAI credits and by an Intel grant.
Citation
If you find HELMET useful, please consider citing our paper:
@inproceedings{yen2025helmet,
title={HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly},
author={Howard Yen and Tianyu Gao and Minmin Hou and Ke Ding and Daniel Fleischer and Peter Izsak and Moshe Wasserblat and Danqi Chen},
year={2025},
booktitle={International Conference on Learning Representations (ICLR)},
}
