Contact: hyen@cs.princeton.edu
Paper: https://arxiv.org/abs/2410.02694
Website: https://princeton-nlp.github.io/HELMET
Code & Data: https://github.com/princeton-nlp/HELMET
Since we first released HELMET last October, there has been more development on long-context language models than ever before, and we’re thrilled to see HELMET adopted by the community, such as by Microsoft’s Phi-4 and AI21’s Jamba 1.6.
After the initial release, we have added more models to our evaluation suite and conducted additional analyses. We’re excited to share our recent results and to present HELMET at ICLR 2025!
In this blog post, we describe the development of HELMET, our key findings, and how practitioners can use HELMET to differentiate between LCLMs in future research and applications.
Finally, we conclude with a quickstart guide for using HELMET with HuggingFace.
Evaluating long-context language models is difficult but vital
From summarizing numerous legal documents to learning new tasks on the fly, long-context language models (LCLMs) have immense potential to change the way we use and interact with language models.
Language models have long been limited by their context windows, typically around 2K to 8K tokens (e.g., ChatGPT, Llama-2/3).
Recently, model developers have been continuously increasing the context windows of their models, with recent models like GPT-4o, Claude-3, and Gemini-1.5 supporting context windows of up to hundreds of thousands of tokens.

However, with longer context windows, previous natural language benchmarks (e.g., SCROLLS) are no longer suitable for evaluating LCLMs.
Consequently, perplexity and synthetic tasks (e.g., needle-in-a-haystack) have emerged as the most popular ways to evaluate new LCLMs, but they often do not reflect real-world performance.
Model developers may evaluate on other arbitrary datasets, which complicates model comparisons.
Moreover, existing benchmarks for LCLMs may show confusing and counterintuitive results, making it difficult to understand the strengths and weaknesses of different models (Figure 1).
In this work, we propose HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark for evaluating LCLMs that improves upon existing benchmarks in several ways: diversity, controllability, and reliability.
We evaluate 59 recent LCLMs and find that it is crucial to evaluate models across diverse applications to understand their capabilities, and that frontier LCLMs are still limited on complex tasks.
Existing evaluations rely too heavily on synthetic tasks
With the development of LCLMs across both industry and the open-source community, it is crucial to have a reliable method for evaluating and comparing these models. However, current models are often evaluated on different benchmarks (Table 1).

A common practice for evaluating long-context language models is to use perplexity or synthetic tasks, such as needle-in-a-haystack (NIAH).
However, recent works have shown that perplexity does not correlate well with downstream performance (Fang et al., 2024).
In Figure 2, we show that simple synthetic tasks like NIAH do not correlate with real-world performance, while more complex synthetic tasks achieve higher correlation with real-world tasks.

Among the existing benchmarks with realistic applications, such as ZeroSCROLLS (Shaham et al., 2023), LongBench (Bai et al., 2024), and InfiniteBench (Zhang et al., 2024), there are still crucial limitations:
- Insufficient coverage of downstream tasks: often focused on specific domains
- Inadequate lengths for testing frontier LCLMs: older QA datasets are often limited to <32K tokens (e.g., QASPER, QuALITY)
- Unreliable metrics: N-gram matching metrics like ROUGE are noisy—they don’t correlate with human judgments (Goyal et al., 2023) and don’t distinguish between models
- Incompatibility with base models: require instruction tuning, which means they cannot be used for base model development
Thus, we propose HELMET to address these limitations and provide a comprehensive evaluation of LCLMs.
Crafting diverse, controllable, and reliable evaluation for LCLMs
We design HELMET with the following desiderata:
- Diverse coverage of downstream tasks
- Controllable length and complexity
- Reliable evaluation for base and instruction-tuned models
Table 2 shows an overview of the benchmark.
In our experiments, we evaluate at input lengths from 8K to 128K tokens, but HELMET can be easily extended to even longer context lengths.

Key improvements over existing benchmarks
Diverse coverage: HELMET includes a diverse set of tasks, such as retrieval-augmented generation with real retrieval passages, generation with citations, and summarization. We carefully select datasets with naturally long contexts that reflect real-world applications. These datasets are complemented with reliable evaluation settings, such as model-based evaluations and human studies.
Controllable length and difficulty: A crucial dimension to consider when evaluating LCLMs is the input length, as longer inputs can provide more information while challenging the model’s ability to process noisy contexts. In our tasks, we can control the input length by changing the number of retrieved passages (RAG, Cite, Re-rank), the number of demonstrations (ICL), or the length of the input document (LongQA, Summ). Although LongQA and Summ cannot be easily extended to longer contexts, we intentionally selected datasets with natural documents far longer than 100K tokens, so that they can still be used to evaluate frontier LCLMs. (See the config sketch below for how these knobs are exposed.)
Reliable evaluation: Many existing benchmarks still use n-gram-based metrics, such as ROUGE, despite their poor correlation with human judgments (Goyal et al., 2023). We instead employ model-based evaluations that show better distinguishability between models and across input lengths (Figure 3). Furthermore, our human studies show that our metrics have high agreement with human judgments.

Robust prompting: Existing long-context benchmarks often require models to follow instructions, yet much model development revolves around base models, which would otherwise have to rely on synthetic tasks or perplexity for evaluation. Thus, we support base models on a subset of our tasks via in-context learning examples. This substantially improves the performance of base models and is more reflective of real-world applications.
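To make these knobs concrete, here is a minimal config sketch assembled from the keys that appear in the quickstart configs later in this post; the specific values, and the use of use_chat_template to switch between base and instruction-tuned models, are illustrative assumptions, so consult the configs/ directory in the repo for ready-made files.
# Illustrative sketch only; see configs/ in the repo for complete, tested configs
input_max_length: 65536                 # evaluated context length (we sweep 8K to 128K)
datasets: kilt_nq                       # which HELMET task/dataset to run
shots: 2                                # number of in-context demonstrations
generation_max_length: 20
max_test_samples: 100
use_chat_template: false                # assumption: false for base models, true for instruction-tuned models
stop_new_line: true
model_name_or_path: meta-llama/Llama-3.1-8B   # example model; any local path or HF repo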
LCLMs still have a long way to go on real-world tasks
Our experiments and analyses cover a comprehensive set of 59 LCLMs. To our knowledge, this is the most thorough and controlled comparison of long-context models on diverse applications. These models include both leading proprietary and open-source models, and we also consider models with different architectures (e.g., full-attention transformers, hybrid architectures) and positional extrapolation techniques. In this section, we highlight a few key findings from our experiments.
Diverse evaluation is required for assessing long-context abilities
Long-context benchmarks are often constructed with specific applications in mind, such as summarization or question answering, which limits our understanding of LCLMs in a broader context. We examine model performance over a wide range of real tasks and find that the different categories do not always correlate with one another (Figure 4).

While some tasks correlate moderately with one another (e.g., RAG and MS-MARCO) due to their shared retrieval-based nature, others show little correlation (e.g., Summ and Cite). Notably, ICL has the lowest correlation with the other tasks, which suggests that it is a unique task requiring different capabilities from the model. Therefore, model developers should evaluate across these distinct axes to draw a more holistic picture of a model’s capabilities.
Models degrade with increasing lengths and task complexity
We present the results of frontier proprietary models as well as a few open-source models on HELMET.
Additional results can be found in the paper and on the website.

First, we observe that open-source models lag behind closed-source models on complex tasks. Although the gap appears small on simpler tasks, such as Recall, it widens on more complex ones, such as Cite.
Moreover, performance degradation with increasing length is category-dependent. Even the most advanced models, such as GPT-4o and Gemini, experience a significant decrease in performance on tasks like re-ranking. This variation in performance cannot be observed by simply looking at synthetic task performance.
Finally, there is no clear winner across all categories, which calls for evaluation across different axes. Additional analyses, such as the performance of different positional extrapolation methods and the lost-in-the-middle phenomenon, can be found in the paper.
Using HELMET for future developments
How to run HELMET
Using HELMET is straightforward! Simply clone our GitHub repository, and everything is ready to go after setting up the environment.
We provide many different ways to load models, which can be configured in the config file:
- using HuggingFace’s transformers library
- using HuggingFace’s TGI to launch a model endpoint on your machine
- using HuggingFace’s Inference Endpoints to launch a remote model endpoint
- using vllm to launch a model endpoint on your machine (note: you can also launch a vllm endpoint on Intel Gaudi accelerators)
- using model providers’ APIs
Option 1. Using HuggingFace’s transformers library
Just use the config yamls in our repo and run the evaluations with:
python eval.py --config configs/rag.yaml --model_name_or_path <model_name>
Behind the scenes, HuggingFace’s transformers library is used, and both local and remote models are automatically supported.
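For example, to evaluate Llama-3.1-8B-Instruct (the model used in the TGI example below) on the RAG config, the command looks like this; the model ID is only an example and can be any local path or HuggingFace repo:
python eval.py --config configs/rag.yaml --model_name_or_path meta-llama/Llama-3.1-8B-Instruct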
Option 2. Using HuggingFace’s TGI
First, follow the instructions on the TGI GitHub to launch a model endpoint. Then, in your config file, specify the endpoint URL. For example, you can have a config.yaml like the one below:
input_max_length: 131072
datasets: kilt_nq
generation_max_length: 20
test_files: data/kilt/nq-dev-multikilt_1000_k1000_dep6.jsonl
demo_files: data/kilt/nq-train-multikilt_1000_k3_dep6.jsonl
use_chat_template: true
max_test_samples: 100
shots: 2
stop_new_line: true
model_name_or_path: tgi:meta-llama/Llama-3.1-8B-Instruct # must add "tgi:" prefix
use_tgi_serving: true # add this line in your config
Then use the command below to run the benchmark:
export LLM_ENDPOINT=<endpoint_url>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT
Option 3. Using HuggingFace’s Inference Endpoints
First, set up an endpoint by following the instructions here. Get the endpoint URL and your API key. Then use the same config yaml shown in Option 2 above, and run the command below:
export LLM_ENDPOINT=<endpoint_url>
export API_KEY=<api_key>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT --api_key $API_KEY
Option 4. Using vllm
You can launch a model endpoint with vllm on your system, including on Intel Gaudi2 and Gaudi3 accelerators. See the instructions here on how to run HELMET with vllm on Intel Gaudi accelerators.
You can use the same example config.yaml as in Option 2, except for the two changed lines below:
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct # no prefix needed
use_vllm_serving: true # use vllm instead of tgi
Then use the command below to run the benchmark:
export LLM_ENDPOINT=<endpoint_url>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT
Option 5. Using Model Providers’ APIs
We support APIs from OpenAI, Anthropic, Google, and TogetherAI.
Please refer to the instructions in our repo.
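As a rough sketch of what an API-based run looks like, you point model_name_or_path at the provider’s model ID and supply your key; the model name below and the way the key is passed are illustrative assumptions, so follow the repo instructions for the exact format.
model_name_or_path: gpt-4o-2024-05-13   # hypothetical provider model ID
use_chat_template: true
Then run the evaluation, passing your key the same way as in Option 3:
export API_KEY=<provider_api_key>
python eval.py --config configs/config.yaml --api_key $API_KEY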
Faster development
We recommend using the Recall and RAG tasks for fast iterations during model development.
These tasks strike a good balance between fast evaluation and correlation with other realistic tasks.
You can easily run these evaluations with just:
python eval.py --config configs/rag.yaml --model_name_or_path <model_name>
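To also cover the Recall tasks in the same quick pass, the pattern is identical; the recall config filename below is an assumption, so check the configs/ directory for the exact name:
python eval.py --config configs/rag.yaml --model_name_or_path <model_name>
python eval.py --config configs/recall.yaml --model_name_or_path <model_name>   # config name assumed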
Quick comparison with existing models
It is often expensive to run all the baselines when evaluating LCLMs, especially at long contexts, given the computational and memory costs.
For example, running HELMET at all lengths on a 70B model requires a node with 8 × 80GB GPUs for hundreds of GPU hours, which can be costly.
By evaluating on HELMET, researchers can directly compare their models to existing ones simply by referencing our results, which cover 59 models of different sizes and architectures.
You can find the leaderboard on our website.
Looking ahead
HELMET is a step towards a more comprehensive evaluation of long-context language models, but there are still many more exciting applications of LCLMs.
For example, we recently released LongProc, a benchmark for evaluating LCLMs on long-form generation and procedure following, which are critical for developing reasoning models that generate tens of thousands of tokens in their reasoning steps.
Although summarization tasks have long outputs (up to 1K tokens), LongProc focuses on even longer outputs, up to 8K tokens.
Like HELMET, LongProc is also designed with reliable evaluation settings and diverse tasks.
We are working on integrating LongProc into HELMET’s evaluation suite, and we hope this will provide an even more comprehensive evaluation of LCLMs on long-form tasks.
Acknowledgements
We thank Mengzhou Xia, Howard Chen, Xi Ye, Yinghui He, Lucy He, Alexander Wettig, Sadhika Malladi, Adithya Bhaskar, Joie Zhang, and other members of the Princeton Language and Intelligence (PLI) group for their helpful feedback.
This work is gratefully supported by the Microsoft Accelerate Foundation Models Research (AFMR) program for Azure OpenAI credits and by an Intel grant.
Citation
If you find HELMET useful, please consider citing our paper:
@inproceedings{yen2025helmet,
title={HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly},
author={Howard Yen and Tianyu Gao and Minmin Hou and Ke Ding and Daniel Fleischer and Peter Izsak and Moshe Wasserblat and Danqi Chen},
year={2025},
booktitle={International Conference on Learning Representations (ICLR)},
}
