Recently, the Leaderboards and Evals research team at Hugging Face ran small experiments which highlighted how fickle evaluation can be. For a given task, results are extremely sensitive to minuscule changes in prompt format! However, this is not what we want: a model prompted with the same amount of information as input should output similar results.
We discussed this with our friends at Dottxt, who had an idea: what if there was a way to increase consistency across prompt formats?
So, let’s dig in!
Context: Evaluation Sensitivity to Format Changes
It has become increasingly clear that LLM benchmark performance is closely, and somewhat surprisingly, dependent on the format of the prompt itself, even though a number of methods have been introduced over the years to reduce prompt-related variance. For example, when we evaluate models in few-shot, we provide format examples to the model to force a specific pattern in output; when we compare the log-likelihood of plausible answers instead of allowing free-form generation, we attempt to constrain the answer space.
The Leaderboards and Evals team provided a demonstration of this by evaluating 8 different prompt formats for a well-known task, MMLU (looking at 4 subsets of the task). These prompt variations were provided to 5 different models (chosen because they were SOTA at the time for their size, and covered a range of tokenizations and languages). Scores were computed using a log-probability evaluation, where the most probable answer is considered the correct one, a classic metric for multiple-choice tasks.
Let's look at the different formats in more detail, using the first question of the global_facts subset of MMLU.
Question: “As of 2016, about what percentage of adults aged 18 years or older were overweight?”
Choices: [ "10%", "20%", "40%", "80%" ]
Correct choice: “40%”
| Without choices in the prompt | | |
|---|---|---|
| As of 2016, about what percentage of adults aged 18 years or older were overweight? | Q: As of 2016, about what percentage of adults aged 18 years or older were overweight?<br>A: | Question: As of 2016, about what percentage of adults aged 18 years or older were overweight?<br>Answer: |
| With choices in the prompt | | |
| Question: As of 2016, about what percentage of adults aged 18 years or older were overweight?<br>Choices: 10% 20% 40% 80%<br>Answer: | Question: As of 2016, about what percentage of adults aged 18 years or older were overweight?<br>Choices: A. 10% B. 20% C. 40% D. 80%<br>Answer: | Question: As of 2016, about what percentage of adults aged 18 years or older were overweight?<br>Choices: (A) 10% (B) 20% (C) 40% (D) 80%<br>Answer: |
| Log probs of 10%, 20%, 40%, 80% | Log probs of 10%, 20%, 40%, 80% vs A, B, C, D | Log probs of 10%, 20%, 40%, 80% vs (A), (B), (C), (D) |
Prompts either contain just the question, or some tags to indicate that we are in a question/answer format, and possibly the choices in the prompt. In all cases, evaluations compare the log-likelihood of the choices only. All these formats appear in the evaluation literature, and should contain virtually the same amount of information in each row. However, just below, you can see the wide variation in performance across these theoretically superficial changes!
Each model sees its performance vary by around 10 points, except for the most extreme example, Qwen1.5-7B, which drops all the way to an accuracy of 22.9% with the seventh prompt variation (mostly due to a tokenizer issue), while with essentially the same information it was able to reach an accuracy of up to 51.2% with another prompt.
In isolation, a change in score isn't necessarily a big deal as long as the ranking is consistent. However, as we can see in the next plot, ranking is impacted by these changes:
No model is consistently ranked across prompts, even though the only difference is their format, not the information itself. This means that if the authors of Gemma-7b wanted to show that their model was superior to Mistral-7B-v0.1, they could do so just by choosing the right prompt.
As almost nobody reports their precise evaluation setup, this is what has historically happened in model reports, where authors chose to report the setup most advantageous to their model (which is why you'll see extremely weird reported numbers of few-shots in some papers).
However, this isn't the only source of variance in model scores.
In extended experiments, we compared evaluations of the same models, with the same prompt formats, using the exact same few-shot samples shuffled differently before the prompt (an A/B/C/D/E prompt vs a C/D/A/B/E prompt, for example). The following figure shows the model score delta between these two few-shot orderings: we observe a difference of up to 3 points in performance for the same model/prompt combination!
If we want to be able to properly evaluate and compare different models we need a way to overcome this challenge.
Sclar et al.'s Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design also gives a good overview of this issue, and the authors introduce FormatSpread, a software tool that evaluates each model with multiple different format variations, then calculates the variance of that model's performance. Solutions such as this allow us to determine with more confidence which models are better than others, but they come at a high computational cost.
What if we focused on the output, not the input, to make results more consistent across these small changes to format?
While FormatSpread is a great attempt to make leaderboards more fair and honest, what we'd really like to have as practical users of LLMs is prompt consistency. That is, we would like to find some way to reduce this variance among prompts.
At .txt, we focus on improving and better understanding structured generation, which is when the output of a model is constrained to follow a specific structure. Our library, Outlines, allows us to structure the output of an LLM by defining a regular expression or a context-free grammar (we give examples below).
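As an illustration, here is a minimal sketch of what structured generation looks like with Outlines, assuming a 0.x version of the library where `outlines.models.transformers` and `outlines.generate.regex` are available (the model checkpoint and pattern here are placeholders chosen for the example):

```python
import outlines

# Load any transformers-compatible model (checkpoint name used here only as an example).
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Constrain the output to match a regular expression: here, a bare "yes" or "no".
generator = outlines.generate.regex(model, r"(yes|no)")

answer = generator("Are structured outputs easier to compare across prompts? Answer yes or no: ")
print(answer)  # guaranteed to be exactly "yes" or "no"
```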
Our initial use case for structured generation was to make LLMs easier to interact with programmatically, by ensuring responses in well-formatted JSON. However, we've continually been surprised by the other benefits of structured generation we've uncovered.
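For the JSON use case, Outlines can take a Pydantic model (or a JSON schema) and guarantee that the generated text parses into it. Again, a hedged sketch against a 0.x Outlines API; the `Answer` schema below is invented purely for illustration:

```python
import outlines
from pydantic import BaseModel


class Answer(BaseModel):
    reasoning: str
    final_answer: int


model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Constrain generation so that the output always parses into the Answer schema.
generator = outlines.generate.json(model, Answer)

result = generator("Q: What is 2 + 2? Respond as JSON with 'reasoning' and 'final_answer'.")
print(result.final_answer)  # result is a validated Answer instance
```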
When working on earlier research exploring the benefits of structured generation, we demonstrated that structured generation consistently improves benchmark performance, and came across an interesting edge case when exploring JSON structured prompting.
Generally, changing the prompt format to JSON, even when using unstructured generation, leads to improved benchmark performance for almost all models. However, this was not the case for MetaMath-Tulpar-7b-v2-Slerp, where we found a dramatic decrease in accuracy when using prompts formatted in JSON. Even more surprising was that when structured generation was used to constrain the output of the model, the dip in performance was negligible!
This led us to question whether or not structured generation could be exploited for prompt consistency.
Note on the experimental setup: Focusing on n-shot and shot order
While in the above experiments, Hugging Face's Leaderboard and Evals research team explored changes to the format of the prompt itself, for the next experiments we're going to restrict the changes.
To focus our exploration of prompt space, we're going to look at varying just two properties of the prompt:
- Varying the number of “shots” or examples used in the prompt (*n-shot*)
- Varying the order of those shots (shot order, specified by a shot seed)
For point 2, with a given n-shot we're only shuffling the same n examples. This means that all shuffles of a 1-shot prompt are the same. This is done to avoid conflating the format of a prompt with the information it contains. Clearly a 5-shot prompt contains more information than a 1-shot prompt, but every shuffling of a 5-shot prompt contains the same examples, only in a different order.
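Concretely, the shuffling can be pictured with a small sketch like the one below (the helper name and the use of Python's `random.Random` are our own illustration, not the exact harness code):

```python
import random


def build_shot_order(n_shot: int, seed: int) -> list[int]:
    """Return a seed-dependent permutation of the first n_shot example indices."""
    indices = list(range(n_shot))
    random.Random(seed).shuffle(indices)
    return indices


# Every 1-shot prompt is identical regardless of the seed...
assert build_shot_order(1, 42) == build_shot_order(1, 1337) == [0]

# ...while larger n-shot prompts contain the same examples in a different order.
print(build_shot_order(4, 42))    # a permutation of [0, 1, 2, 3]
print(build_shot_order(4, 1337))  # same indices, different order
```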
Initial Exploration: GSM8K 1-8 shot prompting
In order to test this out further, we wanted to explore the behavior of two very similar but strong models in the 7B parameter space: Mistral-7B-v0.1 and Zephyr-7B-beta. The reason behind this choice is to not only study variance in individual outcomes, but to look at the changes in relative ranking. We use the GSM8K task, which is a set of grade school math word problems.
Here is the basic format of a GSM8K 1-shot prompt with the implied structure highlighted.
In order to consistently generate correctly structured answers, we create a regular expression that matches the structure we see inherent in the original prompt format. The following regex is used in Outlines to define the structure for generation:
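(We don't reproduce the exact pattern from our runs verbatim here; the expression below is a sketch consistent with the description that follows.)

```python
# Sketch of the constraint: 200-700 characters of free-form reasoning,
# followed by the literal "The answer is " and a number of up to 10 digits
# that does not start with 0.
gsm8k_regex = r"([\s\S]{200,700})The answer is ([1-9][0-9]{0,9})"

# With the Outlines 0.x API sketched earlier, this would be plugged in as:
# generator = outlines.generate.regex(model, gsm8k_regex)
```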
We can see in the regex that we allow the model to reason for anywhere from 200 to 700 characters, then it must declare that “The answer is” and then reply with a number of up to 10 digits (that can't start with 0).
It's worth mentioning that the regex controlling the structure is similar, but not identical, to the regex used to parse out the answer. We've learned there's an interesting bit of nuance in defining the structure since, like the prompt, it can impact performance. For example, notice the {200,700} in the regex. This means that the model has 200 to 700 characters to “reason” before answering. Changing these values can impact performance and leads to something we refer to as “thought control”, an area we're hoping to write more about soon.
Our first experiment was to continue exploring the GSM8K dataset and iterate on 1- through 8-shot prompting. The results, shown below, were very compelling.
There are two major features we see in this figure: variance in performance across the n-shot setups was majorly reduced, and there were no instances where the ranking swapped (Mistral consistently leads over Zephyr). It's also worth mentioning that 1-shot structured performance is substantially better than 1-shot unstructured performance, and on par with 5-shot. This leads to another area of research we're terming “prompt efficiency”.
Diving Deeper: GPQA n-shot and shot order variations
For the next experiment we wanted to look at varying both the n-shots as well as the order of the n-shots. Order was controlled by setting the seed used for shuffling the examples. As mentioned previously, only the first n shots are shuffled to keep the information consistent between prompts, which means that all 1-shot prompts are the same across seeds. Here's an example of the shot order for 4-shot:
| seed | 4-shot order |
|---|---|
| 42 | 2-1-3-0 |
| 1337 | 1-0-3-2 |
| 1981 | 3-2-0-1 |
| 1992 | 0-3-1-2 |
| 12345 | 1-0-2-3 |
Additionally, to explore how transferable these results were, we changed the task to the Graduate-Level Google-Proof Q&A Benchmark (GPQA). GPQA is a hard, knowledge-based multiple-choice evaluation task. Below is the prompt format and highlighted structure.
For this next experiment we're specifically using the 'diamond' subset, which represents curated and cleaned-up, high-quality questions. Of the 198 questions in this dataset we reserve 8 for n-shot prompting (though we only ever used the first 5), and then evaluate on the remaining 190 questions.
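Schematically, the sweep that produces each accuracy grid looks something like the sketch below. The `evaluate_subset` function is a placeholder we invented for illustration; the real harness builds the prompt from the reserved shots in the given order (with or without structured generation) and scores the 190 held-out questions.

```python
import random

SEEDS = [42, 1337, 1981, 1992, 12345]
N_SHOTS = [1, 2, 3, 4, 5]


def shot_order(n_shot: int, seed: int) -> list[int]:
    """Shuffle the indices of the first n_shot reserved examples with a fixed seed."""
    order = list(range(n_shot))
    random.Random(seed).shuffle(order)
    return order


def evaluate_subset(model_name: str, order: list[int], structured: bool) -> float:
    """Placeholder: prompt the model with the reserved shots in the given order and
    return accuracy over the 190 held-out GPQA diamond questions."""
    return 0.0  # stand-in value, not a real result


def accuracy_grid(model_name: str, structured: bool) -> dict[tuple[int, int], float]:
    """One accuracy value per (n_shot, seed) cell, as in the grids shown below."""
    return {
        (n, seed): evaluate_subset(model_name, shot_order(n, seed), structured)
        for n in N_SHOTS
        for seed in SEEDS
    }
```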
Visualized below is a grid representing the accuracy achieved for all the possible combinations of shot seed and n, for the two models, both without (left) and with (right) structured generation.
One thing which immediately stands out is that the structured output tends to score higher than the unstructured output across the board. We show the mean of each grid for structured and unstructured below:
Mean of results across prompt seed and n-shot
| model | unstructured | structured |
|---|---|---|
| Mistral-7B-v0.1 | 0.2360 | 0.2935 |
| Zephyr-7b-beta | 0.2387 | 0.3048 |
Additionally, across all the values in the grid we also find reduced variance when comparing structured with unstructured generation.
Standard deviation in results across prompt seed and n-shot
| model | unstructured | structured |
|---|---|---|
| Mistral-7B-v0.1 | 0.0213 | 0.0202 |
| Zephyr-7b-beta | 0.0273 | 0.0180 |
This reduction in variance across the grid is similar to the reduction in variance we saw when looking at just n-shot changes for GSM8K.
While increased expected performance and decreased variance are great properties to have, what we really want to know is the impact on ranking. In the next plot we examine these grids in terms of which of the two models would be declared the winner:
- A: Zephyr-7b-beta
- B: Mistral-7B-v0.1
- “-”: tie
As we can see from these images, there is a major improvement in the consistency of calling a winner when structured generation is applied. These results paint a consistent picture with the findings we had using GSM8K across varying n-shot.
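For completeness, here is roughly how the summary statistics and the per-cell winner labels above can be computed from the two accuracy grids, using NumPy; the arrays below are random placeholders, not our actual results:

```python
import numpy as np

# Placeholder accuracy grids of shape (n_shots, n_seeds); real values come from the sweep.
rng = np.random.default_rng(0)
zephyr_grid = rng.uniform(0.25, 0.35, size=(5, 5))   # Zephyr-7b-beta
mistral_grid = rng.uniform(0.25, 0.35, size=(5, 5))  # Mistral-7B-v0.1

# Summary statistics like those reported in the tables above.
print("Zephyr  mean/std:", zephyr_grid.mean(), zephyr_grid.std())
print("Mistral mean/std:", mistral_grid.mean(), mistral_grid.std())

# Per-cell winner: "A" for Zephyr-7b-beta, "B" for Mistral-7B-v0.1, "-" for a tie.
winner = np.where(zephyr_grid > mistral_grid, "A",
                  np.where(mistral_grid > zephyr_grid, "B", "-"))
print(winner)
```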
Conclusion and Future Work
While these results are incredibly promising, we still need to explore them across more models and more tasks. What we've seen so far is that structured generation could prove to be an essential part of evaluation. Simultaneously increasing the expected score and decreasing the variance across prompt changes is a very promising result that deserves further research.










