Finding Golden Examples: A Smarter Approach to In-Context Learning


Context

When using Large Language Models (LLMs), In-Context Learning (ICL), where example inputs and outputs are provided to the LLM before it handles the actual input, has proven very effective at guiding the model in the right direction. Multiple ICL strategies have been developed. Some popular ones are one-shot (provide one example), few-shot (provide multiple examples), and chain-of-thought (show step-by-step reasoning in the examples). Let's take a simple example to better understand in-context learning.

When you ask an LLM, "What animal makes the sound 'moo' and what's its type?" you might expect an answer like "Cow, mammal." But LLMs tend to give more information. For instance, when asked the same question, ChatGPT gave the following answer:

User: What animal makes the sound 'moo' and what's its type?
The animal that makes the sound "moo" is a cow.
Animal: Cow
Type: Mammal (specifically, a domesticated ungulate belonging to the species Bos taurus)
Other examples of non-mammals include birds (like eagles 🦅), reptiles (like snakes 🐍), fish (like salmon 🐟), and insects (like butterflies 🦋).

Link: https://chatgpt.com/share/6886636f-7b48-8000-a477-54405edd7e43

Now, to get the LLM to produce results in the format we expect (animal name, type), we could fine-tune (train) the LLM to output in that format, but this is rarely feasible since training LLMs requires significant resources. Alternatively, at inference time we can provide examples before asking the question to show the LLM our expected format. That's exactly what in-context learning is: the LLM learns the pattern from the provided examples to perform the task at hand. Here is the same interaction with ChatGPT, this time with in-context examples:

User: What animal makes the 'woof' sound and what's its type?
Assistant: Dog, mammal
User: What animal makes the 'meow' sound and what's its type?
Assistant: Cat, mammal
User: What animal makes the sound 'moo' and what's its type?

This time, the LLM gave the right answer: Cow, mammal. 

Link: https://chatgpt.com/share/688664f0-96f0-8000-9125-6a40b24d2773
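For readers who want to reproduce this programmatically, here is a minimal sketch of building a few-shot prompt, assuming an OpenAI-style chat client; the model name and message layout are illustrative choices, not something prescribed by ICL itself.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# In-context examples: each (question, answer) pair demonstrates the expected format.
examples = [
    ("What animal makes the 'woof' sound and what's its type?", "Dog, mammal"),
    ("What animal makes the 'meow' sound and what's its type?", "Cat, mammal"),
]

messages = []
for question, answer in examples:
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})

# The real query comes last, so the model imitates the demonstrated format.
messages.append({"role": "user",
                 "content": "What animal makes the sound 'moo' and what's its type?"})

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)  # expected: "Cow, mammal"
```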

As we can see, LLMs adapt well to In-Context Learning (ICL). Research has shown that ICL boosts the performance and accuracy of LLMs. But ICL is fragile: performance is very sensitive to which examples you select, their order, and even minor formatting changes. ICL works through pattern matching rather than true learning, so it relies heavily on superficial cues. For a complex task like code repair or text-to-SQL, one set of examples might work well while another might drop the accuracy significantly. Hence, ICL's most important challenge is: how do we select examples that actually help (not just any examples)?

In this post, we'll look at the research paper AuPair: Golden Example Pairs for Code Repair, published by Google DeepMind, which systematically addresses these issues. AuPair specifically tackles example selection for code repair tasks (fixing buggy code). This post aims to explain the core ideas behind their work and build a foundation for understanding how to systematically generate examples for ICL.

Effective Example Selection

We now understand that the primary challenge of ICL is finding the right set of examples. Before we look into how AuPair tackles this problem, let's look at the traditional approach to example selection. Usually, for domain-specific problems (like code generation/repair or text-to-SQL), we pick a few examples using our own judgment, or pick problems from the dataset, write examples for those chosen problems, and use them at runtime for ICL. A common extension of this is to build a pool of examples and use similarity search to pull the most relevant ones at runtime to inject as in-context examples.
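As a rough illustration of that retrieval-style baseline, here is a minimal sketch that pulls the most similar pool examples for a query using TF-IDF similarity; the example pool and helper name are hypothetical, and real systems typically use embedding models instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A hand-built pool of (problem, solution) examples; in practice this would be
# curated for the target domain (code repair, text-to-SQL, etc.).
example_pool = [
    ("Reverse a string in Python", "s[::-1]"),
    ("Sum a list of integers", "sum(xs)"),
    ("Check whether a number is even", "n % 2 == 0"),
]

def retrieve_examples(query: str, k: int = 2):
    """Return the k pool examples most similar to the query text."""
    texts = [problem for problem, _ in example_pool] + [query]
    tfidf = TfidfVectorizer().fit_transform(texts)
    scores = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    top_indices = scores.argsort()[::-1][:k]
    return [example_pool[i] for i in top_indices]

# The retrieved examples would then be injected into the prompt as in-context examples.
print(retrieve_examples("How do I reverse a list?"))
```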

In this traditional curation process, we have no way to measure which example is most effective at anchoring the LLM in the right direction. Now let's look at AuPair's approach and how it addresses this problem. Instead of picking random examples, AuPair first builds a large dataset of candidate example pairs and then applies a greedy selection algorithm to pick the best-performing ones. Let's look at each step one by one.

Phase 1: Example Pair Generation

Image by Author

The first step is to create a large collection of candidate repair pairs. AuPair starts with a dataset of coding problems that have test cases. For each problem, it asks the LLM to generate an initial solution (a guess). If this guess is partially correct (score strictly between 0 and 1), it gets added to the training dataset.

The repair process takes this broken code and asks the LLM to fix it using a few-shot prompt with k randomly chosen existing pairs as context (k = 32 was used in the experiments). If the generated fix scores higher than the original guess, the (guess → fix) combination becomes a candidate pair. The clever part is that if the fix is still not perfect, it becomes new "broken" code that gets added back to the training dataset for further improvement in the next iteration. This creates chains of incremental improvements. AuPair repeats this process thousands of times to build a large pool of candidate pairs covering different types of bugs and their fixes.
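Here is a simplified sketch of that generation loop, based on the description above. The functions passed in (generate_guess, generate_fix, score) stand for the LLM calls and the unit-test scorer; they are assumptions for illustration, not the paper's reference implementation.

```python
import random

def build_candidate_pairs(problems, generate_guess, generate_fix, score,
                          n_iterations=1000, k=32):
    """Phase 1 sketch: grow a pool of (broken_code, fixed_code) candidate pairs."""
    training_set, candidate_pairs = [], []

    # Seed the training set with partially correct initial guesses (0 < score < 1).
    for problem in problems:
        guess = generate_guess(problem)                 # LLM call (assumed helper)
        if 0 < score(problem, guess) < 1:
            training_set.append((problem, guess))

    for _ in range(n_iterations):
        problem, broken = random.choice(training_set)
        # Few-shot context: k randomly chosen existing pairs (k = 32 in the paper).
        context = random.sample(candidate_pairs, min(k, len(candidate_pairs)))
        fix = generate_fix(problem, broken, context)    # LLM call (assumed helper)

        if score(problem, fix) > score(problem, broken):
            candidate_pairs.append((broken, fix))       # keep the improved pair
            if score(problem, fix) < 1:
                # Still imperfect: the fix becomes new "broken" code to improve later.
                training_set.append((problem, fix))

    return candidate_pairs
```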

Phase 2: Golden (Au) Pair Extraction

Once we have the candidate pair dataset, we need to pick the most effective pairs. This process happens in two steps: first, we measure how much impact each candidate repair pair has, and second, we select the best ones using a greedy algorithm.

Let’s first look into how the effectiveness of candidate repair pairs is measured.

Image by Author

To measure effectiveness, we first create a validation dataset, essentially a set of broken code problems. Then, for each problem in the validation dataset, we take each candidate repair pair and use it as a one-shot example alongside the validation problem to generate a fix. Once the fix is generated, it is run against the unit test cases, and a score is calculated for that validation problem.

This gives us a quality matrix M where M[i, j] represents how well candidate pair i helps solve validation problem j, a comprehensive view of which pairs are most helpful across different types of problems.
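A minimal sketch of building such a quality matrix, again with the one-shot repair call and the unit-test scorer passed in as assumed helpers:

```python
import numpy as np

def build_quality_matrix(candidate_pairs, validation_problems,
                         generate_fix_with_pair, score):
    """M[i, j] = score on validation problem j when candidate pair i is used
    as the single in-context example."""
    M = np.zeros((len(candidate_pairs), len(validation_problems)))
    for i, pair in enumerate(candidate_pairs):
        for j, problem in enumerate(validation_problems):
            fix = generate_fix_with_pair(problem, pair)  # 1-shot LLM call (assumed)
            M[i, j] = score(problem, fix)                # unit-test score in [0, 1]
    return M
```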

Algorithm from AuPair paper

The next step is to find the AuPairs using the measured effectiveness. The algorithm picks the candidate pair with the highest average score across all validation problems and adds it to the AuPair list. The crucial next step is to subtract this pair's contribution from all remaining pairs in the matrix. This ensures we don't pick redundant pairs but keep the selected pairs complementary: each new AuPair must solve different problems than the previously chosen ones. The process continues until the improvement falls below a threshold, resulting in an ordered list of golden pairs where each one teaches something unique.
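The greedy extraction can be sketched as follows. The update rule here, subtracting the chosen pair's row and clipping at zero, is one reasonable reading of "subtract this pair's contribution"; consult the paper's algorithm for the exact formulation.

```python
import numpy as np

def extract_aupairs(M, candidate_pairs, threshold=0.01):
    """Greedy sketch: repeatedly pick the pair with the highest average score,
    then subtract its contribution so later picks stay complementary."""
    M = M.astype(float).copy()
    aupairs = []
    while True:
        avg_scores = M.mean(axis=1)
        best = int(avg_scores.argmax())
        if avg_scores[best] < threshold:
            break  # remaining pairs add too little improvement
        aupairs.append(candidate_pairs[best])
        # Remove the chosen pair's contribution from every row (clipped at zero),
        # so pairs that only solve the same validation problems lose their value.
        M = np.maximum(M - M[best], 0.0)
    return aupairs
```

Given the matrix from the previous sketch, calling extract_aupairs(M, candidate_pairs) returns the ordered list of golden pairs.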

Image from AuPair paper

Experiment Results

AuPair was benchmarked across 7 different coding problem datasets using 5 different LLMs. It consistently outperformed self-reflection and best-of-N sampling approaches. The results further show that AuPairs achieve 2–3x higher compute efficiency: it takes only 12 AuPairs to reach the same performance that requires 32 random pairs. The results also show that AuPairs generated on the CodeForces dataset worked effectively on entirely different datasets like HackerEarth and AtCoder. This shows that once we've built a good set of golden pairs, they generalize well to new problems in the same domain.

Limitations

AuPair shows promising results, but it has a few constraints too. First, it requires substantial compute for the LLM calls that generate candidate example pairs through iterative repair. Second, it relies heavily on evaluation metrics (like unit tests for code) to measure improvement, which may not be available in all domains, and it assumes that complementary examples lead to better performance; while this held for coding problems, it may not be true everywhere. Finally, AuPair was benchmarked on structured contest problems rather than more complex real-world codebases.

Conclusion

AuPair shows us a smarter way to do in-context learning for code repair tasks. Instead of randomly picking examples, it uses a systematic approach to find the most effective repair patterns, the ones that actually help the LLM perform better. While it needs significant upfront compute and works best when you have good evaluation metrics, the results suggest it's worth the investment, especially since the golden pairs transfer well across datasets. This research opens up possibilities for applying similar example-selection techniques to other domains (e.g., text-to-SQL) where we can systematically generate and measure example effectiveness.

References
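AuPair: Golden Example Pairs for Code Repair, Google DeepMind.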
