Suggestions for Getting the Generation Part Right in Retrieval Augmented Generation


Image created by the author using DALL-E 3

Results from experiments to evaluate and compare GPT-4, Claude 2.1, and Claude 3 Opus

My thanks to Evan Jolley for his contributions to this piece

New evaluations of RAG systems are published seemingly every day, and many of them focus on the retrieval stage of the framework. However, the generation aspect — how a model synthesizes and articulates the retrieved information — may hold equal if not greater significance in practice. Many use cases in production are not simply returning a fact from the context; they also require synthesizing that fact into a more complicated response.

We ran several experiments to evaluate and compare the generation capabilities of GPT-4, Claude 2.1, and Claude 3 Opus. This article details our research methodology, results, and the model nuances encountered along the way, as well as why this matters to people building with generative AI.

Everything needed to reproduce the results can be found in this GitHub repository.

Takeaways

  • Although initial findings indicate that Claude outperforms GPT-4, subsequent tests reveal that with strategic prompt engineering, GPT-4 demonstrates superior performance across a broader range of evaluations. Inherent model behaviors and prompt engineering matter A LOT in RAG systems.
  • Simply adding “Please explain yourself then answer the question” to a prompt template significantly improves (more than 2X) GPT-4’s performance. It’s clear that when an LLM talks answers out, it seems to help in unfolding ideas. It’s possible that by explaining, a model is reinforcing the right answer in embedding/attention space.
Diagram created by the author

While retrieval is responsible for identifying and fetching the most pertinent information, it is the generation phase that takes this raw data and transforms it into a coherent, meaningful, and contextually appropriate response. The generative step is tasked with synthesizing the retrieved information, filling in gaps, and presenting it in a manner that is easily comprehensible and relevant to the user’s query.

In many real-world applications, the value of RAG systems lies not only in their ability to locate a specific fact or piece of information but also in their capacity to integrate and contextualize that information within a broader framework. The generation phase is what enables RAG systems to move beyond simple fact retrieval and deliver truly intelligent and adaptive responses.

The initial test we ran involved generating a date string from two randomly retrieved numbers: one representing the month and the other the day. The models were tasked with:

  1. Retrieving Random Number #1
  2. Isolating the last digit and incrementing by 1
  3. Generating a month for our date string from the result
  4. Retrieving Random Number #2
  5. Generating the day for our date string from Random Number 2

For example, the random numbers 4827143 and 17 would represent April 17th.
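For clarity, here is a minimal Python sketch of how the ground-truth answer for this task can be computed. The helper name and the exact output format are illustrative assumptions, not code from our repository.

```python
import calendar

def expected_date_string(random_number_1: int, random_number_2: int) -> str:
    # Isolate the last digit of the first number and increment it by 1 to get the month.
    month = (random_number_1 % 10) + 1
    # The second number is used directly as the day.
    day = random_number_2
    return f"{calendar.month_name[month]} {day}"

print(expected_date_string(4827143, 17))  # -> "April 17"
```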

These numbers were placed at various depths within contexts of varying length. The models initially had quite a difficult time with this task.
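To make the setup concrete, below is a rough sketch of how a “needle” can be buried at a chosen depth inside filler text. The function name, the filler strategy, and the crude four-characters-per-token estimate are assumptions for illustration, not our exact implementation.

```python
import random

def build_context(needle: str, filler_sentences: list[str], depth: float, target_tokens: int) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) inside a context
    of roughly `target_tokens` tokens, using ~4 characters per token as a rough estimate."""
    filler: list[str] = []
    while sum(len(s) for s in filler) < target_tokens * 4:
        filler.append(random.choice(filler_sentences))
    insert_at = int(len(filler) * depth)
    return " ".join(filler[:insert_at] + [needle] + filler[insert_at:])

# Example: bury the first random number about a quarter of the way into a ~10k-token context.
context = build_context("Random Number #1 is 4827143.",
                        ["The quick brown fox jumps over the lazy dog."],
                        depth=0.25, target_tokens=10_000)
```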

Figure 1: Initial test results (image by author)

While neither model performed great, Claude 2.1 significantly outperformed GPT-4 in our initial test, nearly quadrupling its success rate. It was here that Claude’s verbose nature — providing detailed, explanatory responses — seemed to give it a distinct advantage, resulting in more accurate outcomes compared to GPT-4’s initially concise replies.

Prompted by these unexpected results, we introduced a new variable to the experiment. We instructed GPT-4 to “explain yourself then answer the question,” a prompt that encouraged a more verbose response akin to Claude’s natural output. The impact of this minor adjustment was profound.
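The adjustment amounts to a single extra instruction line in the prompt template. The wording below is illustrative rather than the exact template used in our tests.

```python
# Baseline template: ask for the answer only.
CONCISE_PROMPT = """{context}

Using the two random numbers hidden in the context above, construct the date string.
Answer with the date string only."""

# Adjusted template: identical except for the added "explain yourself" line.
VERBOSE_PROMPT = """{context}

Using the two random numbers hidden in the context above, construct the date string.
Please explain yourself then answer the question."""
```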

Figure 2: Initial test with targeted prompt results (image by author)

GPT-4’s performance improved dramatically, achieving flawless results in subsequent tests. Claude’s results also improved, though to a lesser extent.

This experiment not only highlights the differences in how language models approach generation tasks but also showcases the potential impact of prompt engineering on their performance. The verbosity that appeared to be Claude’s advantage turned out to be a replicable strategy for GPT-4, suggesting that the way a model processes and presents its reasoning can significantly influence its accuracy in generation tasks. Overall, including the seemingly minute “explain yourself” line in our prompt played a role in improving the models’ performance across all of our experiments.

Figure 3: Four further tests used to evaluate generation (image by author)

We conducted four more tests to assess the models’ ability to synthesize and transform retrieved information into various formats (a minimal sketch of the expected outputs follows the list):

  • String Concatenation: Combining pieces of text to form coherent strings, testing the models’ basic text manipulation skills.
  • Money Formatting: Formatting numbers as currency, rounding them, and calculating percentage changes to evaluate the models’ precision and ability to handle numerical data.
  • Date Mapping: Converting a numerical representation into a month name and date, requiring a blend of retrieval and contextual understanding.
  • Modulo Arithmetic: Performing complex number operations to test the models’ mathematical generation capabilities.
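As a rough illustration of what “correct” means for two of these tasks, here is a sketch of ground-truth helpers for money formatting and modulo arithmetic. The rounding rules, the percent-change formula, and the specific modulo expression are assumptions made for the example, not the precise task specifications from our repository.

```python
def expected_money_format(amount: float, previous: float) -> dict:
    # Assumed spec: round to two decimals, render as USD, report percent change vs. `previous`.
    pct_change = (amount - previous) / previous * 100
    return {"formatted": f"${amount:,.2f}", "percent_change": f"{pct_change:+.1f}%"}

def expected_modulo(a: int, b: int, m: int) -> int:
    # Assumed spec: the model must report (a * b) mod m.
    return (a * b) % m

print(expected_money_format(1234.567, 1000.0))
# {'formatted': '$1,234.57', 'percent_change': '+23.5%'}
print(expected_modulo(4827143, 17, 9))  # -> 7
```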

Unsurprisingly, each model exhibited strong performance in string concatenation, reaffirming previous understanding that text manipulation is a fundamental strength of language models.

Figure 4: Money formatting test results (image by author)

In the money formatting test, Claude 3 and GPT-4 performed almost flawlessly, while Claude 2.1’s performance was generally poorer overall. Accuracy did not vary considerably across token length, but was generally lower when the needle was closer to the beginning of the context window.

Figure 5: Normal haystack test results (image by author)

Despite stellar results in the generation tests, Claude 3’s accuracy declined in a retrieval-only experiment. Theoretically, simply retrieving numbers should be an easier task than manipulating them as well — making this decrease in performance surprising and an area where we’re planning further testing. If anything, this counterintuitive dip only further confirms the notion that both retrieval and generation should be tested when developing with RAG.

By testing various generation tasks, we observed that while both models excel at menial tasks like string manipulation, their strengths and weaknesses become apparent in more complex scenarios. LLMs are still not great at math! Another key result was that the introduction of the “explain yourself” prompt notably enhanced GPT-4’s performance, underscoring the importance of how models are prompted and how they articulate their reasoning in achieving accurate results.

These findings have broader implications for the evaluation of LLMs. When comparing models like the verbose Claude and the initially less verbose GPT-4, it becomes evident that evaluation criteria must extend beyond mere correctness. The verbosity of a model’s responses introduces a variable that can significantly influence perceived performance. This nuance may suggest that future model evaluations should consider the average length of responses as a noted factor, providing a better understanding of a model’s capabilities and ensuring a fairer comparison.
