Top Evaluation Metrics for RAG Failures
If you’ve been experimenting with large language models (LLMs) for search and retrieval tasks, you’ve likely come across retrieval augmented generation (RAG) as a method for adding relevant contextual information to LLM-generated responses. By connecting an LLM to your own data, RAG can produce better responses by feeding relevant data into the context window.

RAG has been shown to be highly effective for complex query answering, knowledge-intensive tasks, and enhancing the precision and relevance of responses from AI models, especially in situations where standalone training data falls short.

However, these benefits of RAG can only be realized if you are continuously monitoring your LLM system at its common failure points, most notably with response and retrieval evaluation metrics. In this piece we’ll walk through the best workflows for troubleshooting poor retrieval and response metrics.

It’s worth remembering that RAG works best when the required information is readily available. Whether relevant documents are available focuses RAG system evaluation on two critical areas:

  • Retrieval Evaluation: To assess the accuracy and relevance of the documents that were retrieved
  • Response Evaluation: To measure the appropriateness of the response generated by the system when the retrieved context was provided
Figure 2: Response Evals and Retrieval Evals in an LLM Application (image by author)

Table 1: Response Evaluation Metrics

Table 1 by author

Table 2: Retrieval Evaluation Metrics

Table 2 by author

Let’s review three potential scenarios to troubleshoot poor LLM performance based on the flow diagram.

Scenario 1: Good Response, Good Retrieval

Diagram by author

In this scenario, everything in the LLM application is behaving as expected: we have a good response and a good retrieval. We find our response evaluation is “correct” and our “Hit = True.” Hit is a binary metric, where “True” means the relevant document was retrieved and “False” means it was not. Note that the aggregate statistic for Hit is the hit rate (the percent of queries that retrieve relevant context).
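As a quick illustration of how per-query Hit values roll up into a hit rate (the flags below are made-up examples, not values from the application above):

```python
# Hit is binary per query; hit rate is the share of queries whose retrieved
# context contained a relevant document. These flags are illustrative only.
hits = [True, True, False, True]      # one Hit flag per evaluated query
hit_rate = sum(hits) / len(hits)
print(f"hit rate: {hit_rate:.0%}")    # 75%
```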

For our response evaluations, correctness is an evaluation metric that can be computed simply from a combination of the input (query), output (response), and context, as can be seen in Table 1. Several of these evaluation criteria don’t require user-labeled ground truth, since LLMs can also be used to generate labels, scores, and explanations with tools like OpenAI function calling. Below is an example prompt template.

Image by author
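For readers who want to try this kind of LLM-as-judge eval, here is a minimal sketch using OpenAI function calling; the prompt wording, model name, and function schema are assumptions for illustration, not the exact template shown above.

```python
# A minimal LLM-as-judge correctness eval using OpenAI function calling.
# Prompt wording, model name, and schema below are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVAL_PROMPT = """You are evaluating a RAG system's answer.
Question: {query}
Reference context: {context}
Answer: {response}
Decide whether the answer is correct given the question and context."""

def evaluate_correctness(query: str, context: str, response: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap for whichever you use
        messages=[{"role": "user", "content": EVAL_PROMPT.format(
            query=query, context=context, response=response)}],
        tools=[{
            "type": "function",
            "function": {
                "name": "record_eval",  # hypothetical function name
                "description": "Record the correctness label and explanation.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "label": {"type": "string", "enum": ["correct", "incorrect"]},
                        "explanation": {"type": "string"},
                    },
                    "required": ["label", "explanation"],
                },
            },
        }],
        tool_choice={"type": "function", "function": {"name": "record_eval"}},
    )
    # The forced tool call returns structured arguments we can parse as JSON.
    args = completion.choices[0].message.tool_calls[0].function.arguments
    return json.loads(args)
```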

These LLM evals can be formatted as numeric, categorical (binary and multi-class), or multi-output (multiple scores or labels), with categorical-binary being the most commonly used and numeric being the least commonly used.

Scenario 2: Bad Response, Bad Retrieval

Diagram by author

In this scenario, we find that the response is incorrect and the relevant content was not retrieved. Based on the query, we see that the content wasn’t retrieved because there is no answer to the query: the LLM cannot predict future purchases no matter what documents it is supplied. However, the LLM can generate a better response than hallucinating an answer. The fix here is to experiment with the prompt that generates the response, simply adding a line to the LLM prompt template such as “if relevant content is not provided and no conclusive answer is found, respond that the answer is unknown.” In some cases the correct answer is that the answer does not exist.
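A minimal sketch of what such a revised template might look like (the template text, variable names, and example query are assumptions, not the author’s exact prompt):

```python
# Hypothetical RAG prompt template with an explicit "answer unknown" fallback line.
RAG_PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}

If relevant content is not provided and no conclusive answer is found,
respond that the answer is unknown rather than guessing."""

prompt = RAG_PROMPT_TEMPLATE.format(
    context="...retrieved chunks go here...",
    question="What will this customer purchase next quarter?",  # made-up example query
)
```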

Diagram by author

Scenario 3: Bad Response, Mixed Retrieval Metrics

In this third scenario, we see an incorrect response with mixed retrieval metrics: the relevant document was retrieved, but the LLM hallucinated an answer because it was given too much information.

Diagram by author

To evaluate an LLM RAG system, you need to both fetch the right context and then generate an appropriate answer. Typically, developers embed a user query and use it to search a vector database for relevant chunks (see Figure 3). Retrieval performance hinges not only on the returned chunks being semantically similar to the query, but on whether those chunks provide enough relevant information to generate the correct response to the query. Now, you need to configure the parameters around your RAG system (type of retrieval, chunk size, and K).

Figure 3: RAG Framework (by author)
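To make that retrieval step concrete, here is a bare-bones sketch of embedding a query and ranking chunks by cosine similarity; the chunk texts, embedding model, and top_k value are assumptions, and a real system would use a vector database rather than an in-memory array.

```python
# Minimal sketch of the retrieval step: embed the query and rank document chunks
# by cosine similarity. Chunk texts, model name, and top_k are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]  # pre-split documents
chunk_vecs = embed(chunks)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    q = embed([query])[0]
    # Cosine similarity between the query vector and each chunk embedding.
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:top_k]]
```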

As with our last scenario, we can try editing the prompt template or swapping out the LLM used to generate responses. Since the relevant content is retrieved during the document retrieval process but isn’t being surfaced by the LLM, this could be a quick solution. Below is an example of an accurate response generated from running a revised prompt template (after iterating on prompt variables, LLM parameters, and the prompt template itself).

Diagram by author

When troubleshooting bad responses with mixed performance metrics, we first need to determine which retrieval metrics are underperforming. The easiest way to do this is to implement thresholds and monitors. Once you are alerted to a specific underperforming metric, you can resolve it with a specific workflow. Let’s take nDCG for example. nDCG measures the effectiveness of your top-ranked documents and takes into account the position of relevant docs, so if you retrieve your relevant document (Hit = ‘True’) but it sits low in the results, you will want to consider implementing a reranking technique to get the relevant documents closer to the top-ranked search results.
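For intuition, here is a small worked example of nDCG@K with binary relevance; the relevance judgments are made up to match the case where the relevant document is retrieved but ranked low.

```python
# nDCG@K for binary relevance: DCG discounts relevance by log2 of the position,
# and we normalize by the DCG of an ideal (perfectly sorted) ranking.
import math

def dcg(relevances: list[int]) -> float:
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances: list[int], k: int) -> float:
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevant doc retrieved (Hit = True) but sitting at position 3 of 4:
print(ndcg_at_k([0, 0, 1, 0], k=4))  # 0.5; reranking it to position 1 would give 1.0
```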

For our current scenario, we retrieved a relevant document (Hit = ‘True’) and that document is in the first position, so let’s try to improve the precision (percent of relevant documents) across the ‘K’ retrieved documents. Currently our Precision@4 is 25%, but if we retrieved only the first two documents then Precision@2 = 50%, since half of the documents are relevant. This change leads to the correct response from the LLM because it is given less information overall, but proportionally more relevant information.
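As a quick check on that arithmetic (the relevance pattern below is an assumption matching the scenario: one relevant document in position one, three irrelevant documents after it):

```python
# Precision@K: the fraction of the top K retrieved documents that are relevant.
def precision_at_k(relevances: list[int], k: int) -> float:
    top = relevances[:k]
    return sum(top) / len(top)

retrieved = [1, 0, 0, 0]   # relevant doc first, three irrelevant docs after it
print(precision_at_k(retrieved, k=4))   # 0.25 -> Precision@4 = 25%
print(precision_at_k(retrieved, k=2))   # 0.5  -> Precision@2 = 50%
```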

Diagram by author

Essentially, what we are seeing here is a common problem in RAG known as “lost in the middle”: the LLM is overwhelmed with too much information that is not always relevant, and is then unable to give the best possible answer. From our diagram, we see that adjusting chunk size is one of the first things many teams do to improve RAG applications, but it’s not always intuitive. With context overflow and lost-in-the-middle problems, more documents are not always better, and reranking won’t necessarily improve performance. To evaluate which chunk size works best, you need to define an eval benchmark and do a sweep over chunk sizes and top-k values, as sketched below. In addition to experimenting with chunking strategies, testing out different text extraction techniques and embedding methods can also improve overall RAG performance.
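A sketch of what such a sweep might look like; build_index, answer, and score_response are hypothetical placeholders standing in for your own pipeline and eval benchmark, and only the sweep structure is the point.

```python
# Sweep chunk size and top-k against an eval benchmark and keep the best pair.
from itertools import product

# Placeholder stand-ins for your own RAG pipeline and eval metric (assumptions):
def build_index(chunk_size: int):                 # re-chunk and re-embed the corpus
    return {"chunk_size": chunk_size}

def answer(index, query: str, top_k: int) -> str:  # retrieve top_k chunks, call the LLM
    return "stub answer"

def score_response(response: str, expected: str) -> float:  # e.g. correctness eval
    return float(response == expected)

benchmark = [{"query": "q1", "expected": "a1"}]    # your eval set

results = {}
for chunk_size, top_k in product([256, 512, 1024], [2, 4, 8]):
    index = build_index(chunk_size=chunk_size)
    scores = [score_response(answer(index, ex["query"], top_k=top_k), ex["expected"])
              for ex in benchmark]
    results[(chunk_size, top_k)] = sum(scores) / len(scores)

best = max(results, key=results.get)
print(f"best (chunk_size, top_k): {best}")
```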

The response and retrieval evaluation metrics and approaches in this piece offer a comprehensive way to view an LLM RAG system’s performance, guiding developers and users in understanding its strengths and limitations. By continually evaluating these systems against these metrics, improvements can be made that enhance RAG’s ability to provide accurate, relevant, and timely information.

Additional advanced methods for improving RAG include reranking, metadata attachments, testing out different embedding models, testing out different indexing methods, implementing HyDE, implementing keyword search methods, and implementing Cohere document mode (similar to HyDE). Note that while these more advanced methods, like chunking, text extraction, and embedding model experimentation, may produce more contextually coherent chunks, they are also more resource-intensive. Using RAG together with these advanced methods can deliver performance improvements to your LLM system, and will continue to do so as long as your retrieval and response metrics are properly monitored and maintained.

Questions? Please reach out to me here or on LinkedIn, X, or Slack!
