How to Evaluate Retrieval Quality in RAG Pipelines (Part 3): DCG@k and NDCG@k

In the previous parts of my post series on retrieval evaluation measures for RAG pipelines, we took an in-depth look at the binary retrieval evaluation metrics. More specifically, in Part 1, we went over binary, order-unaware retrieval evaluation metrics, like HitRate@K, Recall@K, Precision@K, and F1@K. Binary, order-unaware retrieval evaluation metrics are essentially the most basic kind of measure we can use for scoring the performance of our retrieval mechanism; they simply classify a result as either relevant or irrelevant, and evaluate whether relevant results make it into the retrieved set.

Then, in Part 2, we reviewed binary, order-aware evaluation metrics like Mean Reciprocal Rank (MRR) and Average Precision (AP). Binary, order-aware measures also categorize results as either relevant or irrelevant and check whether they appear in the retrieved set, but on top of this, they also quantify how well the results are ranked. In other words, they also take into account the rank at which each result is retrieved, apart from whether it is retrieved in the first place.

In this final part of the retrieval evaluation metrics post series, I am going to further elaborate on the other large category of metrics, beyond binary metrics: graded relevance metrics. Unlike binary metrics, where results are either relevant or irrelevant, for graded metrics relevance is rather a spectrum. In this way, a retrieved chunk can be more or less relevant to the user's query.

Two commonly used graded relevance metrics that we are going to take a look at in today's post are Discounted Cumulative Gain (DCG@k) and Normalized Discounted Cumulative Gain (NDCG@k).



Some graded measures

For graded retrieval measures, it is first of all important to understand the concept of graded relevance. That is, for graded measures, a retrieved item can be more or less relevant, as quantified by its graded relevance score rel_i.

Image by author

🎯 Discounted Cumulative Gain (DCG@k)

Discounted Cumulative Gain (DCG@k) is a graded, order-aware retrieval evaluation metric, allowing us to quantify how useful a retrieved result is, taking into account the rank at which it is retrieved. We can calculate it as follows:

Image by writer

Here, the numerator rel_i is the graded relevance of the retrieved result i; essentially, it is a quantification of how relevant the retrieved text chunk is. The denominator of the formula is the log of the rank of result i. Essentially, this allows us to penalize items that appear further down the retrieved set, emphasizing the idea that results appearing at the top are more important. Thus, the more relevant a result is, the higher the score, but the lower it is ranked, the lower the score.
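Written out, and matching the Python implementation we will use later in this post, the formula is:

DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}

where rel_i is the graded relevance of the result at rank i.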

Let's further explore this with a simple example:

Image by author

In any case, a major issue with DCG@k is that, as you can see, it is essentially a sum over all the retrieved items. Thus, a retrieved set with more items (a larger k) and/or more relevant items is inevitably going to result in a larger DCG@k. For instance, if in our example we only consider k = 4, we would end up with DCG@4 = 28.19. Similarly, DCG@6 would be even higher, and so on. As k increases, DCG@k typically increases, since we include more results, unless the additional items have zero relevance. However, this does not necessarily mean that the retrieval performance is superior. On the contrary, this causes a problem, because it does not allow us to compare retrieved sets with different k values based on DCG@k.

This issue is effectively solved by the next graded measure we are going to discuss later on in this post: NDCG@k. But before that, we need to introduce IDCG@k, which is required for calculating NDCG@k.

🎯 Ideal Discounted Cumulative Gain (IDCG@k)

Ideal Discounted Cumulative Gain (IDCG@k), as its name suggests, is the DCG we would get in the ideal situation where our retrieved set is perfectly ranked based on the retrieved results' relevance. Let's see what the IDCG for our example would be:

Image by author

Evidently, for a given k, IDCG@k is always going to be equal to or larger than any DCG@k, since it represents the score for an ideal retrieval and ranking of results for that k.
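Using the same notation as before, IDCG@k is just DCG@k computed over the relevance scores sorted in descending order:

IDCG@k = \sum_{i=1}^{k} \frac{rel_i^{ideal}}{\log_2(i + 1)}

where rel^{ideal} denotes the relevance scores arranged from highest to lowest.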

Finally, we can now calculate Normalized Discounted Cumulative Gain (NDCG@k), using DCG@k and IDCG@k.

🎯 Normalized Discounted Cumulative Gain (NDCG@k)

Normalized Discounted Cumulative Gain (NDCG@k) is essentially a normalized expression of DCG@k, solving our initial problem and making it comparable across different retrieved set sizes k. We can calculate NDCG@k with this simple formula:

Image by author
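In other words, NDCG@k is simply the ratio of the two quantities we have just defined:

NDCG@k = DCG@k / IDCG@k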

Basically, NDCG@k allows us to quantify how close our current retrieval and ranking is to the ideal one, for a given k. This conveniently provides us with a number that is comparable across different values of k. In our example, NDCG@5 would be:

Image by author

In general, NDCG@k can range from 0 to 1, with 1 representing a perfect retrieval and ranking of the results, and 0 indicating a complete mess.

So, how do we actually calculate DCG and NDCG in Python?

If you have read my other RAG tutorials, you know this is where the full code example would usually come in. However, that example is getting too large to include in every post, so instead I am just going to show you how to calculate DCG and NDCG in Python, doing my best to keep this post at a reasonable length.

To calculate these retrieval metrics, we first need to define a ground truth set, exactly as we did in Part 1 when calculating Precision@K and Recall@K. The difference here is that, instead of characterizing each retrieved chunk as relevant or not using binary relevance (0 or 1), we now assign it a graded relevance score; for instance, from completely irrelevant (0) to super relevant (5). Thus, our ground truth set would include the text chunks that have the highest graded relevance scores for each query.

For instance, for a given query, a retrieved chunk that completely matches the answer might receive a score of 3, one that partially mentions the needed information could get a 2, and a completely unrelated chunk would get a relevance score equal to 0.
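To make this more concrete, here is a minimal sketch of what such graded relevance judgments could look like in Python (the chunk IDs and grades below are purely hypothetical):

# Hypothetical graded relevance judgments for a single query,
# mapping each text chunk ID to a relevance grade (0 = irrelevant, 3 = fully relevant)
graded_ground_truth = {
    "chunk_12": 3,  # completely matches the answer
    "chunk_07": 2,  # partially mentions the needed information
    "chunk_31": 0,  # completely unrelated
}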

Using these graded relevance lists for a retrieved result set, we can then calculate DCG@k, IDCG@k, and NDCG@k. We will use Python's math library to handle the logarithmic terms:

import math

First of all, we can define a function for calculating DCG@k as follows:

# DCG@k
def dcg_at_k(relevance, k):
    # Only score the top-k results (or fewer, if fewer were retrieved)
    k = min(k, len(relevance))
    # Positions are 1-based, so the discount at position i is log2(i + 1)
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))

We can also calculate IDCG@k by applying the same logic. Essentially, IDCG@k is DCG@k for an ideal retrieval and ranking; thus, we can easily calculate it by computing DCG@k after sorting the results by descending relevance.

# IDCG@k
def idcg_at_k(relevance, k):
    ideal_relevance = sorted(relevance, reverse=True)
    return dcg_at_k(ideal_relevance, k)

Finally, once we have calculated DCG@k and IDCG@k, we can also easily calculate NDCG@k as their ratio. More specifically:

# NDCG@k
def ndcg_at_k(relevance, k):
    dcg = dcg_at_k(relevance, k)
    idcg = idcg_at_k(relevance, k)
    return dcg / idcg if idcg > 0 else 0.0

As explained, each of these functions takes as input a list of graded relevance scores for the retrieved chunks. For instance, let's suppose that for a particular query, ground truth set, and retrieved result set, we end up with the following list:

relevance = [3, 2, 3, 0, 1]

Then, we can calculate the graded retrieval metrics using our functions:

print(f"DCG@5: {dcg_at_k(relevance, 5):.4f}")
print(f"IDCG@5: {idcg_at_k(relevance, 5):.4f}")
print(f"NDCG@5: {ndcg_at_k(relevance, 5):.4f}")

And that was that! This is how we get our graded retrieval performance measures for our RAG pipeline in Python.

Finally, similarly to all other retrieval performance metrics, we can also average the scores of a metric across different queries to get a more representative overall score.
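As a minimal sketch of this, assuming we have already collected one graded relevance list per query (the lists below are hypothetical), averaging NDCG@5 could look as follows:

# Hypothetical graded relevance lists, one per query
relevance_per_query = [
    [3, 2, 3, 0, 1],
    [2, 0, 1, 3, 0],
    [0, 1, 0, 2, 2],
]

# Average NDCG@5 across all queries
mean_ndcg = sum(ndcg_at_k(rel, 5) for rel in relevance_per_query) / len(relevance_per_query)
print(f"Mean NDCG@5: {mean_ndcg:.4f}")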

On my mind

Today's post on graded relevance measures concludes my post series about the most commonly used metrics for evaluating the retrieval performance of RAG pipelines. In particular, throughout this post series, we explored binary measures, both order-unaware and order-aware, as well as graded measures, gaining a holistic view of how to approach this. Of course, there are many other things we can look at in order to evaluate the retrieval mechanism of a RAG pipeline, such as latency per query or context tokens sent. Nonetheless, the measures I went over in these posts cover the fundamentals for evaluating retrieval performance.

This allows us to quantify, evaluate, and ultimately improve the performance of the retrieval mechanism, paving the way for building an effective RAG pipeline that produces meaningful answers, grounded in the documents of our choice.

