Why MAP and MRR Fail for Search Ranking (and What to Use Instead)


Search teams often use Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) to evaluate the quality of their rankings. In this post, we discuss why MAP and MRR are poorly aligned with modern user behavior in search ranking. We then look at two metrics that serve as better alternatives to MRR and MAP.

What are MRR and MAP?

Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank (MRR) is the average, across queries, of the reciprocal of the rank at which the first relevant item occurs. For a single query, the Reciprocal Rank is

$$\mathrm{RR} = \frac{1}{\text{rank of first relevant item}}$$

In e-commerce, the first relevant rank may be the rank of the first item clicked in response to a query.

An Amazon search for ‘burr coffee grinder’. Here, we assume the second item is the relevant result.

For the above example, assume the relevant item is the second item. This means:
$$\mathrm{Reciprocal\ Rank} = \frac{1}{2}$$
Reciprocal rank is calculated for every query in the evaluation set. To get a single metric across all queries, we take the mean of the reciprocal ranks, giving the Mean Reciprocal Rank:

$$\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\text{rank of first relevant item for query } i}$$

where \(N\) is the number of queries. From this definition, we can see that MRR focuses on getting one relevant result early. It does not measure what happens after the first relevant result.
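As a concrete illustration, here is a minimal Python sketch of MRR, assuming each query is represented by a list of 0/1 relevance judgments in ranked order (function names are illustrative):

```python
# Minimal MRR sketch. Assumes each query is a list of 0/1 relevance
# judgments in ranked order (1 = relevant, 0 = not relevant).

def reciprocal_rank(relevances):
    """Return 1/rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average the reciprocal ranks across all queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

# The 'burr coffee grinder' example above: the second result is the relevant one.
print(mean_reciprocal_rank([[0, 1, 0, 0]]))  # 0.5
```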

Mean Average Precision (MAP)

Mean Average Precision (MAP) measures how well the system retrieves relevant items and how early they are shown. We start by calculating Average Precision (AP) for every query. We define AP as
$$\mathrm{AP} = \frac{1}{|R|}\sum_{k=1}^{K}\mathrm{Precision@}k \cdot \mathbf{1}[\text{item at rank } k \text{ is relevant}]$$
where \(|R|\) is the number of relevant items for the query and \(K\) is the number of ranked positions considered.
MAP is the average of AP across queries.

The above equation looks like a lot, but it is actually simple. Let's use an example to break it down. Assume a query has 3 relevant items, and our model predicts the following order:

Rank: 1 2 3 4 5 
Item: R N R N R

(R = relevant, N = not relevant)
To compute the AP, we compute Precision@k at each relevant position:

  • @1: Precision = 1/1 = 1.0
  • @3: Precision = 2/3 ≈ 0.667
  • @5: Precision = 3/5 = 0.6

$$\mathrm{AP} = \frac{1}{3}(1.0 + 0.667 + 0.6) \approx 0.756$$
We calculate AP for every query and average the values to get MAP. The AP formula has two essential components:

  • Precision@k: Since we use precision, retrieving relevant items earlier yields higher precision values. If the model ranks relevant items later, Precision@k drops because of the larger k
  • Averaging the precisions: We average the precisions over the total number of relevant items. If the system never retrieves an item, or retrieves it beyond the cutoff, the item contributes nothing to the numerator while still counting in the denominator, which reduces AP and MAP (a minimal code sketch of both follows this list)
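Here is a minimal Python sketch of AP and MAP, assuming binary (0/1) relevance lists in ranked order; the `total_relevant` argument is an illustrative way to handle relevant items that are never retrieved:

```python
# Minimal AP / MAP sketch. Assumes binary (0/1) relevance lists in ranked order.

def average_precision(relevances, total_relevant=None):
    """AP for one query; total_relevant accounts for relevant items never retrieved."""
    if total_relevant is None:
        total_relevant = sum(relevances)
    if total_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # Precision@k at each relevant position
    return precision_sum / total_relevant

def mean_average_precision(queries):
    """Average AP across queries."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Worked example from above: R N R N R  ->  (1.0 + 0.667 + 0.6) / 3 ≈ 0.756
print(round(average_precision([1, 0, 1, 0, 1]), 3))  # 0.756
```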

Why MAP and MRR are Bad for Search Ranking

Now that we have covered the definitions, let's understand why MAP and MRR should not be used for ranking search results.

Relevance is Graded, not Binary

When we compute MRR, we take the rank of the first relevant item. MRR treats all relevant items the same: it makes no difference which relevant item shows up first. In reality, different items tend to have different degrees of relevance.

Similarly, MAP uses binary relevance: we simply look for the next relevant item. Again, MAP makes no distinction between the relevance grades of the items. In real cases, relevance is graded, not binary:

Item     : 1 2 3 
Relevance: 3 1 0

MAP and MRR both ignore how good the relevant item is. They fail to quantify the degree of relevance.

Users Scan Multiple Results

This issue is specific to MRR. In the MRR computation, we record the rank of the first relevant item and ignore everything after it. That may be fine for lookups, QA, and similar tasks, but it is bad for recommendations, product search, and the like.

During search, users don't stop at the first relevant result (except in cases where there is only one correct response). Users scan multiple results, and all of them contribute to overall search relevance.

MAP Overemphasizes Recall

(MAP) computes
$$\mathrm{AP} = \frac{1}{|R|}\sum_{k=1}^{K}\mathrm{Precision@}k \cdot \mathbf{1}[\text{item at rank } k \text{ is relevant}]$$
As a consequence, every relevant item contributes to the score, and missing any relevant item hurts the score. When users search, they are not interested in finding all of the relevant items; they are interested in finding the best few options. Optimizing MAP pushes the model to learn the long tail of relevant items, even when their relevance contribution is low and users never scroll that far. Hence, MAP overemphasizes recall.

MAP Decays Linearly

Consider the example below. We place a single relevant item at three different positions and compute the AP:

Rank   Precision@k     AP
1      1/1 = 1.0       1.0
3      1/3 ≈ 0.33      0.33
30     1/30 ≈ 0.033    0.033
Average Precision across different ranks

AP at rank 30 looks worse than rank 3, which looks worse than rank 1. The AP score decays linearly with rank. In reality, rank 3 vs. rank 30 is more than a 10x difference; it is more like seen vs. not seen.

MAP is position-sensitive, but only weakly: its position sensitivity comes through Precision@k, where the decay with rank is linear. This does not reflect how sharply user attention drops off in search results.

NDCG and ERR Are Better Choices

For ranking search results, NDCG and ERR are better choices. They fix the issues that MRR and MAP suffer from.

Expected Reciprocal Rank (ERR)

Expected Reciprocal Rank (ERR) assumes a cascade user model, in which a user does the following:

  • Scans the list from top to bottom
  • At each rank \(i\),
    • With probability \(R_i\), the user is satisfied and stops
    • With probability \(1 - R_i\), the user continues down the list

ERR computes the expected reciprocal rank of the position at which the user stops, satisfied:
$$\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r} \cdot R_r \cdot \prod_{i=1}^{r-1}(1 - R_i)$$
where \(R_i = \frac{2^{l_i}-1}{2^{l_m}}\), \(l_i\) is the relevance label at rank \(i\), and \(l_m\) is the maximum possible label value.

Let's understand how ERR differs from MRR.

  • ERR uses \(R_i = \frac{2^{l_i}-1}{2^{l_m}}\), which is graded relevance, so a result can partially satisfy a user
  • ERR allows multiple relevant items to contribute. Early high-quality items reduce the contribution of later items

Example 1

We'll take a toy example to understand how ERR and MRR differ.

Rank     : 1 2 3 
Relevance: 2 3 0
  • MRR = 1 (a relevant item is in first place)
  • ERR:
    • \(R_1 = (2^2 - 1)/2^3 = 3/8\)
    • \(R_2 = (2^3 - 1)/2^3 = 7/8\)
    • \(R_3 = (2^0 - 1)/2^3 = 0\)
    • \(\mathrm{ERR} = R_1 + \tfrac{1}{2}(1 - R_1)R_2 + \tfrac{1}{3}(1 - R_1)(1 - R_2)R_3 \approx 0.648\)
  • MRR says the ranking is perfect. ERR says it is not, because the highest-relevance item appears later (see the sketch below)
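A minimal Python sketch of this computation, assuming graded labels in ranked order and a known maximum label, reproduces the value above:

```python
# Minimal ERR sketch. Assumes graded relevance labels in ranked order.

def err(labels, max_label=None):
    """Expected Reciprocal Rank under the cascade model described above."""
    if max_label is None:
        max_label = max(labels)
    p_continue = 1.0  # probability the user has not stopped before this rank
    score = 0.0
    for rank, label in enumerate(labels, start=1):
        r = (2 ** label - 1) / (2 ** max_label)  # satisfaction probability R_i
        score += p_continue * r / rank
        p_continue *= (1.0 - r)
    return score

# Example 1: labels [2, 3, 0] with maximum label 3.
print(round(err([2, 3, 0], max_label=3), 3))  # 0.648
```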

Example 2

Let's take another example to see how a change in rank affects the ERR contribution of an item. We will place a highly relevant item at different positions in a list and compute the ERR contribution of that item. Consider the cases below:

  • Ranking 1: \([8, 4, 4, 4, 4]\)
  • Ranking 2: \([4, 4, 4, 4, 8]\)

Let's compute \(R(l)\) for the two relevance labels:

Relevance l   2^l − 1   R(l)
4             15        0.0586
8             255       0.9961
Computing R(l) for different relevance labels

Using these values to compute the full ERR for both rankings, we get:

  • Ranking 1: ERR ≈ 0.99
  • Ranking 2: ERR ≈ 0.27

If we specifically look at the contribution of the item with relevance label 8, we see that its contribution drops by a factor of 6.36x between the two rankings. If the penalty were linear, the drop would be 5x.

Scenario      Contribution of relevance-8 item
Rank 1        0.9961
Rank 5        0.1565
Drop factor   6.36x
Difference in contribution with change in rank
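Reusing the cascade logic from the `err` sketch above, a small helper (the name `err_contributions` is illustrative) that keeps the per-rank terms reproduces these contributions:

```python
# Per-item ERR contributions (same cascade model as the err sketch above).

def err_contributions(labels, max_label):
    """Return each rank's term of the ERR sum."""
    p_continue, contribs = 1.0, []
    for rank, label in enumerate(labels, start=1):
        r = (2 ** label - 1) / (2 ** max_label)
        contribs.append(p_continue * r / rank)
        p_continue *= (1.0 - r)
    return contribs

print(round(err_contributions([8, 4, 4, 4, 4], 8)[0], 4))   # 0.9961 (relevance-8 item at rank 1)
print(round(err_contributions([4, 4, 4, 4, 8], 8)[-1], 4))  # 0.1565 (relevance-8 item at rank 5)
```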

Normalized Discounted Cumulative Gain (NDCG)

Normalized Discounted Cumulative Gain (NDCG) is another strong choice that is well suited to ranking search results. NDCG is built on two important ideas:

  • Gain: items with higher relevance labels are worth much more
  • Discount: items appearing later are worth much less, since users pay less attention to lower positions

NDCG combines gain and discount into a single score. It also normalizes the score to allow comparison across different queries. Formally, gain and discount are defined as

  • \(\mathrm{Gain} = 2^{l_r}-1\)
  • \(\mathrm{Discount} = \log_2(1+r)\)

where \(l_r\) is the relevance label of the item at position \(r\).

Gain has an exponential form, which rewards higher relevance: items with higher relevance labels contribute much more. The logarithmic discount penalizes relevant items that appear later. Combining the two and summing over the entire ranked list, we get the Discounted Cumulative Gain:

$$\mathrm{DCG@K} = \sum_{r=1}^{K} \frac{2^{l_r}-1}{\log_2(1+r)}$$

for a ranked list with labels \(l_1, l_2, l_3, \dots, l_K\). Computing DCG@K is useful, but relevance labels can vary in scale across queries, which makes comparing raw DCG@K values unfair. So we need a way to normalize the DCG@K values.

We do that by computing IDCG@K, the ideal discounted cumulative gain. IDCG is the maximum possible DCG for the same items, obtained by sorting them by relevance in descending order (the sorted labels are denoted \(l^*_r\)):

$$\mathrm{IDCG@K} = \sum_{r=1}^{K} \frac{2^{l^*_r}-1}{\log_2(1+r)}$$

IDCG represents an ideal ranking. To normalize DCG@K, we compute

$$\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$$

NDCG@K has the following properties:

  • Bounded between 0 and 1
  • Comparable across queries
  • 1 is an ideal ordering

Example: Good vs. Slightly Worse Ordering

In this example, we take two different rankings of the same items and compare their NDCG values. Assume we have items with relevance labels on a 0-3 scale.
Ranking A

Rank     : 1 2 3 
Relevance: 3 2 1

Ranking B

Rank     : 1 2 3 
Relevance: 2 3 1

Computing the NDCG components for both rankings, we get:

Rank   Discount log₂(1+r)   Gain A (2^l − 1)   A contrib   Gain B (2^l − 1)   B contrib
1      1.00                 7                  7.00        3                  3.00
2      1.58                 3                  1.89        7                  4.42
3      2.00                 1                  0.50        1                  0.50
DCG calculations for each term
  • DCG(A) = 9.39
  • DCG(B) = 7.92
  • IDCG = 9.39
  • NDCG(A) = 9.39 / 9.39 = 1.0
  • NDCG(B) = 7.92 / 9.39 = 0.84

Thus, swapping the most relevant item out of rank 1, even down to rank 2, causes a noticeable drop.
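A minimal Python sketch of DCG, IDCG, and NDCG, assuming graded labels in ranked order (function names are illustrative), reproduces the numbers above:

```python
import math

# Minimal NDCG sketch. Assumes graded relevance labels in ranked order.

def dcg(labels, k=None):
    """Discounted cumulative gain over the top-k positions."""
    k = len(labels) if k is None else k
    return sum((2 ** label - 1) / math.log2(1 + rank)
               for rank, label in enumerate(labels[:k], start=1))

def ndcg(labels, k=None):
    """DCG normalized by the ideal DCG (labels sorted in descending order)."""
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(labels, k) / ideal if ideal > 0 else 0.0

print(round(ndcg([3, 2, 1]), 2))  # 1.0  (Ranking A)
print(round(ndcg([2, 3, 1]), 2))  # 0.84 (Ranking B)
```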

NDCG: Additional Discussion

The discount in the denominator of the DCG computation is logarithmic: it grows much more slowly than linearly.

$$\mathrm{discount}(r) = \frac{1}{\log_2(1+r)}$$

Let's see how this compares with a linear decay:

Rank r   Linear 1/r   Logarithmic 1/log₂(1+r)
1        1.00         1.00
2        0.50         0.63
5        0.20         0.39
10       0.10         0.29
50       0.02         0.18
Linear decay vs. logarithmic decay
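The table above can be reproduced with a few lines of Python:

```python
import math

# Compare the linear (1/r) and logarithmic (1/log2(1+r)) discounts at a few ranks.
for r in (1, 2, 5, 10, 50):
    print(r, round(1 / r, 2), round(1 / math.log2(1 + r), 2))
```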

The logarithmic discount penalizes later ranks less aggressively than a linear discount. The difference between rank 1 and rank 2 is large, but the difference between rank 10 and rank 50 is small.

Because of its concave shape, the log discount has a diminishing marginal penalty at later ranks. This prevents NDCG from becoming a purely top-heavy metric where ranks 1-3 dominate the score; a linear penalty would effectively ignore reasonable results lower down.

A logarithmic discount also reflects the fact that user attention drops sharply at the top of the list and then flattens out, instead of decreasing linearly with rank.

Conclusion

MAP and MRR are useful information retrieval metrics, but they are poorly suited to modern search ranking systems. While MAP focuses on finding all relevant documents, MRR treats ranking as a single-position problem. Both MAP and MRR ignore the graded relevance of items in a search and treat relevance as binary: relevant or not relevant.

NDCG and ERR better reflect real user behavior by accounting for multiple positions, allowing items to have non-binary relevance, and giving higher importance to top positions. For search ranking systems, these position-sensitive metrics are not just a better choice; they are necessary.

Further Reading

  • LambdaMART (good explanation)
  • Learning To Rank (highly recommended reading; it is long and thorough, and also the inspiration for this article!)