Ground truth isn’t perfect. From scientific measurements to the human annotations used to train deep learning models, ground truth always contains some amount of error. ImageNet, arguably the most well-curated image dataset, has 0.3% errors in its human annotations. So how can we evaluate predictive models using such erroneous labels?
In this article, we explore how to account for errors in test-data labels and estimate a model’s “true” accuracy.
Example: image classification
Let’s say there are 100 images, each containing either a cat or a dog. The images are labeled by human annotators who are known to have 96% accuracy (Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ). If we train an image classifier on some of this data and find that it has 90% accuracy on a hold-out set (Aᵐᵒᵈᵉˡ), what is the “true” accuracy of the model (Aᵗʳᵘᵉ)? A couple of observations first:
- Within the 90% of predictions that the model got “right,” some examples may have been incorrectly labeled, meaning both the model and the ground truth are wrong. This artificially inflates the measured accuracy.
- Conversely, within the 10% of “incorrect” predictions, some may actually be cases where the model is right and the ground truth label is wrong. This artificially deflates the measured accuracy.
Given these complications, how much can the true accuracy vary?
Range of true accuracy
The true accuracy of our model depends on how its errors correlate with the errors in the ground truth labels. If our model’s errors perfectly overlap with the ground truth errors (i.e., the model is wrong in exactly the same way as the human labelers), its true accuracy is:
Aᵗʳᵘᵉ = 0.90 − (1 − 0.96) = 86%
Alternatively, if our model is wrong in exactly the opposite way as the human labelers (perfect negative correlation), its true accuracy is:
Aᵗʳᵘᵉ = 0.90 + (1 − 0.96) = 94%
Or more generally:
Aᵗʳᵘᵉ = Aᵐᵒᵈᵉˡ ± (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)
It’s important to note that the model’s true accuracy can be either lower or higher than its reported accuracy, depending on the correlation between model errors and ground truth errors.
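To make these bounds concrete, here is a minimal Python sketch (the function name true_accuracy_bounds is ours, introduced just for illustration):

```python
def true_accuracy_bounds(a_model: float, a_groundtruth: float) -> tuple[float, float]:
    """Worst-case and best-case true accuracy, depending on whether model
    errors perfectly overlap with or perfectly avoid ground truth errors."""
    label_error_rate = 1 - a_groundtruth
    lower = a_model - label_error_rate  # model wrong exactly where labels are wrong
    upper = a_model + label_error_rate  # model right exactly where labels are wrong
    return lower, upper

lower, upper = true_accuracy_bounds(0.90, 0.96)
print(f"{lower:.2f} to {upper:.2f}")  # 0.86 to 0.94
```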
Probabilistic estimate of true accuracy
In some cases, label errors are randomly spread among the examples rather than systematically biased toward certain labels or regions of the feature space. If the model’s errors are independent of the errors in the labels, we can derive a more precise estimate of its true accuracy.
When we measure Aᵐᵒᵈᵉˡ (90%), we are counting cases where the model’s prediction matches the ground truth label. This can happen in two scenarios:
- Both the model and the ground truth are correct. This happens with probability Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ.
- Both the model and the ground truth are wrong (in the same way). This happens with probability (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ).
Under independence, we can express this as:
Aᵐᵒᵈᵉˡ = Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ + (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)
Rearranging the terms, we get:
Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ + Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1) / (2 × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1)
In our example, that equals (0.90 + 0.96 − 1) / (2 × 0.96 − 1) ≈ 93.5%, which falls within the 86% to 94% range we derived above.
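Here is the same estimate as a small Python helper (the name estimate_true_accuracy is ours); it simply applies the rearranged formula and is only meaningful when the label errors are independent of the model errors:

```python
def estimate_true_accuracy(a_model: float, a_groundtruth: float) -> float:
    """Estimate true accuracy assuming model errors are independent of
    ground truth label errors (binary classification)."""
    if a_groundtruth <= 0.5:
        raise ValueError("Ground truth accuracy must exceed 0.5 for this formula")
    return (a_model + a_groundtruth - 1) / (2 * a_groundtruth - 1)

print(f"{estimate_true_accuracy(0.90, 0.96):.1%}")  # 93.5%
```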
The independence paradox
Plugging in Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ = 0.96 from our example, we get Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ − 0.04) / 0.92. Let’s plot this below.

Strange, isn’t it? If we assume that the model’s errors are uncorrelated with the ground truth errors, then its true accuracy Aᵗʳᵘᵉ is always above the 1:1 line whenever the reported accuracy is > 0.5. This holds even as we vary Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ:
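For readers who want to reproduce these curves, here is a minimal matplotlib sketch (our own; the plotting choices are assumptions, not taken from the original figures) that plots Aᵗʳᵘᵉ against Aᵐᵒᵈᵉˡ for a few values of Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ alongside the 1:1 line:

```python
import numpy as np
import matplotlib.pyplot as plt

# Reported (measured) accuracy on the x-axis.
a_model = np.linspace(0.5, 1.0, 200)

# One curve per assumed ground truth accuracy.
for a_gt in [0.90, 0.96, 0.99]:
    a_true = (a_model + a_gt - 1) / (2 * a_gt - 1)
    plt.plot(a_model, a_true, label=f"A_groundtruth = {a_gt}")

plt.plot(a_model, a_model, "k--", label="1:1 line")  # measured == true
plt.xlabel("Reported accuracy (A_model)")
plt.ylabel("Estimated true accuracy (A_true)")
plt.legend()
plt.show()
```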

Error correlation: why models often struggle where humans do
The independence assumption is crucial but often doesn’t hold in practice. If some images of cats are very blurry, or some small dogs look like cats, then the ground truth errors and the model errors are likely to be correlated. This pulls Aᵗʳᵘᵉ closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)) than to the upper bound; the simulation sketch after the list below illustrates the effect.
More generally, model errors tend to be correlated with ground truth errors when:
- Both humans and models struggle with the same “difficult” examples (e.g., ambiguous images, edge cases)
- The model has learned the same biases present in the human labeling process
- Certain classes or examples are inherently ambiguous or difficult for any classifier, human or machine
- The labels themselves are generated by another model
- There are many classes (and thus many different ways of being wrong)
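Below is a small, self-contained simulation sketch (our own construction; the 8% “hard” fraction and other parameters are made up for illustration). It labels a synthetic binary dataset with roughly 4% ground truth errors concentrated on a hard subset, then compares a model whose errors fall on that same subset with one whose errors are spread independently:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_labels = rng.integers(0, 2, n)        # the (unobservable) true classes

# Ground truth labels: ~4% error rate, concentrated on a "hard" 8% of examples.
hard = rng.random(n) < 0.08
gt_wrong = hard & (rng.random(n) < 0.5)    # half of the hard examples get mislabeled
gt_labels = np.where(gt_wrong, 1 - true_labels, true_labels)

def measured_accuracy(model_wrong):
    """Accuracy against the noisy ground truth, given a mask of model errors."""
    preds = np.where(model_wrong, 1 - true_labels, true_labels)
    return (preds == gt_labels).mean()

# Model A: ~10% true error rate, errors independent of the label errors.
wrong_a = rng.random(n) < 0.10
# Model B: ~10% true error rate, errors concentrated on the same hard subset.
wrong_b = hard | (rng.random(n) < 0.02)

for name, wrong in [("independent errors", wrong_a), ("correlated errors", wrong_b)]:
    print(f"{name}: true accuracy = {1 - wrong.mean():.3f}, "
          f"measured accuracy = {measured_accuracy(wrong):.3f}")
```

With correlated errors the measured accuracy comes out higher than the true accuracy (close to the lower-bound relationship Aᵗʳᵘᵉ ≈ Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)), whereas with independent errors it comes out lower, matching the probabilistic estimate above.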
Best practices
The true accuracy of a model can differ significantly from its measured accuracy. Understanding this difference is crucial for proper model evaluation, especially in domains where obtaining perfect ground truth is impossible or prohibitively expensive.
When evaluating model performance with imperfect ground truth:
- Conduct targeted error analysis: Examine examples where the model disagrees with the ground truth to identify potential ground truth errors.
- Consider the correlation between errors: If you suspect correlation between model and ground truth errors, the true accuracy is likely closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)).
- Obtain multiple independent annotations: Having several annotators label the same examples helps estimate the ground truth accuracy more reliably; a rough sketch of this idea follows below.
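As one rough illustration of that last point (a sketch under strong simplifying assumptions: binary labels, equally accurate annotators, and annotator errors independent of each other), the rate at which two annotators agree can be inverted to estimate their accuracy, since they agree either when both are right or when both are wrong:

```python
import math

def annotator_accuracy_from_agreement(pairwise_agreement: float) -> float:
    """Estimate per-annotator accuracy from the observed rate at which two
    annotators give the same binary label, assuming equal accuracy and
    independent errors: agreement p = a^2 + (1 - a)^2, solved for a >= 0.5."""
    if pairwise_agreement < 0.5:
        raise ValueError("Agreement below 0.5 is inconsistent with these assumptions")
    return (1 + math.sqrt(2 * pairwise_agreement - 1)) / 2

# Two annotators who each have 96% accuracy agree about 92.3% of the time.
print(round(annotator_accuracy_from_agreement(0.9232), 3))  # 0.96
```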
Conclusion
In summary, we learned that:
- The range of possible true accuracy depends on the error rate in the ground truth
- When errors are independent, the true accuracy is typically higher than the measured accuracy for models that perform better than random chance
- In real-world scenarios, errors are rarely independent, and the true accuracy is likely closer to the lower bound