Ground truth isn’t perfect. From scientific measurements to the human annotations used to train deep learning models, ground truth always contains some amount of error. ImageNet, arguably the most well-curated image dataset, has 0.3% errors in its human annotations. So how can we evaluate predictive models using such erroneous labels?
In this article, we explore how to account for errors in test-data labels and estimate a model’s “true” accuracy.
Example: image classification
Let’s say there are 100 images, each containing either a cat or a dog. The images are labeled by human annotators who are known to have 96% accuracy (Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ). If we train an image classifier on some of this data and find that it has 90% accuracy on a hold-out set (Aᵐᵒᵈᵉˡ), what is the “true” accuracy of the model (Aᵗʳᵘᵉ)? A couple of observations first:
- Within the 90% of predictions that the model got “right,” some examples may have been incorrectly labeled, meaning both the model and the ground truth are wrong. This artificially inflates the measured accuracy.
- Conversely, within the 10% of “incorrect” predictions, some may actually be cases where the model is right and the ground truth label is wrong. This artificially deflates the measured accuracy.
Given these complications, how much can the true accuracy vary?
Range of true accuracy
The true accuracy of our model depends on how its errors correlate with the errors in the ground truth labels. If our model’s errors perfectly overlap with the ground truth errors (i.e., the model is wrong in exactly the same way as the human labelers), its true accuracy is:
Aᵗʳᵘᵉ = 0.90 − (1 − 0.96) = 86%
Alternatively, if our model is wrong in exactly the opposite way as the human labelers (perfect negative correlation), its true accuracy is:
Aᵗʳᵘᵉ = 0.90 + (1 − 0.96) = 94%
Or more generally:
Aᵗʳᵘᵉ = Aᵐᵒᵈᵉˡ ± (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)
It’s important to note that the model’s true accuracy can be either lower or higher than its reported accuracy, depending on the correlation between model errors and ground truth errors.
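To make these bounds concrete, here is a minimal Python sketch (the function name true_accuracy_bounds is ours, introduced just for illustration):

```python
def true_accuracy_bounds(a_model: float, a_groundtruth: float) -> tuple[float, float]:
    """Worst-case and best-case true accuracy, depending on whether model
    errors perfectly overlap with or perfectly avoid ground truth errors."""
    label_error_rate = 1 - a_groundtruth
    lower = a_model - label_error_rate  # model wrong exactly where labels are wrong
    upper = a_model + label_error_rate  # model right exactly where labels are wrong
    return lower, upper

lower, upper = true_accuracy_bounds(0.90, 0.96)
print(f"{lower:.2f} to {upper:.2f}")  # 0.86 to 0.94
```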
Probabilistic estimate of true accuracy
In some cases, label errors are randomly spread among the examples rather than systematically biased toward certain labels or regions of the feature space. If the model’s errors are independent of the errors in the labels, we can derive a more precise estimate of its true accuracy.
When we measure Aᵐᵒᵈᵉˡ (90%), we are counting cases where the model’s prediction matches the ground truth label. This can happen in two scenarios:
- Both the model and the ground truth are correct. This happens with probability Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ.
- Both the model and the ground truth are wrong (in the same way). This happens with probability (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ).
Under independence, we can express this as:
Aᵐᵒᵈᵉˡ = Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ + (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)
Rearranging the terms, we get:
Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ + Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1) / (2 × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1)
In our example, that equals (0.90 + 0.96 − 1) / (2 × 0.96 − 1) ≈ 93.5%, which falls within the 86% to 94% range we derived above.
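Here is the same estimate as a small Python helper (the name estimate_true_accuracy is ours); it simply applies the rearranged formula and is only meaningful when the label errors are independent of the model errors:

```python
def estimate_true_accuracy(a_model: float, a_groundtruth: float) -> float:
    """Estimate true accuracy assuming model errors are independent of
    ground truth label errors (binary classification)."""
    if a_groundtruth <= 0.5:
        raise ValueError("Ground truth accuracy must exceed 0.5 for this formula")
    return (a_model + a_groundtruth - 1) / (2 * a_groundtruth - 1)

print(f"{estimate_true_accuracy(0.90, 0.96):.1%}")  # 93.5%
```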
The independence paradox
Plugging in Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ = 0.96 from our example, we get Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ − 0.04) / 0.92. Let’s plot this below.

Strange, isn’t it? If we assume that the model’s errors are uncorrelated with the ground truth errors, then its true accuracy Aᵗʳᵘᵉ is always above the 1:1 line whenever the reported accuracy is > 0.5. This holds even as we vary Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ:
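For readers who want to reproduce these curves, here is a minimal matplotlib sketch (our own; the plotting choices are assumptions, not taken from the original figures) that plots Aᵗʳᵘᵉ against Aᵐᵒᵈᵉˡ for a few values of Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ alongside the 1:1 line:

```python
import numpy as np
import matplotlib.pyplot as plt

# Reported (measured) accuracy on the x-axis.
a_model = np.linspace(0.5, 1.0, 200)

# One curve per assumed ground truth accuracy.
for a_gt in [0.90, 0.96, 0.99]:
    a_true = (a_model + a_gt - 1) / (2 * a_gt - 1)
    plt.plot(a_model, a_true, label=f"A_groundtruth = {a_gt}")

plt.plot(a_model, a_model, "k--", label="1:1 line")  # measured == true
plt.xlabel("Reported accuracy (A_model)")
plt.ylabel("Estimated true accuracy (A_true)")
plt.legend()
plt.show()
```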

Error correlation: why models often struggle where humans do
The independence assumption is crucial but often doesn’t hold in practice. If some images of cats are very blurry, or some small dogs look like cats, then the ground truth errors and the model errors are likely to be correlated. This pulls Aᵗʳᵘᵉ closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)) than to the upper bound; the simulation sketch after the list below illustrates the effect.
More generally, model errors tend to be correlated with ground truth errors when:
- Both humans and models struggle with the same “difficult” examples (e.g., ambiguous images, edge cases)
- The model has learned the same biases present in the human labeling process
- Certain classes or examples are inherently ambiguous or difficult for any classifier, human or machine
- The labels themselves are generated by another model
- There are many classes (and thus many different ways of being wrong)
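Below is a small, self-contained simulation sketch (our own construction; the 8% “hard” fraction and other parameters are made up for illustration). It labels a synthetic binary dataset with roughly 4% ground truth errors concentrated on a hard subset, then compares a model whose errors fall on that same subset with one whose errors are spread independently:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_labels = rng.integers(0, 2, n)        # the (unobservable) true classes

# Ground truth labels: ~4% error rate, concentrated on a "hard" 8% of examples.
hard = rng.random(n) < 0.08
gt_wrong = hard & (rng.random(n) < 0.5)    # half of the hard examples get mislabeled
gt_labels = np.where(gt_wrong, 1 - true_labels, true_labels)

def measured_accuracy(model_wrong):
    """Accuracy against the noisy ground truth, given a mask of model errors."""
    preds = np.where(model_wrong, 1 - true_labels, true_labels)
    return (preds == gt_labels).mean()

# Model A: ~10% true error rate, errors independent of the label errors.
wrong_a = rng.random(n) < 0.10
# Model B: ~10% true error rate, errors concentrated on the same hard subset.
wrong_b = hard | (rng.random(n) < 0.02)

for name, wrong in [("independent errors", wrong_a), ("correlated errors", wrong_b)]:
    print(f"{name}: true accuracy = {1 - wrong.mean():.3f}, "
          f"measured accuracy = {measured_accuracy(wrong):.3f}")
```

With correlated errors the measured accuracy comes out higher than the true accuracy (close to the lower-bound relationship Aᵗʳᵘᵉ ≈ Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)), whereas with independent errors it comes out lower, matching the probabilistic estimate above.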
Best practices
The true accuracy of a model can differ significantly from its measured accuracy. Understanding this difference is crucial for proper model evaluation, especially in domains where obtaining perfect ground truth is impossible or prohibitively expensive.
When evaluating model performance with imperfect ground truth:
- Conduct targeted error analysis: Examine examples where the model disagrees with the ground truth to identify potential ground truth errors.
- Consider the correlation between errors: If you suspect correlation between model and ground truth errors, the true accuracy is likely closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)).
- Obtain multiple independent annotations: Having several annotators label the same examples helps estimate the ground truth accuracy more reliably; a rough sketch of this idea follows below.
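As one rough illustration of that last point (a sketch under strong simplifying assumptions: binary labels, equally accurate annotators, and annotator errors independent of each other), the rate at which two annotators agree can be inverted to estimate their accuracy, since they agree either when both are right or when both are wrong:

```python
import math

def annotator_accuracy_from_agreement(pairwise_agreement: float) -> float:
    """Estimate per-annotator accuracy from the observed rate at which two
    annotators give the same binary label, assuming equal accuracy and
    independent errors: agreement p = a^2 + (1 - a)^2, solved for a >= 0.5."""
    if pairwise_agreement < 0.5:
        raise ValueError("Agreement below 0.5 is inconsistent with these assumptions")
    return (1 + math.sqrt(2 * pairwise_agreement - 1)) / 2

# Two annotators who each have 96% accuracy agree about 92.3% of the time.
print(round(annotator_accuracy_from_agreement(0.9232), 3))  # 0.96
```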
Conclusion
In summary, we learned that:
- The range of possible true accuracy depends on the error rate in the ground truth
- When errors are independent, the true accuracy is typically higher than the measured accuracy for models that perform better than random chance
- In real-world scenarios, errors are rarely independent, and the true accuracy is likely closer to the lower bound