When you run a binary classifier over a population, you get an estimate of the proportion of true positives in that population. This proportion is called the prevalence.
But that estimate is biased, because no classifier is perfect. For instance, if your classifier tells you that you have 20% of positive cases, but its precision is known to be only 50%, you'd expect the true prevalence to be 0.2 × 0.5 = 0.1, i.e. 10%. But that's assuming perfect recall (all true positives are flagged by the classifier). If the recall is lower than 1, then the classifier missed some true positives, so you also have to divide the prevalence estimate by the recall.
This leads to the common formula for getting the true prevalence Pr(y=1) from the positive prediction rate Pr(ŷ=1):

Pr(y=1) = Pr(ŷ=1) × precision / recall
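As a quick sanity check, here's that back-of-the-envelope calculation in R, using the 20% positive rate and 50% precision from the example above (the recall of 0.8 is an assumed value, just for illustration):

```r
# naive correction of the positive prediction rate by precision and recall
p_hat     <- 0.20  # proportion flagged positive by the classifier
precision <- 0.50  # estimated precision
recall    <- 0.80  # assumed recall (illustrative value)

p_hat * precision / recall  # estimated true prevalence: 0.125
```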
But suppose you need to run the classifier more than once. For instance, you might want to run it at regular intervals to detect trends in the prevalence. You can't use this formula anymore, because precision depends on the prevalence. To use the formula above you would need to re-estimate the precision each time (say, with human evaluation), but then you might just as well re-estimate the prevalence itself.
How do we get out of this circular reasoning? It turns out that binary classifiers have other performance metrics (besides precision) that don't depend on the prevalence. These include not only the recall R but also the specificity S, and these metrics can be used to adjust Pr(ŷ=1) to get an unbiased estimate of the true prevalence using this formula (sometimes called prevalence adjustment):

Pr(y=1) = (Pr(ŷ=1) - (1 - S)) / (R - (1 - S))
where:
- Pr(y=1) is the true prevalence
- S is the specificity
- R is the sensitivity or recall
- Pr(ŷ=1) is the observed proportion of positive predictions
The proof is simple. By the law of total probability,

Pr(ŷ=1) = Pr(ŷ=1 | y=1) × Pr(y=1) + Pr(ŷ=1 | y=0) × Pr(y=0)
        = R × Pr(y=1) + (1 - S) × (1 - Pr(y=1))

Solving for Pr(y=1) yields the formula above.
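In code, the adjustment fits in a small helper function. This is just a sketch; `adjust_prevalence` is an illustrative name, not a function from any package:

```r
# prevalence adjustment: Pr(y=1) = (Pr(y_hat=1) - (1 - S)) / (R - (1 - S))
adjust_prevalence <- function(p_hat, recall, specificity) {
  fpr <- 1 - specificity           # false positive rate
  (p_hat - fpr) / (recall - fpr)
}

# e.g. 30% positive predictions with recall 0.7 and specificity 0.9
adjust_prevalence(0.30, recall = 0.7, specificity = 0.9)  # 0.333...
```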
Notice that this formula breaks down when the denominator R - (1 - S) becomes 0, i.e. when the recall equals the false positive rate 1-S. But remember what a typical ROC curve looks like:
An ROC curve like this one plots the recall R (aka the true positive rate) against the false positive rate 1-S, so a classifier for which R = 1-S falls on the diagonal of the ROC diagram. Such a classifier is, essentially, guessing at random: true cases and false cases are equally likely to be flagged positive, so the classifier is completely non-informative, and you can't learn anything from it, certainly not the true prevalence.
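Here's a quick sketch of that degenerate case: a "classifier" that just flips a coin, so that R ≈ 1-S and the denominator R - (1 - S) is close to zero.

```r
set.seed(1)
y     <- runif(10000) < 0.3      # true labels, true prevalence 30%
y_hat <- runif(10000) < 0.5      # predictions made by flipping a coin

recall      <- mean(y_hat[y])    # ~0.5
specificity <- mean(!y_hat[!y])  # ~0.5, so recall - (1 - specificity) ~ 0

# the adjustment divides by a near-zero denominator and returns nonsense
(mean(y_hat) - (1 - specificity)) / (recall - (1 - specificity))
```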
Enough theory, let’s see if this works in practice:
# randomly draw some covariate
x <- runif(10000, -1, 1)
# treat it as the logit, convert to a probability and draw the outcome
p <- plogis(x)
y <- runif(10000) < p
# fit a logistic regression model
m <- glm(y ~ x, family = binomial)
# make some (deliberately poor) predictions: flag a case as positive when
# its predicted probability is below an absurdly low threshold
y_hat <- predict(m, type = "response") < 0.3
# get the recall (aka sensitivity) and specificity
cm <- caret::confusionMatrix(factor(y_hat), factor(y), positive = "TRUE")
recall <- unname(cm$byClass['Sensitivity'])
specificity <- unname(cm$byClass['Specificity'])
# get the adjusted prevalence
(mean(y_hat) - (1 - specificity)) / (recall - (1 - specificity))
# compare with actual prevalence
mean(y)
In this simulation I get recall = 0.049 and specificity = 0.875. The predicted prevalence is a ridiculously biased 0.087, but the adjusted prevalence is essentially equal to the true prevalence (0.498).
To sum up: this shows how, using a classifier's recall and specificity, you can adjust the predicted prevalence in order to track it over time, assuming that recall and specificity are stable over time. You cannot do this with precision and recall, because precision depends on the prevalence, whereas recall and specificity don't. A minimal sketch of what that tracking could look like follows below.
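Here, recall and specificity are assumed to have been estimated once on a labeled sample and to remain stable; all numbers are made up for illustration:

```r
# recall and specificity estimated once, e.g. on a labeled validation set
recall      <- 0.75
specificity <- 0.95

# positive prediction rates observed at successive time points (made-up)
p_hat <- c(t1 = 0.12, t2 = 0.15, t3 = 0.21)

# adjusted prevalence for each period, using the same formula as above
(p_hat - (1 - specificity)) / (recall - (1 - specificity))
```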