To obtain a robust view of performance on the holdout set, I created fifty bootstrapped holdout sets from the original. Running the models associated with each approach across all fifty sets yields a distribution for each performance metric. We can then test whether each approach differs from the baseline in a statistically significant way using the Kolmogorov-Smirnov test.
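The sketch below illustrates this evaluation loop under some assumptions: `baseline_model` and `weighted_model` are placeholders for already-fitted classifiers, the holdout data is held in NumPy arrays, and recall is used as the example metric. It is not the exact pipeline from this analysis, only a minimal illustration of bootstrapping the holdout set and comparing metric distributions with a two-sample KS test.

```python
# Minimal sketch: bootstrap the holdout set, score each model on every
# resample, then compare the resulting metric distributions with a
# two-sample Kolmogorov-Smirnov test. Model and variable names are
# illustrative assumptions, not taken from the original analysis.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import recall_score, roc_auc_score

rng = np.random.default_rng(42)
N_BOOTSTRAPS = 50  # fifty bootstrapped holdout sets, as described above


def bootstrap_metric(model, X_holdout, y_holdout, metric, n_boot=N_BOOTSTRAPS):
    """Evaluate `metric` on `n_boot` bootstrap resamples of the holdout set."""
    n = len(y_holdout)
    scores = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)  # resample with replacement
        y_true = y_holdout[idx]
        if metric is roc_auc_score:
            # AUC needs predicted probabilities rather than hard labels.
            y_pred = model.predict_proba(X_holdout[idx])[:, 1]
        else:
            y_pred = model.predict(X_holdout[idx])
        scores.append(metric(y_true, y_pred))
    return np.array(scores)


# Hypothetical usage with two fitted classifiers and holdout arrays:
# baseline_recall = bootstrap_metric(baseline_model, X_hold, y_hold, recall_score)
# weighted_recall = bootstrap_metric(weighted_model, X_hold, y_hold, recall_score)
#
# Two-sample KS test: do the two recall distributions differ significantly?
# stat, p_value = ks_2samp(baseline_recall, weighted_recall)
# print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```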
The weighted approach marginally underperformed the baseline on both recall and AUC. In addition, the variance of each performance metric appears quite high relative to the other approaches.
The oversampling approach improves model recall relative to the baseline, but results in a drastic deterioration in precision.
The approach performs worse than the baseline across all metrics.
The synthetic approach lifts model recall, albeit at the cost of precision. While the impact on precision remains substantial, the synthetic approach offers a more resilient way to improve recall, with less of a penalty to precision than the oversampling approach. Its robustness is further evidenced by the uplift in AUC-PR.