To obtain a robust view of performance on the holdout set, I created fifty bootstrapped holdout sets from the original. Running the models associated with each approach across all fifty sets yields a distribution for each performance metric. We can then test whether each approach differs from the baseline in a statistically significant way using the Kolmogorov-Smirnov test.
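The sketch below illustrates this evaluation loop under some assumptions: `baseline_model` and `weighted_model` are placeholders for already-fitted classifiers, the holdout data is held in NumPy arrays, and recall is used as the example metric. It is not the exact pipeline from this analysis, only a minimal illustration of bootstrapping the holdout set and comparing metric distributions with a two-sample KS test.

```python
# Minimal sketch: bootstrap the holdout set, score each model on every
# resample, then compare the resulting metric distributions with a
# two-sample Kolmogorov-Smirnov test. Model and variable names are
# illustrative assumptions, not taken from the original analysis.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import recall_score, roc_auc_score

rng = np.random.default_rng(42)
N_BOOTSTRAPS = 50  # fifty bootstrapped holdout sets, as described above


def bootstrap_metric(model, X_holdout, y_holdout, metric, n_boot=N_BOOTSTRAPS):
    """Evaluate `metric` on `n_boot` bootstrap resamples of the holdout set."""
    n = len(y_holdout)
    scores = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)  # resample with replacement
        y_true = y_holdout[idx]
        if metric is roc_auc_score:
            # AUC needs predicted probabilities rather than hard labels.
            y_pred = model.predict_proba(X_holdout[idx])[:, 1]
        else:
            y_pred = model.predict(X_holdout[idx])
        scores.append(metric(y_true, y_pred))
    return np.array(scores)


# Hypothetical usage with two fitted classifiers and holdout arrays:
# baseline_recall = bootstrap_metric(baseline_model, X_hold, y_hold, recall_score)
# weighted_recall = bootstrap_metric(weighted_model, X_hold, y_hold, recall_score)
#
# Two-sample KS test: do the two recall distributions differ significantly?
# stat, p_value = ks_2samp(baseline_recall, weighted_recall)
# print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```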
The weighted approach marginally underperformed the baseline on both recall and AUC. In addition, the variance of each performance metric appears quite high relative to the other approaches.
The oversampling approach improves model recall relative to the baseline, but results in a drastic deterioration in precision.
The approach performs worse than the baseline across all metrics.
The synthetic approach lifts model recall, albeit at the cost of precision. While the impact on precision remains substantial, the synthetic approach offers a more resilient way to improve recall, with less of a penalty to precision than the oversampling approach. Its robustness is further evidenced by the uplift in AUC-PR.