for an Old Challenge
You’re training your model for spam detection. Your dataset has many more positives than negatives, so you invest countless hours of labor to rebalance it to a 50/50 ratio. Now you might be satisfied, since you were able to address the class imbalance. What if I told you that 60/40 might have been not just enough, but even better?
In most machine learning classification applications, the number of instances of one class outnumbers that of the others. This slows down learning [1] and can induce biases in the trained models [2]. The most widely used methods to deal with this rely on a simple prescription: finding a way to give all classes the same weight. Most often, this is done through simple methods such as giving more importance to minority class examples (reweighting), removing majority class examples from the dataset (undersampling), or including minority class instances more than once (oversampling).
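Here is a minimal sketch of the three approaches on a toy dataset, using NumPy and scikit-learn. The class sizes, features, and choice of logistic regression are illustrative assumptions, not taken from the works cited:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 100 minority (class 0) vs. 900 majority (class 1) examples,
# with a class-dependent shift so the problem is learnable.
y = np.array([0] * 100 + [1] * 900)
X = rng.normal(size=(1000, 5)) + y[:, None]

# 1) Reweighting: give minority examples more weight at training time.
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# 2) Undersampling: drop majority examples until both classes have the same size.
idx0 = np.flatnonzero(y == 0)
idx1 = rng.choice(np.flatnonzero(y == 1), size=idx0.size, replace=False)
keep = np.concatenate([idx0, idx1])
clf_under = LogisticRegression().fit(X[keep], y[keep])

# 3) Oversampling: repeat minority examples until both classes have the same size.
extra = rng.choice(idx0, size=(y == 1).sum() - idx0.size, replace=True)
keep = np.concatenate([np.arange(y.size), extra])
clf_over = LogisticRegression().fit(X[keep], y[keep])
```

All three land the model on a 50/50 training distribution; they differ in whether they change the loss (reweighting) or the data itself (under/oversampling).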
The validity of these methods is often debated, with both theoretical and empirical work indicating that which solution works best depends on your specific application [3]. However, there is a hidden hypothesis that is seldom discussed and too often taken for granted: Is rebalancing even a good idea? To some extent, these methods work, so the answer is yes. But should we fully rebalance our datasets? To keep things simple, let us take a binary classification problem. Should we rebalance our training data to have 50% of each class? Intuition says yes, and intuition guided practice until now. In this case, intuition is wrong. For intuitive reasons.
What Do We Mean by ‘Training Imbalance’?
Before we delve into how and why 50% is not the optimal training imbalance in binary classification, let us define some relevant quantities. We call N₀ the number of instances of one class (usually, the minority class), and N₁ those of the other class. This way, the total number of data instances in the training set is N = N₀ + N₁. The quantity we analyze today is the training imbalance,
ρ⁽ᵗʳᵃⁱⁿ⁾ = N₀/N.
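For example, a training set with N₀ = 300 minority and N₁ = 700 majority instances has N = 1,000 and ρ⁽ᵗʳᵃⁱⁿ⁾ = 300/1,000 = 30%, while a perfectly rebalanced set has ρ⁽ᵗʳᵃⁱⁿ⁾ = 50%.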
Evidence That 50% Is Suboptimal
Initial evidence comes from empirical work on random forests. Kamalov and collaborators measured the optimal training imbalance, ρ⁽ᵒᵖᵗ⁾, on 20 datasets [4]. They find that its value varies from problem to problem, but conclude that it is roughly ρ⁽ᵒᵖᵗ⁾ = 43%. This means that, according to their experiments, you want slightly more majority than minority class examples. This is, however, not the full story. If you want to aim for optimal models, don’t stop here and straightaway set your ρ⁽ᵗʳᵃⁱⁿ⁾ to 43%.
In fact, this year, theoretical work by Pezzicoli and collaborators [5] showed that the optimal training imbalance is not a universal value that is valid for all applications. It is not 50%, and it is not 43%. The optimal imbalance varies: it can sometimes be smaller than 50% (as Kamalov and collaborators measured), and other times larger than 50%. The exact value of ρ⁽ᵒᵖᵗ⁾ depends on the details of each specific classification problem. One way to find ρ⁽ᵒᵖᵗ⁾ is to train the model for several values of ρ⁽ᵗʳᵃⁱⁿ⁾ and measure the related performance. This could, for example, look like the following:
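Here is a minimal sketch of such a sweep on a synthetic dataset, using scikit-learn. The dataset, model, metric, and grid of ρ values are illustrative assumptions, not taken from [4] or [5]:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced problem: class 0 is the 20% minority.
X, y = make_classification(n_samples=5000, weights=[0.2, 0.8], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

def subsample_to_imbalance(X, y, rho, rng):
    """Subsample (X, y) so that class 0 makes up a fraction rho of the result."""
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    # Largest sizes n0, n1 achievable with n0 / (n0 + n1) = rho.
    n0 = min(idx0.size, int(rho / (1 - rho) * idx1.size))
    n1 = min(idx1.size, int((1 - rho) / rho * idx0.size))
    keep = np.concatenate([rng.choice(idx0, n0, replace=False),
                           rng.choice(idx1, n1, replace=False)])
    return X[keep], y[keep]

# Sweep the training imbalance; the validation set is left untouched.
results = {}
for rho in [0.2, 0.3, 0.4, 0.5, 0.6]:
    X_sub, y_sub = subsample_to_imbalance(X_train, y_train, rho, rng)
    model = LogisticRegression(max_iter=1000).fit(X_sub, y_sub)
    results[rho] = balanced_accuracy_score(y_val, model.predict(X_val))

rho_opt = max(results, key=results.get)
print(results, rho_opt)
```

The ρ⁽ᵗʳᵃⁱⁿ⁾ with the best validation score is your estimate of ρ⁽ᵒᵖᵗ⁾ for this problem; note that it is measured, not derived from a formula.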
Although the precise patterns determining ρ⁽ᵒᵖᵗ⁾ are still unclear, it seems that when data is abundant compared to the model size, the optimal imbalance is smaller than 50%, as in Kamalov’s experiments. However, many other factors, from how intrinsically rare minority instances are to how noisy the training dynamics is, come together to set the optimal value of the training imbalance, and to determine how much performance is lost when one trains away from ρ⁽ᵒᵖᵗ⁾.
Why Perfect Balance Isn’t Always Best
As we said, the answer is actually intuitive: since different classes have different properties, there is no reason why both classes would carry the same information. In fact, Pezzicoli’s team proved that they generally don’t. Therefore, to infer the best decision boundary we may need more instances of one class than of the other. Pezzicoli’s work, which is set in the context of anomaly detection, provides us with a simple and insightful example.
Let us assume that the data comes from a multivariate Gaussian distribution, and that we label all the points to the right of a decision boundary as anomalies. In 2D, it might look like this:
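A minimal sketch that generates and plots such a dataset is shown below; the boundary position and sample size are illustrative choices, not values from [5]:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 2D standard Gaussian; points to the right of the line x = 1.5 are labeled anomalies.
boundary = 1.5  # illustrative choice
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2), size=2000)
y = (X[:, 0] > boundary).astype(int)  # 1 = anomaly (the minority class)

plt.scatter(X[:, 0], X[:, 1], c=y, s=8, cmap="viridis")  # anomalies in yellow
plt.axvline(boundary, linestyle="--", color="k")         # the decision boundary
plt.show()
```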

The dashed line is our decision boundary, and the points to the right of it are the N₀ anomalies. Let us now rebalance our dataset to ρ⁽ᵗʳᵃⁱⁿ⁾ = 0.5. To do so, we need to find more anomalies. Since the anomalies are rare, the ones we are most likely to find lie close to the decision boundary. Already by eye, the situation is strikingly clear:
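Continuing the sketch above (and reusing its rng, boundary, X, and y), we can rebalance to 50/50 by rejection sampling new anomalies from the same Gaussian; because the density decays away from the boundary, the new points land mostly right next to it:

```python
# Rebalance to 50/50 by sampling extra anomalies from the same Gaussian.
n_extra = (y == 0).sum() - (y == 1).sum()
extra = np.empty((0, 2))
while extra.shape[0] < n_extra:
    batch = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=10_000)
    extra = np.vstack([extra, batch[batch[:, 0] > boundary]])  # keep only anomalies

X_bal = np.vstack([X, extra[:n_extra]])
y_bal = np.concatenate([y, np.ones(n_extra, dtype=int)])

plt.scatter(X_bal[:, 0], X_bal[:, 1], c=y_bal, s=8, cmap="viridis")
plt.axvline(boundary, linestyle="--", color="k")
plt.show()
```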

Anomalies, in yellow, are stacked along the decision boundary, and are therefore more informative about its position than the blue points. This might lead one to think that it is better to privilege minority class points. On the other hand, anomalies only cover one side of the decision boundary, so once one has enough minority class points, it may become convenient to invest in more majority class points, in order to better cover the other side of the decision boundary. As a consequence of these two competing effects, ρ⁽ᵒᵖᵗ⁾ is generally not 50%, and its exact value is problem dependent.
The Root Cause Is Class Asymmetry
Pezzicoli’s theory shows that the optimal imbalance is generally different from 50% because different classes have different properties. However, they only analyze one source of diversity among classes, namely outlier behavior. Yet, as shown for example by Sarao-Mannelli and coauthors [6], there are a number of other effects, such as the presence of subgroups within classes, which can produce a similar effect. It is the concurrence of a very large number of effects determining the diversity among classes that tells us what the optimal imbalance for our specific problem is. Until we have a theory that treats all sources of asymmetry in the data together (including those induced by how the model architecture processes them), we cannot know the optimal training imbalance of a dataset beforehand.
Key Takeaways & What You Can Do Differently
If until now you rebalanced your binary dataset to 50%, you were doing well, but you were most likely not doing the best possible. Although we still do not have a theory that can tell us what the optimal training imbalance should be, you now know that it is likely not 50%. The good news is that such a theory is on the way: machine learning theorists are actively addressing this topic. In the meantime, you can treat ρ⁽ᵗʳᵃⁱⁿ⁾ as a hyperparameter that you can tune beforehand, just like any other hyperparameter, to rebalance your data in the most efficient way. So before your next model training run, ask yourself: is 50/50 really optimal? Try tuning your class imbalance; your model’s performance might surprise you.
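For example, building on the sweep sketch from earlier (and reusing its subsample_to_imbalance helper, data split, imports, and rng, all of which were illustrative assumptions), the imbalance can be tuned jointly with an ordinary hyperparameter such as logistic regression’s regularization strength C:

```python
from itertools import product

# Joint grid over the training imbalance and the regularization strength C.
grid = {}
for rho, C in product([0.3, 0.4, 0.5, 0.6], [0.1, 1.0, 10.0]):
    X_sub, y_sub = subsample_to_imbalance(X_train, y_train, rho, rng)
    model = LogisticRegression(C=C, max_iter=1000).fit(X_sub, y_sub)
    grid[(rho, C)] = balanced_accuracy_score(y_val, model.predict(X_val))

best_rho, best_C = max(grid, key=grid.get)
```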
References
[1] E. Francazi, M. Baity-Jesi, and A. Lucchi, A theoretical analysis of the learning dynamics under class imbalance (2023), ICML 2023
[2] K. Ghosh, C. Bellinger, R. Corizzo, P. Branco, B. Krawczyk, and N. Japkowicz, The class imbalance problem in deep learning (2024), Machine Learning, 113(7), 4845–4901
[3] E. Loffredo, M. Pastore, S. Cocco, and R. Monasson, Restoring balance: principled under/oversampling of data for optimal classification (2024), ICML 2024
[4] F. Kamalov, A.F. Atiya, and D. Elreedy, Partial resampling of imbalanced data (2022)
[5] F.S. Pezzicoli, V. Ros, F.P. Landes, and M. Baity-Jesi, Class imbalance in anomaly detection: Learning from an exactly solvable model (2025), AISTATS 2025
[6] S. Sarao-Mannelli, F. Gerace, N. Rostamzadeh, and L. Saglietti, Bias-inducing geometries: an exactly solvable data model with fairness implications (2022)