The first thing to note is that, despite there being no explicit regularisation, the decision boundaries are relatively smooth. For instance, in the top left there happened to be a bit of sparse sampling (by chance) yet both models prefer to chop off one tip of the star rather than predicting a more complex shape around the individual points. This is an important reminder that many architectural decisions act as implicit regularisers.
From our analysis we might expect focal loss to predict complicated boundaries in areas of natural complexity. Ideally, this would be an advantage of using the focal loss. But if we inspect one of the areas of natural complexity we see that both models fail to identify that there is an additional shape inside the circles.
In regions of sparse data (dead zones) we might expect focal loss to create more complex boundaries. This isn't necessarily desirable. If the model hasn't learned any of the underlying patterns of the data then there are infinitely many ways to draw a boundary around sparse points. Here we can contrast two sparse areas and notice that focal loss has predicted a more complex boundary than the cross entropy:
The top row is from the central star and we can see that the focal loss has learned more about the pattern. The predicted boundary in the sparse region is more complex but also more accurate. The bottom row is from the lower right corner and we can see that the predicted boundary is more complicated but it hasn't learned a pattern about the shape. The smooth boundary predicted by BCE might be more desirable than the strange shape predicted by focal loss.
This qualitative analysis doesn't help us determine which one is better. How can we quantify it? The two loss functions produce different values that can't be compared directly. Instead we're going to compare the accuracy of predictions. We'll use a standard F1 score but note that different risk profiles might prefer extra weight on recall or precision.
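For reference, here's a minimal PyTorch sketch of the binary focal loss from Lin et al. [13] that can be dropped in next to BCE. The γ and α values below are just the paper's common defaults, not necessarily the settings used in these experiments.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss (Lin et al. 2017): down-weights easy, well-classified examples.

    With gamma=0 and alpha=0.5 this reduces (up to a constant factor) to
    standard binary cross entropy.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weighting term
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```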
To evaluate generalisation capability we use a validation set that is iid with our training sample. We also use early stopping to prevent both approaches from overfitting. If we compare the two models on the validation set we see a slight boost in F1 score using focal loss vs binary cross entropy:
- BCE Loss: 0.936 (Validation F1)
- Focal Loss: 0.954 (Validation F1)
So it seems that the model trained with focal loss performs slightly better when applied to unseen data. So far, so good, right?
The difficulty with iid generalisation
In the standard definition of generalisation, future observations are assumed to be iid with our training distribution. But this won't help if we want our model to learn an effective representation of the underlying process that generated the data. In this example that process involves the shapes and the symmetries that determine the decision boundary. If our model has an internal representation of those shapes and symmetries then it should perform equally well in those sparsely sampled "dead zones".
Neither model will ever work OOD because they've only seen data from one distribution and cannot generalise beyond it. And it would be unfair to expect otherwise. However, we can focus on robustness in the sparsely sampled regions. In the paper Machine Learning Robustness: A Primer, they mostly talk about samples from the tail of the distribution, which is something we saw in our house price models. But here we have a situation where sampling is sparse but it has nothing to do with an explicit "tail". I'll continue to refer to this as an "endogenous sampling bias" to highlight that tails are not explicitly required for sparsity.
In this view of robustness the endogenous sampling bias is one way in which models may fail to generalise. For more powerful models we could also explore OOD and adversarial data. Consider an image model which is trained to recognise objects in urban areas but fails to work in a jungle. That would be a situation where we might expect a sufficiently powerful model to work OOD. Adversarial examples, on the other hand, would involve adding noise to an image to alter the statistical distribution of colours in a way that is imperceptible to humans but causes misclassification by a non-robust model. Building models that resist adversarial and OOD perturbations is out of scope for this already long article.
Robustness to perturbation
So how do we quantify this robustness? We start with an accuracy function A (we previously used the F1 score). Then we consider a perturbation function φ which we can apply to individual points or to an entire dataset. Note that this perturbation function should preserve the relationship between predictor x and target y (i.e. we are not purposely mislabelling examples).
Consider a model designed to predict house prices in any city: an OOD perturbation may involve finding samples from cities not in the training data. In our example we'll focus on a modified version of the dataset which samples exclusively from the sparse regions.
The robustness score (R) of a model (h) is a measure of how well the model performs on a perturbed dataset compared to a clean dataset:
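A natural way to write this, and one consistent with the R(φ) column in the table below, is the ratio of perturbed to clean accuracy:

$$R_{\phi}(h) = \frac{A\big(h,\ \phi(D)\big)}{A\big(h,\ D\big)}$$

where D is the clean evaluation set and φ(D) is its perturbed counterpart.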
Consider the two models trained to predict a decision boundary: one trained with focal loss and one with binary cross entropy. Focal loss performed slightly better on the validation set, which was iid with the training data. Yet we used that dataset for early stopping, so there's some subtle information leakage. Let's compare results on:
- A validation set iid to our training set and used for early stopping.
- A test set iid to our training set.
- A perturbed (φ) test set where we only sample from the sparse regions I’ve called “dead zones”.
| Loss Type | Val (iid) F1 | Test (iid) F1 | Test (φ) F1 | R(φ) |
|------------|---------------|-----------------|-------------|---------|
| BCE Loss | 0.936 | 0.959 | 0.834 | 0.869 |
| Focal Loss | 0.954 | 0.941 | 0.822 | 0.874 |
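As a sketch of how these columns could be computed: the model and split names below are placeholders, assuming a classifier with a scikit-learn-style predict() that returns hard labels.

```python
from sklearn.metrics import f1_score

def robustness_score(model, X_clean, y_clean, X_phi, y_phi):
    """R(phi) = F1 on the perturbed set divided by F1 on the clean test set.

    The perturbed set here is simply the subset of points drawn from the
    sparse "dead zones".
    """
    f1_clean = f1_score(y_clean, model.predict(X_clean))
    f1_phi = f1_score(y_phi, model.predict(X_phi))
    return f1_phi / f1_clean, f1_clean, f1_phi

# e.g. robustness_score(bce_model, X_test, y_test, X_test_phi, y_test_phi)
# would reproduce the Test (iid), Test (phi) and R(phi) columns above.
```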
The standard bias-variance decomposition suggested that we might get more robust results with focal loss by allowing increased complexity on hard examples. We knew that this might not be ideal in all circumstances, so we evaluated on a validation set to confirm. So far so good. But now that we look at the performance on a perturbed test set we can see that focal loss performed slightly worse! Yet we also see that focal loss has a slightly higher robustness score. So what's going on here?
I ran this experiment several times, each time yielding slightly different results. This was one surprising instance I wanted to highlight. The bias-variance decomposition is about how our model will perform in expectation (across different possible worlds). By contrast, this robustness approach tells us how these specific models perform under perturbation. But we may need more considerations for model selection.
There are a number of subtle lessons in these results:
- If we make significant decisions based on our validation set (e.g. early stopping) then it becomes important to have a separate test set.
- Even training on the same dataset we can get varied results. When training neural networks there are multiple sources of randomness to consider, which will become important in the last part of this article.
- A weaker model may be more robust to perturbations. So model selection needs to consider more than just the robustness score.
- We may need to evaluate models on multiple perturbations to make informed decisions.
Comparing approaches to robustness
In one approach to robustness we consider the impact of hyperparameters on model performance through the lens of the bias-variance trade-off. We can use this information to understand how different kinds of training examples affect our training process. For example, we know that mislabelled data is particularly bad to use with focal loss. We can consider whether particularly hard examples could be excluded from our training data to produce more robust models. And we can better understand the role of regularisation by considering the types of hyperparameters and how they impact bias and variance.
The other perspective largely disregards the bias-variance trade-off and focuses on how our model performs on perturbed inputs. For us this meant focusing on sparsely sampled regions, but it could also include out of distribution (OOD) and adversarial data. One drawback of this approach is that it is evaluative and doesn't necessarily tell us how to construct better models, short of training on more (and more varied) data. A more significant drawback is that weaker models may exhibit more robustness, so we can't use the robustness score alone for model selection.
Regularisation and robustness
If we take the standard model trained with cross entropy loss we can plot its performance on different metrics over time: training loss, validation loss, validation_φ loss, validation accuracy, and validation_φ accuracy. We can compare the training process in the presence of different kinds of regularisation to see how it affects generalisation capability.
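A rough sketch of how these curves could be tracked is below. The data generator is only a stand-in for the shapes dataset (in particular, X_val_phi would really be drawn from the dead zones), and the architecture and hyperparameters are illustrative rather than the ones used for the plots.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_split(n):
    # Placeholder data: in the article this would be the 2D shapes dataset,
    # with the phi split drawn only from the sparse "dead zones".
    X = torch.rand(n, 2) * 2 - 1
    y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 0.5).float().unsqueeze(1)
    return X, y

X_tr, y_tr = make_split(512)
X_val, y_val = make_split(256)
X_val_phi, y_val_phi = make_split(256)   # stand-in for the perturbed (dead-zone) split

def build_model(p_dropout=0.0):
    return nn.Sequential(
        nn.Linear(2, 64), nn.ReLU(), nn.Dropout(p_dropout),
        nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p_dropout),
        nn.Linear(64, 1),
    )

def run(p_dropout=0.0, weight_decay=0.0, epochs=200):
    model = build_model(p_dropout)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    bce = nn.BCEWithLogitsLoss()
    history = []
    for _ in range(epochs):
        model.train()
        opt.zero_grad()
        loss = bce(model(X_tr), y_tr)
        loss.backward()
        opt.step()

        model.eval()
        with torch.no_grad():
            logits_val, logits_phi = model(X_val), model(X_val_phi)
            history.append({
                "train_loss": loss.item(),
                "val_loss": bce(logits_val, y_val).item(),
                "val_phi_loss": bce(logits_phi, y_val_phi).item(),
                "val_acc": ((logits_val > 0).float() == y_val).float().mean().item(),
                "val_phi_acc": ((logits_phi > 0).float() == y_val_phi).float().mean().item(),
            })
    return history

# Compare training curves with and without regularisation.
for name, kwargs in [("none", {}), ("dropout", {"p_dropout": 0.3}),
                     ("dropout+wd", {"p_dropout": 0.3, "weight_decay": 1e-3})]:
    hist = run(**kwargs)
    print(name, {k: round(v, 3) for k, v in hist[-1].items()})
```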
On this particular problem we can make some unusual observations:
- As we might expect, without regularisation the validation loss starts to increase as the training loss tends towards 0.
- The validation_φ loss increases far more significantly because it only contains examples from the sparse "dead zones".
- But the validation accuracy doesn't actually get worse as the validation loss increases. What's going on here? This is something I've actually seen in real datasets. The model's accuracy improves but it also becomes increasingly confident in its outputs, so when it's wrong the loss is quite high. Using the model's probabilities becomes misleading as they all tend towards 99.99% regardless of how well the model does (see the short sketch after this list).
- Adding regularisation prevents the validation losses from blowing up because the training loss cannot go to 0. However, it can also negatively impact the validation accuracy.
- Adding dropout and weight decay is better than dropout alone, but both are worse than using no regularisation in terms of accuracy.
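One quick way to check for that kind of over-confidence is to look at the average probability the model assigns to its own predictions. A minimal sketch, assuming a binary classifier that outputs logits:

```python
import torch

@torch.no_grad()
def mean_confidence(model, X):
    """Average probability assigned to the predicted class.

    Values creeping towards 1.0 indicate the over-confidence described above,
    even when accuracy stays flat.
    """
    p = torch.sigmoid(model(X)).squeeze(-1)          # P(class = 1)
    p_predicted = torch.where(p >= 0.5, p, 1 - p)    # probability of whichever class was chosen
    return p_predicted.mean().item()
```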
Reflection
If you've stuck with me this far into the article I hope you've developed an appreciation for the limitations of the bias-variance trade-off. It will always be useful to have an understanding of the typical relationship between model complexity and expected performance. But we've seen some interesting observations that challenge the default assumptions:
- Model complexity can vary in different parts of the feature space. Hence, a single measure of complexity vs bias/variance doesn't always capture the whole story.
- The standard measures of generalisation error don't capture all kinds of generalisation, particularly robustness under perturbation.
- Parts of our training sample can be harder to learn from than others, and there are multiple ways in which a training example can be considered "hard". Complexity might be necessary in naturally complex regions of the feature space but problematic in sparse areas. This sparsity can be driven by endogenous sampling bias, so comparing performance against an iid test set can give false impressions.
- As always we need to consider risk and risk minimisation. If you expect all future inputs to be iid with the training data it would be detrimental to focus on sparse regions or OOD data, especially if tail risks don't carry major consequences. On the other hand, we've seen that tail risks can have unique consequences, so it's important to construct an appropriate test set for your particular problem.
- Simply testing a model's robustness to perturbations isn't sufficient for model selection. A decision about the generalisation capability of a model can only be made under a proper risk assessment.
- The bias-variance trade-off only concerns the expected loss for models averaged over possible worlds. It doesn't necessarily tell us how accurate our model will be when using hard classification boundaries. This can lead to counter-intuitive results.
Let's review some of the assumptions that were key to our bias-variance decomposition:
- At low complexity, the total error is dominated by bias, while at high complexity the total error is dominated by variance, with bias ≫ variance at the minimum complexity.
- As a function of complexity bias is monotonically decreasing and variance is monotonically increasing.
- The complexity function g is differentiable.
It turns out that with sufficiently deep neural networks those first two assumptions are incorrect. And that last assumption may just be a convenient fiction to simplify some calculations. We won't question that one, but we will take a look at the first two.
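For reference, the decomposition these assumptions refer to is the standard squared-error one, written here in my own notation with the expectation taken over training datasets D (the "possible worlds"), f(x) the noise-free target function, and σ² the irreducible noise:

$$\mathbb{E}_{D}\left[\big(y - \hat{h}_{D}(x)\big)^{2}\right] \;=\; \underbrace{\big(\mathbb{E}_{D}[\hat{h}_{D}(x)] - f(x)\big)^{2}}_{\text{bias}^{2}} \;+\; \underbrace{\mathrm{Var}_{D}\big(\hat{h}_{D}(x)\big)}_{\text{variance}} \;+\; \sigma^{2}$$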
Let’s briefly review what it means to overfit:
- A model overfits when it fails to differentiate noise (aleatoric uncertainty) from intrinsic variation. This means that a trained model may behave wildly differently given different training data with different noise (i.e. variance).
- We notice a model has overfit when it fails to generalise to an unseen test set. This typically means performance on test data that is iid with the training data. We may also focus on different measures of robustness and so craft a test set which is OOS, stratified, OOD, or adversarial.
We've so far assumed that the only way to get truly low bias is if a model is overly complex. And we've assumed that this complexity leads to high variance between models trained on different data. We've also established that many hyperparameters contribute to complexity, including the number of epochs of stochastic gradient descent.
Overparameterisation and memorisation
You may have heard that a large neural network can simply memorise the training data. But what does that mean? Given sufficient parameters the model doesn't need to learn the relationships between features and outputs. Instead it can store a function which responds perfectly to the features of every training example completely independently. It would be like writing an explicit if statement for every combination of features and simply producing the average output for those features. Consider our decision boundary dataset where every example is completely separable. That would mean 100% accuracy for everything in the training set.
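To make the analogy concrete, here's a toy sketch of pure memorisation: a lookup table over exact feature combinations. A neural network doesn't literally store a dictionary, but with enough capacity it can emulate this behaviour.

```python
# A toy "memoriser": the "explicit if statement per training example" analogy,
# assuming every training point is separable (no duplicates with conflicting labels).
memory = {}

def fit(X_train, y_train):
    for x, y in zip(X_train, y_train):
        memory[tuple(x)] = y              # store each example independently

def predict(x, default=0):
    # Perfect on the training data, clueless everywhere else.
    return memory.get(tuple(x), default)
```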
If a model has sufficient parameters then the gradient descent algorithm will naturally use all of that space to do such memorisation. Generally it's believed that this is much simpler than finding the underlying relationship between the features and the target values. This is considered to be the case when p ≫ N (the number of trainable parameters is significantly larger than the number of examples).
But there are two situations where a model can learn to generalise despite having memorised the training data:
- Having too few parameters leads to weak models. Adding more parameters leads to a seemingly optimal level of complexity. Continuing to add parameters makes the model perform worse as it starts to fit to noise in the training data. Once the number of parameters exceeds the number of training examples the model may begin to perform better. Once p ≫ N the model reaches another optimal point.
- Train a model until the training and validation losses begin to diverge. The training loss tends towards 0 as the model memorises the training data but the validation loss blows up and reaches a peak. After some (extended) training time the validation loss starts to decrease again.
This is known as the "double descent" phenomenon, where additional complexity actually leads to better generalisation.
Does double descent require mislabelling?
One general consensus is that label noise is sufficient but not necessary for double descent to occur. For example, the paper Unraveling the Enigma of Double Descent found that overparameterised networks will learn to assign the mislabelled class to points in the training data instead of learning to ignore the noise. However, a model may "isolate" these points and learn general features around them. The paper mainly focuses on the learned features within the hidden states of neural networks and shows that the separability of those learned features can make labels effectively noisy even without mislabelling.
The paper Double Descent Demystified describes several necessary conditions for double descent to occur in generalised linear models. These criteria largely focus on variance within the data (as opposed to model variance), which makes it difficult for a model to correctly learn the relationships between predictor and target variables. Any of these conditions can contribute to double descent:
- The presence of small singular values in the training features.
- That the test set distribution is not effectively captured by the features which account for the most variance in the training data.
- A lack of variance for a perfectly fit model (i.e. a perfectly fit model seems to have no aleatoric uncertainty).
This paper also captures the double descent phenomenon for a toy problem with this visualisation:
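If you want to reproduce something in the same spirit as that toy experiment, here is a rough numpy sketch using minimum-norm least squares with an increasing number of features. The sizes and noise level are my own choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 20, 200, 60           # N = 20 training points, 60 total features
beta = rng.normal(size=d)                  # true linear coefficients
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ beta + 0.5 * rng.normal(size=n_train)
y_test = X_test @ beta + 0.5 * rng.normal(size=n_test)

for p in [5, 10, 15, 20, 25, 40, 60]:      # number of features the model is allowed to use
    # lstsq returns the minimum-norm solution when p > n_train (overparameterised)
    coef, *_ = np.linalg.lstsq(X_train[:, :p], y_train, rcond=None)
    test_mse = np.mean((X_test[:, :p] @ coef - y_test) ** 2)
    print(f"p={p:3d}  test MSE={test_mse:8.2f}")   # error typically peaks near p ≈ n_train
```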
By contrast, the paper Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition gives a detailed mathematical breakdown of different sources of noise and their impact on variance:
- Sampling — the general idea that fitting a model to different datasets leads to models with different predictions (V_D).
- Optimisation — the effects of parameter initialisation but potentially also the nature of stochastic gradient descent (V_P).
- Label noise — generally mislabelled examples (V_ϵ).
- The potential interactions between the three sources of variance.
The paper goes on to show that some of these variance terms actually contribute to the total error as part of a model's bias. Additionally, you can condition the expectation calculation first on V_D or V_P, and you reach different conclusions depending on how you do the calculation. A proper decomposition involves understanding how the total variance comes together from interactions between the three sources of variance. The conclusion is that while label noise exacerbates double descent it is not necessary.
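One minimal way to see the conditioning-order point (my own illustration via the law of total variance, not the paper's notation, with the label noise ε suppressed for brevity) is that the variance over data sampling D and optimisation randomness P can be split in either order:

$$\mathrm{Var}(h) = \mathbb{E}_{P}\big[\mathrm{Var}_{D}(h \mid P)\big] + \mathrm{Var}_{P}\big(\mathbb{E}_{D}[h \mid P]\big) = \mathbb{E}_{D}\big[\mathrm{Var}_{P}(h \mid D)\big] + \mathrm{Var}_{D}\big(\mathbb{E}_{P}[h \mid D]\big)$$

Both sides equal the same total, but which term gets labelled "sampling variance" versus "optimisation variance" depends on the order, which is why the paper argues for a symmetric decomposition.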
Regularisation and double descent
Another consensus from these papers is that regularisation may prevent double descent. But as we saw in the previous section, that doesn't necessarily mean the regularised model will generalise better to unseen data. It seems more to be the case that regularisation acts as a floor for the training loss, preventing the model from driving the training loss arbitrarily low. But as we know from the bias-variance trade-off, that could limit complexity and introduce bias to our models.
Reflection
Double descent is an interesting phenomenon that challenges many of the assumptions used throughout this article. We can see that under the right circumstances increasing complexity doesn't necessarily degrade a model's ability to generalise.
Should we think of highly complex models as special cases, or do they call into question the entire bias-variance trade-off? Personally, I think that the core assumptions hold true in most cases and that highly complex models are simply a special case. I think the bias-variance trade-off has other weaknesses but the core assumptions tend to be valid.
The bias-variance trade-off is relatively straightforward when it comes to statistical inference and more typical statistical models. I didn't go into other machine learning methods like decision trees or support vector machines, but much of what we've discussed applies there too. Even in these settings we need to consider more factors than how well our model may perform when averaged over all possible worlds, mainly because we're comparing the performance against future data assumed to be iid with our training set.
Even if our model will only ever see data that looks like our training distribution, we can still face large consequences from tail risks. Most machine learning projects need a proper risk assessment to understand the consequences of mistakes. Instead of evaluating models purely under iid assumptions we should be constructing validation and test sets which fit into an appropriate risk framework.
Additionally, models that are supposed to have general capabilities should be evaluated on OOD data. Models which perform critical functions should be evaluated adversarially. It's also worth mentioning that the bias-variance trade-off isn't necessarily valid in the setting of reinforcement learning. Consider the alignment problem in AI safety, which considers model performance beyond explicitly stated objectives.
We've also seen that in the case of large overparameterised models the standard assumptions about over- and underfitting simply don't hold. The double descent phenomenon is complex and still poorly understood. Yet it holds an important lesson about trusting the validity of strongly held assumptions.
For those who've continued this far I want to make one last connection between different sections of this article. In the section on inferential statistics I explained that Fisher information describes the amount of information a sample can contain about the distribution the sample was drawn from. In various parts of this article I've also mentioned that there are infinitely many ways to draw a decision boundary around sparsely sampled points. There's an interesting question about whether there's enough information in a sample to draw conclusions about sparse regions.
In my article on why scaling works I talk about the concept of an inductive prior. This is something introduced by the training process or the model architecture we've chosen. These inductive priors bias the model towards making certain kinds of inferences. For example, regularisation might encourage the model to make smooth rather than jagged boundaries. With a different kind of inductive prior it's possible for a model to glean more information from a sample than would be possible with weaker priors. For example, there are ways to encourage symmetry, translation invariance, and even detection of repeated patterns. These are normally applied through feature engineering or through architecture decisions like convolutions or the attention mechanism.
I first started putting together the notes for this article over a year ago. I had one experiment where focal loss was vital for getting decent performance from my model. Then I had several experiments in a row where focal loss performed terribly for no apparent reason. I started digging into the bias-variance trade-off, which led me down a rabbit hole. Eventually I learned more about double descent and realised that the bias-variance trade-off had a lot more nuance than I'd previously believed. In that time I read and annotated several papers on the topic and all my notes were just collecting digital dust.
Recently I realised that over the years I've read a lot of terrible articles on the bias-variance trade-off. The idea I felt was missing is that we're calculating an expectation over "possible worlds". That insight may not resonate with everyone but it seems important to me.
I also want to comment on a popular visualisation of bias vs variance which uses archery shots spread around a target. I feel that this visual is misleading because it makes it seem that bias and variance are about the individual predictions of a single model. Yet the mathematics behind the bias-variance error decomposition is clearly about performance averaged across possible worlds. I've purposely avoided that visualisation for this reason.
I'm not sure how many people will make it all the way to the end. I put these notes together long before I started writing about AI and felt that I should put them to good use. I also just needed to get the ideas out of my head and written down. So if you've reached the end, I hope you've found my observations insightful.
[1] “German tank problem,” Wikipedia, Nov. 26, 2021. https://en.wikipedia.org/wiki/German_tank_problem
[2] Wikipedia Contributors, “Minimum-variance unbiased estimator,” Wikipedia, Nov. 09, 2019. https://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator
[3] “Likelihood function,” Wikipedia, Nov. 26, 2020. https://en.wikipedia.org/wiki/Likelihood_function
[4] “Fisher information,” Wikipedia, Nov. 23, 2023. https://en.wikipedia.org/wiki/Fisher_information
[5] “Why is using squared error the standard when absolute error is more relevant to most problems?,” Cross Validated, Jun. 05, 2020. https://stats.stackexchange.com/questions/470626/w (accessed Nov. 26, 2024).
[6] Wikipedia Contributors, “Bias–variance tradeoff,” Wikipedia, Feb. 04, 2020. https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
[7] B. Efron, “Prediction, Estimation, and Attribution,” International Statistical Review, vol. 88, no. S1, Dec. 2020, doi: https://doi.org/10.1111/insr.12409.
[8] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning. Springer, 2009.
[9] T. Dzekman, “Why scaling works,” Towards Data Science on Medium, 2024. https://medium.com/towards-data-science/why-scalin (accessed Nov. 26, 2024).
[10] H. Braiek and F. Khomh, “Machine Learning Robustness: A Primer,” 2024. Available: https://arxiv.org/pdf/2404.00897
[11] O. Wu, W. Zhu, Y. Deng, H. Zhang, and Q. Hou, “A Mathematical Foundation for Robust Machine Learning based on Bias-Variance Trade-off,” arXiv.org, 2021. https://arxiv.org/abs/2106.05522v4 (accessed Nov. 26, 2024).
[12] “bias_variance_decomp: Bias-variance decomposition for classification and regression losses — mlxtend,” rasbt.github.io. https://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp
[13] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” arXiv:1708.02002 [cs], Feb. 2018, Available: https://arxiv.org/abs/1708.02002
[14] Y. Gu, X. Zheng, and T. Aste, “Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space,” arXiv.org, 2023. https://arxiv.org/abs/2310.13572 (accessed Nov. 26, 2024).
[15] R. Schaeffer et al., “Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle,” arXiv.org, 2023. https://arxiv.org/abs/2303.14151 (accessed Nov. 26, 2024).
[16] B. Adlam and J. Pennington, “Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition,” Neural Information Processing Systems, vol. 33, pp. 11022–11032, Jan. 2020.