Imagine that a health authority has collected data on the outcomes of patients who have acquired "Pathogen A", a pathogen responsible for an infectious respiratory illness. Available are 8 features of each patient and the outcome: (a) treated at home and recovered, (b) hospitalized and recovered, or (c) died.
It has proven trivial to train a neural net to predict one of the three outcomes from the 8 features with almost complete accuracy. However, the health authorities would like to predict something that was not captured: From the patients who would be treated at home, who are the ones most at risk of having to go to hospital? And from the patients who are predicted to be hospitalized, who are the ones most at risk of not surviving the infection? Can we get a numeric score that represents how serious the infection will be?
In this note I'll cover a neural net with a bottleneck and a special head to learn a scoring system from a few categories, and cover some properties of small neural networks one is likely to encounter. The accompanying code can be found at https://codeberg.org/csirmaz/category-scoring.
The dataset
To be able to illustrate the work, I developed a toy example: a non-linear but deterministic piece of code calculating the outcome from the 8 features. The calculation is for illustration only; it is not supposed to be faithful to the science, and the names of the features were chosen merely to be consistent with the medical example. The 8 features used in this note are:
- Previous infection with Pathogen A (boolean)
- Previous infection with Pathogen B (boolean)
- Acute / current infection with Pathogen B (boolean)
- Cancer diagnosis (boolean)
- Weight deviation from average, arbitrary unit (-100 ≤ x ≤ 100)
- Age, years (0 ≤ x ≤ 100)
- Blood pressure deviation from average, arbitrary unit (0 ≤ x ≤ 100)
- Years smoked (0 ≤ x ≤ ~88)
When generating sample data, the features are chosen independently and from a uniform distribution, apart from years smoked, which depends on the age, and a cohort of non-smokers (50%) was built in. We checked that with this sampling the three outcomes occur with roughly equal probability, and measured the mean and variance of the number of years smoked so we could normalize all the inputs to zero mean and unit variance.
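For reference, the sampling and normalization could be sketched roughly as follows. This is only an approximation of the repository's data generator: the deterministic outcome calculation is not reproduced here, and the assumption that smoking starts at age 12 at the earliest is mine, chosen to match the ~88-year maximum above.

import numpy as np

rng = np.random.default_rng(0)

def sample_features():
    """Draw one patient's raw (unnormalized) feature vector."""
    prev_a, prev_b, acute_b, cancer = rng.integers(0, 2, size=4)  # boolean features
    weight_dev = rng.uniform(-100, 100)
    age = rng.uniform(0, 100)
    blood_pressure_dev = rng.uniform(0, 100)
    # Half of the patients never smoked; for the rest, smoking started at age 12 at the earliest
    years_smoked = 0.0 if rng.random() < 0.5 else rng.uniform(0, max(age - 12.0, 0.0))
    return np.array([prev_a, prev_b, acute_b, cancer,
                     weight_dev, age, blood_pressure_dev, years_smoked])

# Normalize every input to zero mean and unit variance using measured statistics
raw = np.stack([sample_features() for _ in range(100_000)])
mean, std = raw.mean(axis=0), raw.std(axis=0)
normalized = (raw - mean) / std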
As an illustration of the toy example, below is a plot of the outcomes with the weight deviation on the horizontal axis and age on the vertical axis, and the other parameters fixed. "." stands for home treatment, "o" for hospitalization and "+" for death.
....................
....................
....................
....................
...............ooooo
............oooooooo
............oooooooo
............oooooooo
............oooooooo
............oooooooo
............ooooooo+
...........ooooooo++
...........oooooo+++
...........oooooo+++
...........ooooo++++
.......oooooooo+++++
..oooooooooooo++++++
ooooooooooooo+++++++
oooooooooooo++++++++
ooooooooooo+++++++++
A classic classifier
The data is nonlinear but very neat, so it is no surprise that a small classifier network can learn it to 98-99% validation accuracy. Launch train.py --classifier to train a simple neural network with 6 layers (each 8 wide) and ReLU activations, defined in ScoringModel.build_classifier_model().
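Such a classifier could be sketched roughly as below; the exact architecture is defined in the repository, and details such as the softmax output layer and the cross-entropy loss are my assumptions.

import keras
from keras import layers

def build_small_classifier(num_features=8, num_classes=3):
    """A small fully connected classifier: 6 hidden layers of width 8 with ReLU."""
    inputs = keras.Input(shape=(num_features,))
    x = inputs
    for _ in range(6):
        x = layers.Dense(8, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model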
But how do we train a scoring system?
Our aim, then, is to train a system that, given the 8 features as inputs, produces a score corresponding to the risk the patient is in when infected with Pathogen A. The complication is that we have no scores available in our training data, only the three outcomes (categories). To make sure the scoring system is meaningful, we would like certain score ranges to correspond to the three possible outcomes.
The first thing someone may try is to assign a numeric value to each category, like 0 to home treatment, 1 to hospitalization and 2 to death, and use it as the target. Then set up a neural network with a single output, and train it with e.g. an MSE loss.
The issue with this approach is that the model will learn to contort (condense and expand) the projection of the inputs around the three targets, so ultimately the model will always return a value near 0, 1 or 2. You can try this by running train.py --predict-score, which trains a model with 2 dense layers with ReLU activations and a final dense layer with a single output, defined in ScoringModel.build_predict_score_model().
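A sketch of this kind of model (the actual one is build_predict_score_model() in the repository; the layer widths here are my assumption):

import keras
from keras import layers

def build_naive_score_model(num_features=8):
    """Naive approach: regress directly onto the target values 0, 1 and 2."""
    inputs = keras.Input(shape=(num_features,))
    x = layers.Dense(8, activation="relu")(inputs)
    x = layers.Dense(8, activation="relu")(x)
    score = layers.Dense(1)(x)  # a single linear output trained with MSE
    model = keras.Model(inputs, score)
    model.compile(optimizer="adam", loss="mse")
    return model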
As can be seen in the following histogram of the output of the model on a random batch of inputs, this is indeed what happens, and that is with only 2 hidden layers.
..................................................#.........
..................................................#.........
.........#........................................#.........
.........#........................................#.........
.........#........................................#.........
.........#...................#....................#.........
.........#...................#...................##.........
.........#...................#...................##.........
.........###....#............##.#................##.........
........####.#.##.#..#..##.####.##..........#...###.........
Step 1: A low-capacity network
To prevent this from happening and get a more continuous score, we want to drastically reduce the capacity of the network to contort the inputs. We will go to the extreme and use a linear regression: in a previous TDS article I already described how to use the components offered by Keras to "train" one. We will reuse that concept here and construct a "degenerate" neural network out of a single dense layer with no activation. This will allow the score to move more in line with the inputs, and also has the advantage that the resulting network is highly interpretable, as it simply provides a weight for each input, with the resulting score being their linear combination.
However, with this simplification, the model loses all ability to condense and expand the result to match the target scores for each category. It will attempt to do so, but especially with more output categories, there is no guarantee that they will occur at regular intervals in any linear combination of the inputs.
We would like to enable the model to determine the best thresholds between the categories, that is, to make the thresholds trainable parameters. This is where the "category approximator head" comes in.
Step 2: A category approximator head
In order to be able to train the model using the categories as targets, we add a head that learns to predict the category based on the score. Our aim is simply to establish two thresholds (for our three categories), t0 and t1, such that
- if the score < t0, then we predict treatment at home and recovery,
- if t0 < score < t1, then we predict treatment in hospital and recovery,
- if t1 < score, then we predict that the patient does not survive.
The model takes the form of an encoder-decoder, where the encoder part produces the score, and the decoder part allows comparing and training the score against the categories.
One approach is to add a dense layer on top of the score, with a single input and as many outputs as there are categories. This layer will learn the thresholds and predict the probabilities of each category via softmax. Training can then happen as usual, using a categorical cross-entropy loss.
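Putting the encoder and this head together, a minimal sketch could look like the following. This is not yet the final model: it does not constrain the order of the categories, which is addressed in Step 4.5, and the actual implementation is ScoringModel.build_linear_bottleneck_model() in the repository.

import keras
from keras import layers

def build_scoring_model_v1(num_features=8, num_classes=3):
    inputs = keras.Input(shape=(num_features,))
    # Encoder / bottleneck: the score is a single linear combination of the inputs
    score = layers.Dense(1, name="score")(inputs)
    # Category approximator head: one dense layer from the scalar score to the
    # per-category values, turned into probabilities by softmax
    logits = layers.Dense(num_classes, name="head")(score)
    probs = layers.Activation("softmax")(logits)
    model = keras.Model(inputs, probs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model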
Clearly, the dense layer won't learn the thresholds directly; instead, it will learn N weights and N biases for N output categories. So let's work out how to get the thresholds from these.
Step 3: Extracting the thresholds
Notice that the output of the softmax layer is the vector of probabilities for each category; the predicted category is the one with the highest probability. Moreover, softmax always maps the largest input value to the largest probability. Therefore, the largest output of the dense layer determines the category predicted from the incoming score.
If the dense layer has learnt the weights [w1, w2, w3] and the biases [b1, b2, b3], then its outputs are
o1 = w1*score + b1
o2 = w2*score + b2
o3 = w3*score + b3
These are all just straight lines as a function of the incoming score (e.g. y = w1*x + b1), and whichever is on top at a given score is the winning category.
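As a quick illustration, take some made-up weights and biases (chosen purely for this example, with increasing slopes) and check which line is on top at a selection of scores:

import numpy as np

w = np.array([-1.0, 0.5, 2.0])   # slopes (the weights of the head)
b = np.array([0.0, 1.0, -2.0])   # intercepts (the biases of the head)
labels = ["home", "hospital", "death"]

for score in np.linspace(-4, 4, 9):
    outputs = w * score + b      # o1, o2, o3 at this score
    print(f"score={score:+.1f}  winner={labels[int(np.argmax(outputs))]}")

# The winner moves from "home" through "hospital" to "death" as the score grows.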
The thresholds are then the intersection points between neighboring lines. Assuming the order of categories to be o1 (home) → o2 (hospital) → o3 (death), we need to solve the equations o1 = o2 and o2 = o3, yielding
t0 = (b2 – b1) / (w1 – w2)
t1 = (b3 – b2) / (w2 – w3)
This is implemented in ScoringModel.extract_thresholds() (though there is some additional logic there, explained below).
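The core of that calculation is tiny; here is a sketch, using the made-up weights and biases from the illustration above:

import numpy as np

def thresholds_from_head(weights, biases):
    """Intersection points of neighboring lines, assuming the categories
    are already in the order in which they win as the score increases."""
    return [(biases[i + 1] - biases[i]) / (weights[i] - weights[i + 1])
            for i in range(len(weights) - 1)]

# With the illustrative values from above: thresholds at about -0.67 and 2.0
print(thresholds_from_head(np.array([-1.0, 0.5, 2.0]), np.array([0.0, 1.0, -2.0])))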
Step 4: Ordering the categories
But how do we know what the right order of the categories is? Clearly we have a preferred order (home → hospital → death), but what will the model say?
It is worth noting a few things about the lines that determine which category wins at each score. As we are interested in whichever line is the highest, we are talking about the boundary of the region that lies above all of the lines.
Since this area is the intersection of all the half-planes above each line, it is necessarily convex. (Note that no line can be vertical.) This means that each category wins over exactly one range of scores; it cannot get back to the top again later.
It also means that these ranges necessarily occur in the order of the slopes of the lines, which are the weights. The biases influence the values of the thresholds, but not the order. We first have negative slopes, followed by small and then large positive slopes.
This is because, given any two lines, towards negative infinity the one with the smaller slope (weight) wins, and towards positive infinity, the other one. Algebraically speaking, given two lines
f1(x) = w1*x + b1 and f2(x) = w2*x + b2 where w2 > w1,
we already know they intersect at (b2 – b1) / (w1 – w2), and below this, if x < (b2 – b1) / (w1 – w2), then
(w1 – w2)x > b2 – b1 (w1 – w2 is negative!)
w1*x + b1 > w2*x + b2
f1(x) > f2(x),
and so f1 wins. The same argument holds in the other direction.
Step 4.5: We messed up (propagate-sum)
And here lies a problem: the scoring model is quite free to choose what order to put the categories in. That's not good: a score that predicts death at 0, home treatment at 10, and hospitalization at 20 is clearly nonsensical. However, with certain inputs (especially if one feature dominates a category) this can happen even with very simple scoring models like a linear regression.
There is a way to protect against this, though. Keras allows adding a kernel constraint to a dense layer to force all weights to be non-negative. We could take this code and implement a kernel constraint that forces the weights to be in increasing order (w1 ≤ w2 ≤ w3), but it is easier if we stick with the available tools. Fortunately, Keras tensors support slicing and concatenation, so we can split the outputs of the dense layer into components (say, d1, d2, d3) and use the following as the input into the softmax:
- o1 = d1
- o2 = d1 + d2
- o3 = d1 + d2 + d3
In the code, this is called "propagate sum."
Substituting the weights and biases into the above, we get
- o1 = w1*score + b1
- o2 = (w1+w2)*score + b1+b2
- o3 = (w1+w2+w3)*score + b1+b2+b3
Since w1, w2, w3 are all non-negative, we have now ensured that the effective weights used to pick the winning category are in increasing order.
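A minimal sketch of how this head could be wired up in Keras, using the non-negative kernel constraint plus slicing and concatenation (the repository's actual implementation may differ in its details):

import keras
from keras import layers

def build_bottleneck_scoring_model(num_features=8, num_classes=3):
    inputs = keras.Input(shape=(num_features,))
    # Encoder / bottleneck: the score is a linear combination of the inputs
    score = layers.Dense(1, name="score")(inputs)
    # Head with non-negative weights, so the cumulative sums below have increasing slopes
    d = layers.Dense(num_classes, name="head",
                     kernel_constraint=keras.constraints.NonNeg())(score)
    # "Propagate sum": o1 = d1, o2 = d1 + d2, o3 = d1 + d2 + d3
    parts = [d[:, i:i + 1] for i in range(num_classes)]
    sums = [parts[0]]
    for p in parts[1:]:
        sums.append(sums[-1] + p)
    probs = layers.Activation("softmax")(layers.Concatenate(axis=-1)(sums))
    model = keras.Model(inputs, probs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model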
Step 5: Training and evaluating
All the components are now in place to train the linear regression. The model is implemented in ScoringModel.build_linear_bottleneck_model() and can be trained by running train.py --linear-bottleneck. The code also automatically extracts the thresholds and the weights of the linear combination after each epoch. Note that as a final calculation, we need to shift each threshold by the bias in the encoder layer.
Epoch #4 finished. Logs: {'accuracy': 0.7988250255584717, 'loss': 0.4569114148616791, 'val_accuracy': 0.7993124723434448, 'val_loss': 0.4509878158569336}
----- Evaluating the bottleneck model -----
Prev infection A weight: -0.22322197258472443
Prev infection B weight: -0.1420486718416214
Acute infection B weight: 0.43141448497772217
Cancer diagnosis weight: 0.48094701766967773
Weight deviation weight: 1.1893583536148071
Age weight: 1.4411307573318481
Blood pressure dev weight: 0.8644841313362122
Smoked years weight: 1.1094108819961548
Threshold: -1.754680637036648
Threshold: 0.2920824065597968
The linear regression can approximate the toy example with an accuracy of 80%, which is pretty good. Naturally, the maximum achievable accuracy depends on whether the system to be modeled is close to linear or not. If not, one can consider using a more capable network as the encoder; for example, a few dense layers with nonlinear activations. The network should still not have enough capacity to condense the projected score too much.
It is also worth noting that with the linear combination, the dimensionality of the weight space the training happens in is minuscule compared to regular neural networks (just N, where N is the number of input features, compared to millions, billions or more). There is a frequently described intuition that on high-dimensional error surfaces, true local minima and maxima are very rare: there is almost always a direction in which training can proceed to reduce the loss. That is, most areas of zero gradient are saddle points. We don't have this luxury in our 8-dimensional weight space, and indeed, training can get stuck in local extrema even with optimizers like Adam. Training is extremely fast though, and running multiple training sessions can solve this problem.
To illustrate how the learnt linear model functions, ScoringModel.try_linear_model() tries it on a set of random inputs. In the output, the target and predicted outcomes are noted by their index number (0: treatment at home, 1: hospitalized, 2: death):
Sample #0: target=1 score=-1.18 predicted=1 ok
Sample #1: target=2 score=+4.57 predicted=2 ok
Sample #2: target=0 score=-1.47 predicted=1 x
Sample #3: target=2 score=+0.89 predicted=2 ok
Sample #4: target=0 score=-5.68 predicted=0 ok
Sample #5: target=2 score=+4.01 predicted=2 ok
Sample #6: target=2 score=+1.65 predicted=2 ok
Sample #7: target=2 score=+4.63 predicted=2 ok
Sample #8: target=2 score=+7.33 predicted=2 ok
Sample #9: target=2 score=+0.57 predicted=2 ok
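As a sanity check, the decision rule from Step 2, applied to the thresholds printed earlier, reproduces these predictions:

def category_from_score(score, thresholds):
    """0 = treated at home, 1 = hospitalized, 2 = died."""
    return sum(score >= t for t in thresholds)

thresholds = [-1.754680637036648, 0.2920824065597968]  # as printed by the evaluation
for score in [-5.68, -1.18, 0.57, 4.63]:               # scores taken from the samples above
    print(score, "->", category_from_score(score, thresholds))
# prints 0, 1, 2 and 2, matching the predictions in the corresponding samples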
And ScoringModel.visualize_linear_model() generates a histogram of the score from a batch of random inputs. As above, "." marks home treatment, "o" stands for hospitalization, and "+" for death. For example:
+
+
+
+ +
+ +
. o + + + +
.. .. . o oo ooo o+ + + ++ + + + +
.. .. . o oo ooo o+ + + ++ + + + +
.. .. . . .... . o oo oooooo+ ++ + ++ + + + + + + +
.. .. . . .... . o oo oooooo+ ++ + ++ + + + + + + +
The histogram is spiky due to the boolean inputs, which (before normalization) contribute either 0 or 1 to the linear combination, but the overall histogram is still much smoother than the results we got with the 2-layer neural network above. Many input vectors are mapped to scores that lie near the thresholds between the outcomes, allowing us to predict if a patient is dangerously close to needing hospitalization, or should be admitted to intensive care as a precaution.
Conclusion
Simple models like linear regressions and other low-capacity networks have desirable properties in a number of applications. They are highly interpretable and verifiable by humans: for example, from the results of the toy example above we can clearly see that previous infections protect patients from worse outcomes, and that age is the most important factor in determining the severity of an ongoing infection.
Another property of linear regressions is that their output moves roughly in line with their inputs. It is this feature that we used to obtain a relatively smooth, continuous score from just a few anchor points offered by the limited information available in the training data. Moreover, we did so using well-known network components available in major frameworks, including Keras. Finally, we used a bit of math to extract the information we need from the trainable parameters of the model, and to make sure that the score learnt is meaningful, that is, that it covers the outcomes (categories) in the desired order.
Small, low-capacity models are still powerful tools for solving the right problems. With quick and cheap training, they can also be implemented, tested and iterated on extremely quickly, fitting nicely into agile approaches to development and engineering.