Estimating from No Data: Deriving a Continuous Score from Categories


A health authority has collected data on the outcomes of patients who have acquired "Pathogen A", responsible for an infectious respiratory illness. Available are 8 features of each patient and the outcome: (a) treated at home and recovered, (b) hospitalized and recovered, or (c) died.

It has proven trivial to train a neural net to predict one of the three outcomes from the 8 features with near-perfect accuracy. However, the health authorities would like to predict something that was not captured: among the patients who can be treated at home, who are most at risk of having to go to hospital? And among the patients who are predicted to be hospitalized, who are most at risk of not surviving the infection? Can we get a numeric score that represents how serious the infection is going to be?

In this note I use a neural net with a bottleneck and a special head to learn a scoring system from a few categories, and cover some properties of small neural networks one is likely to encounter. The accompanying code can be found at https://codeberg.org/csirmaz/category-scoring.

The dataset

To be able to illustrate the work, I developed a toy example: a non-linear but deterministic piece of code calculating the outcome from the 8 features. The calculation is for illustration only — it is not supposed to be faithful to the science; the names of the features were chosen merely to be consistent with the medical example. The 8 features used in this note are:

  • Previous infection with Pathogen A (boolean)
  • Previous infection with Pathogen B (boolean)
  • Acute / current infection with Pathogen B (boolean)
  • Cancer diagnosis (boolean)
  • Weight deviation from average, arbitrary unit (-100 ≤ x ≤ 100)
  • Age, years (0 ≤ x ≤ 100)
  • Blood pressure deviation from average, arbitrary unit (0 ≤ x ≤ 100)
  • Years smoked (0 ≤ x ≤ ~88)

When generating sample data, the features are chosen independently and from a uniform distribution, except for years smoked, which depends on the age, and a cohort of non-smokers (50%) was built in. We checked that with this sampling the three outcomes occur with roughly equal probability, and measured the mean and variance of the number of years smoked so we could normalize all the inputs to zero mean and unit variance.
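
As a reference, here is a minimal sketch of how such a sample could be generated; the function name, the assumption that smokers start after age 12, and the purely empirical normalization are illustrative and not taken from the repository.

import numpy as np

def sample_patient(rng):
    # Draw one synthetic patient: boolean features are independent coin flips,
    # continuous features are uniform, and years smoked depends on age.
    age = rng.uniform(0, 100)
    # Half the population never smoked; smokers are assumed to start after age 12.
    smoked_years = 0.0 if rng.random() < 0.5 else rng.uniform(0, max(age - 12, 0))
    return [
        float(rng.random() < 0.5),  # previous infection with Pathogen A
        float(rng.random() < 0.5),  # previous infection with Pathogen B
        float(rng.random() < 0.5),  # acute infection with Pathogen B
        float(rng.random() < 0.5),  # cancer diagnosis
        rng.uniform(-100, 100),     # weight deviation
        age,                        # age
        rng.uniform(0, 100),        # blood pressure deviation
        smoked_years,               # years smoked
    ]

rng = np.random.default_rng(0)
features = np.array([sample_patient(rng) for _ in range(10_000)])
# Normalize all inputs to zero mean and unit variance.
features = (features - features.mean(axis=0)) / features.std(axis=0)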

As an illustration of the toy example, below is a plot of the outcomes with the weight deviation on the horizontal axis and age on the vertical axis, with the other parameters fixed. "." stands for home treatment, "o" for hospitalization and "+" for death.

....................
....................
....................
....................
...............ooooo
............oooooooo
............oooooooo
............oooooooo
............oooooooo
............oooooooo
............ooooooo+
...........ooooooo++
...........oooooo+++
...........oooooo+++
...........ooooo++++
.......oooooooo+++++
..oooooooooooo++++++
ooooooooooooo+++++++
oooooooooooo++++++++
ooooooooooo+++++++++

A classic classifier

The data is nonlinear but very neat, so it is no surprise that a small classifier network can learn it to 98-99% validation accuracy. Run train.py --classifier to train a simple neural network with 6 layers (each 8 wide) and ReLU activations, defined in ScoringModel.build_classifier_model().
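
For illustration, a Keras model along these lines could be sketched as follows; this is an approximation of the architecture described above, not the code from the repository, and the loss choice assumes integer class targets.

import keras
from keras import layers

def build_classifier_model(n_features=8, n_classes=3):
    # Six dense layers of width 8 with ReLU activations,
    # followed by a softmax over the three outcomes.
    inputs = keras.Input(shape=(n_features,))
    x = inputs
    for _ in range(6):
        x = layers.Dense(8, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model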

But how to train a scoring system?

Our aim is to train a system that, given the 8 features as inputs, produces a score corresponding to the risk the patient is in when infected with Pathogen A. The complication is that we have no scores available in our training data, only the three outcomes (categories). To make sure the scoring system is meaningful, we would like certain score ranges to correspond to the three possible outcomes.

The first thing one might try is to assign a numeric value to each category, like 0 to home treatment, 1 to hospitalization and 2 to death, and use it as the target. Then set up a neural network with a single output, and train it with e.g. MSE loss.

The problem with this approach is that the model will learn to contort (condense and expand) the projection of the inputs around the three targets, so eventually the model will always return a value near 0, 1 or 2. You can try this by running train.py --predict-score, which trains a model with 2 dense layers with ReLU activations and a final dense layer with a single output, defined in ScoringModel.build_predict_score_model().

First attempt at learning a score (see build_predict_score_model). Image by author
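
A minimal sketch of this kind of model (again an approximation, not the repository code; the hidden layer widths are assumptions):

import keras
from keras import layers

def build_predict_score_model(n_features=8):
    # Regress directly onto the class indices 0, 1, 2 used as numeric targets.
    inputs = keras.Input(shape=(n_features,))
    x = layers.Dense(8, activation="relu")(inputs)
    x = layers.Dense(8, activation="relu")(x)
    score = layers.Dense(1)(x)  # single unconstrained output used as the score
    model = keras.Model(inputs, score)
    model.compile(optimizer="adam", loss="mse")
    return model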

As can be seen in the following histogram of the output of the model on a random batch of inputs, this is indeed what is happening – and that is with only 2 layers.

..................................................#.........
..................................................#.........
.........#........................................#.........
.........#........................................#.........
.........#........................................#.........
.........#...................#....................#.........
.........#...................#...................##.........
.........#...................#...................##.........
.........###....#............##.#................##.........
........####.#.##.#..#..##.####.##..........#...###.........

Step 1: A low-capacity network

To prevent this from happening and get a more continuous score, we want to drastically reduce the capacity of the network to contort the inputs. We will go to the extreme and use a linear regression — in a previous TDS article I already described how to use the components offered by Keras to "train" one. We will reuse that concept here — and construct a "degenerate" neural network out of a single dense layer with no activation. This will allow the score to move more in line with the inputs, and also has the advantage that the resulting network is highly interpretable, as it simply provides a weight for each input, with the resulting score being their linear combination.

However, with this simplification, the model loses all ability to condense and expand its output to match the target scores for each category. It will try to do so, but especially with more output categories, there is no guarantee that they will occur at regular intervals in any linear combination of the inputs.

We want to enable the model to determine the best thresholds between the categories, that is, to make the thresholds trainable parameters. This is where the "category approximator head" comes in.

Step 2: A category approximator head

In order to be able to train the model using the categories as targets, we add a head that learns to predict the category based on the score. Our aim is simply to establish two thresholds (for our three categories), t0 and t1, such that

  • if the score < t0, then we predict treatment at home and recovery,
  • if t0 < score < t1, then we predict treatment in hospital and recovery,
  • if t1 < score, then we predict that the patient does not survive.

The model takes the form of an encoder-decoder, where the encoder part produces the score, and the decoder part allows comparing and training the score against the categories.

Neural network diagram showing a dense layer with a single output, another dense layer expanding this to three outputs and a softmax layer
Second attempt: linear regression and decoder. Image by author

One approach is to add a dense layer on top of the score, with a single input and as many outputs as there are categories. This layer can learn the thresholds and predict the probabilities of each category via softmax. Training can then proceed as usual using a categorical cross-entropy loss.
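
A sketch of this encoder-decoder, before the ordering fix discussed below (the names and the sparse loss variant are assumptions, not the repository code):

import keras
from keras import layers

def build_score_with_head_model(n_features=8, n_classes=3):
    # Encoder: a single dense unit with no activation produces the score.
    inputs = keras.Input(shape=(n_features,))
    score = layers.Dense(1, name="score")(inputs)
    # Category approximator head: one weight and one bias per category,
    # turned into category probabilities by a softmax.
    logits = layers.Dense(n_classes, name="head")(score)
    probs = layers.Softmax()(logits)
    model = keras.Model(inputs, probs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # integer class targets
                  metrics=["accuracy"])
    return model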

Of course, the dense layer will not learn the thresholds directly; instead, it will learn N weights and N biases given N output categories. So let's work out how to get the thresholds from these.

Step 3: Extracting the thresholds

Notice that the output of the softmax layer is the vector of probabilities for each category; the predicted category is the one with the highest probability. Moreover, softmax always maps the largest input value to the largest probability. Therefore, the largest output of the dense layer corresponds to the category it predicts based on the incoming score.

If the dense layer has learnt the weights [w1, w2, w3] and the biases [b1, b2, b3], then its outputs are

o1 = w1*score + b1
o2 = w2*score + b2
o3 = w3*score + b3

These are all just straight lines as a function of the incoming score (e.g. y = w1*x + b1), and whichever is on top at a given score is the winning category. Here is a quick illustration:

2D chart showing three lines coloured according to which is the largest at a given x
Three linear functions mapping the single score to the raw likelihood of each category. Image by author

The thresholds are then the intersection points between neighboring lines. Assuming the order of categories to be o1 (home) → o2 (hospital) → o3 (death), we need to solve the equations o1 = o2 and o2 = o3, yielding

t0 = (b2 – b1) / (w1 – w2)
t1 = (b3 – b2) / (w2 – w3)

This is implemented in ScoringModel.extract_thresholds() (though there is some additional logic there, explained below).
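
A sketch of that calculation, assuming the head is a Dense layer with one input (the score) and three outputs, and that its lines are already in the home → hospital → death order; the real extract_thresholds() contains additional logic:

def extract_thresholds(head_layer):
    # head_layer.get_weights() returns the kernel of shape (1, 3) and the bias of shape (3,).
    kernel, bias = head_layer.get_weights()
    w = kernel.flatten()                         # [w1, w2, w3]
    t0 = (bias[1] - bias[0]) / (w[0] - w[1])     # where the home and hospital lines cross
    t1 = (bias[2] - bias[1]) / (w[1] - w[2])     # where the hospital and death lines cross
    return t0, t1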

Step 4: Ordering the categories

But how do we know the right order of the categories? Clearly we have a preferred order (home → hospital → death), but what will the model say?

It is worth noting a couple of things about the lines that determine which category wins at each score. As we are interested in whichever line is the highest, we are talking about the boundary of the region that is above all the lines:

2D chart showing three lines coloured according to which is the largest at a given x
The winning (largest) line segments are the boundaries of the highlighted convex region. Image by author

Since this area is the intersection of all half-planes that are above each line, it is necessarily convex. (Note that no line can be vertical.) This means that each category wins over a single contiguous range of scores; it cannot get back to the top again later.

It also means that these ranges necessarily follow the order of the slopes of the lines, which are the weights. The biases influence the values of the thresholds, but not the order. We first have negative slopes, followed by small and then large positive slopes.

This is because, given any two lines, towards negative infinity the one with the smaller slope (weight) will win, and towards positive infinity, the other. Algebraically speaking, given two lines

f1(x) = w1*x + b1 and f2(x) = w2*x + b2 where w2 > w1,

we already know they intersect at x = (b2 – b1) / (w1 – w2), and below this, if x < (b2 – b1) / (w1 – w2), then
(w1 – w2)*x > b2 – b1   (the inequality flips because w1 – w2 is negative)
w1*x + b1 > w2*x + b2
f1(x) > f2(x),
and so f1 wins. The same argument holds in the other direction.

Step 4.5: We messed up (propagate-sum)

And here lies a problem: the scoring model is quite free to choose what order to put the categories in. That is not good: a score that predicts death at 0, home treatment at 10, and hospitalization at 20 is clearly nonsensical. However, with certain inputs (especially if one feature dominates a category) this can happen even with very simple scoring models like a linear regression.

There is a way to protect against this though. Keras allows adding a kernel constraint to a dense layer to force all weights to be non-negative. We could take this code and implement a kernel constraint that forces the weights to be in increasing order (w1 ≤ w2 ≤ w3), but it is easier if we stick with the available tools. Fortunately, Keras tensors support slicing and concatenation, so we can split the outputs of the dense layer into components (say, d1, d2, d3) and use the following as the input to the softmax:

  • o1 = d1
  • o2 = d1 + d2
  • o3 = d1 + d2 + d3

In the code, this is called "propagate sum."

Neural network diagram showing two dense layers in an encoder-decoder relationship followed by propagate-sum and softmax operations
Final model: linear regression and a category approximator head enforcing increasing order of weights (see build_linear_bottleneck_model). Image by author

Substituting the weights and biases into the above, we get

  • o1 = w1*score + b1
  • o2 = (w1+w2)*score + b1+b2
  • o3 = (w1+w2+w3)*score + b1+b2+b3

Since w1, w2, w3 are all non-negative, we have now ensured that the effective weights used to pick the winning category are in increasing order.
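
Putting the pieces together, here is a sketch of the full model, with a non-negative kernel constraint on the head and the propagate-sum built from slicing and addition (an approximation of build_linear_bottleneck_model, not the repository code):

import keras
from keras import layers

def build_linear_bottleneck_model(n_features=8, n_classes=3):
    inputs = keras.Input(shape=(n_features,))
    # Encoder: the linear bottleneck producing the score.
    score = layers.Dense(1, name="score")(inputs)
    # Head with non-negative weights d1, d2, d3 (the biases remain unconstrained).
    d = layers.Dense(n_classes, name="head",
                     kernel_constraint=keras.constraints.NonNeg())(score)
    # Propagate sum: o1 = d1, o2 = d1 + d2, o3 = d1 + d2 + d3,
    # so the effective slopes are non-decreasing.
    o1 = d[:, 0:1]
    o2 = o1 + d[:, 1:2]
    o3 = o2 + d[:, 2:3]
    probs = layers.Softmax()(layers.Concatenate(axis=-1)([o1, o2, o3]))
    model = keras.Model(inputs, probs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

Note that with this construction the lines from Step 3 are the summed ones (o1, o2, o3), so the threshold formulas should be applied to the cumulative weights and biases rather than to the raw head parameters — presumably the additional logic mentioned in Step 3.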

Step 5: Training and evaluating

All the components are now in place to train the linear regression. The model is implemented in ScoringModel.build_linear_bottleneck_model() and can be trained by running train.py --linear-bottleneck. The code also automatically extracts the thresholds and the weights of the linear combination after each epoch. Note that as a final step, we need to shift each threshold by the bias in the encoder layer.

Epoch #4 finished. Logs: {'accuracy': 0.7988250255584717, 'loss': 0.4569114148616791, 'val_accuracy': 0.7993124723434448, 'val_loss': 0.4509878158569336}
----- Evaluating the bottleneck model -----
Prev infection A   weight: -0.22322197258472443
Prev infection B   weight: -0.1420486718416214
Acute infection B  weight: 0.43141448497772217
Cancer diagnosis   weight: 0.48094701766967773
Weight deviation   weight: 1.1893583536148071
Age                weight: 1.4411307573318481
Blood pressure dev weight: 0.8644841313362122
Smoked years       weight: 1.1094108819961548
Threshold: -1.754680637036648
Threshold: 0.2920824065597968

The linear regression can approximate the toy example with an accuracy of 80%, which is pretty good. Naturally, the maximum achievable accuracy depends on whether the system to be modeled is close to linear or not. If not, one can consider using a more capable network as the encoder; for example, a few dense layers with nonlinear activations. The network should still not have enough capacity to condense the projected score too much.
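
A sketch of such a variant encoder, which would replace the single dense unit in the model above (the widths and depth are assumptions):

from keras import layers

def nonlinear_score_encoder(inputs, width=8, depth=2):
    # A slightly more capable encoder: a few small ReLU layers
    # in front of the single-output bottleneck that produces the score.
    x = inputs
    for _ in range(depth):
        x = layers.Dense(width, activation="relu")(x)
    return layers.Dense(1, name="score")(x)

With a nonlinear encoder, the per-feature weights are no longer directly readable, so some interpretability is traded for accuracy.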

It is also worth noting that with the linear combination, the dimensionality of the weight space the training happens in is minuscule compared to regular neural networks (just N, where N is the number of input features, compared to millions, billions or more). There is a frequently described intuition that on high-dimensional error surfaces, real local minima and maxima are very rare – there is almost always a direction in which training can proceed to reduce the loss. That is, most areas of zero gradient are saddle points. We do not have this luxury in our 8-dimensional weight space, and indeed, training can get stuck in local extrema even with optimizers like Adam. Training is extremely fast though, and running multiple training sessions can solve this problem.

To illustrate how the learnt linear model behaves, ScoringModel.try_linear_model() tries it on a set of random inputs. In the output, the target and predicted outcomes are noted by their index number (0: treatment at home, 1: hospitalized, 2: death):

Sample #0: target=1 score=-1.18 predicted=1 ok
Sample #1: target=2 score=+4.57 predicted=2 ok
Sample #2: target=0 score=-1.47 predicted=1 x
Sample #3: target=2 score=+0.89 predicted=2 ok
Sample #4: target=0 score=-5.68 predicted=0 ok
Sample #5: target=2 score=+4.01 predicted=2 ok
Sample #6: target=2 score=+1.65 predicted=2 ok
Sample #7: target=2 score=+4.63 predicted=2 ok
Sample #8: target=2 score=+7.33 predicted=2 ok
Sample #9: target=2 score=+0.57 predicted=2 ok

And ScoringModel.visualize_linear_model() generates a histogram of the score from a batch of random inputs. As above, "." denotes home treatment, "o" hospitalization, and "+" death. For example:

                                     +                       
                                     +                       
                                     +                       
                                     +  +                    
                                     +  +                    
                 .    o              +  +      +    +        
..          ..   . o oo ooo  o+ +  + ++ +      + +  +        
..          ..   . o oo ooo  o+ +  + ++ +      + +  +        
.. .. .   . .... . o oo oooooo+ ++ + ++ + +    + +  +    +  +
.. .. .   . .... . o oo oooooo+ ++ + ++ + +    + +  +    +  +

The histogram is spiky due to the boolean inputs, which (before normalization) contribute either 0 or 1 to the linear combination, but the overall histogram is still much smoother than the results we got with the 2-layer neural network above. Many input vectors are mapped to scores that lie near the thresholds between the outcomes, allowing us to predict whether a patient is dangerously close to needing hospitalization, or should be admitted to intensive care as a precaution.

Conclusion

Simple models like linear regressions and other low-capacity networks have desirable properties in a number of applications. They are highly interpretable and verifiable by humans – for example, from the results of the toy example above we can clearly see that previous infections protect patients from worse outcomes, and that age is the most important factor in determining the severity of an ongoing infection.

Another property of linear regressions is that their output moves roughly in line with their inputs. It is this feature that we used to obtain a relatively smooth, continuous score from just a few anchor points offered by the limited information available in the training data. Furthermore, we did so using well-known network components available in major frameworks including Keras. Finally, we used a little bit of math to extract the information we need from the trainable parameters of the model, and to make sure that the score learnt is meaningful, that is, that it covers the outcomes (categories) in the desired order.

Small, low-capacity models are still powerful tools for solving the right problems. With quick and cheap training, they can also be implemented, tested and iterated on extremely quickly, fitting nicely into agile approaches to development and engineering.
