Exploring multi-quantile regression with Catboost
How confident can we be in a machine learning model’s prediction? This query has been a outstanding area of research during the last decade, and it has major implications in high-stakes machine learning applications comparable to finance and healthcare. While many classification models, particularly calibrated models, include uncertainty quantification by predicting a probability distribution over goal classes, quantifying uncertainty in regression tasks is rather more nuanced.
Amongst many proposed methods, quantile regression is one of the vital popular because no assumptions are made concerning the goal distribution. Until recently, the foremost drawback of quantile regression was that one model needed to be trained per predicted quantile. As an example, as a way to predict the tenth, fiftieth, and ninetieth quantiles of a goal distribution, three independent models would should be trained. Catboost has since addressed this issue with the multi-quantile loss function — a loss function that allows a single model to predict an arbitrary variety of quantiles.
This text will explore two examples using the multi-quantile loss function on synthetic data. While these examples aren’t necessarily reflective of real-world datasets, they are going to help us understand how well this loss function quantifies uncertainty by predicting the quantiles of a goal distribution. For a fast refresher on noise and uncertainty in machine learning, see this text:
In supervised machine learning, the standard task is to coach a model that predicts the expected value of a goal given a set of input features:
Notably, training a model this fashion produces a single prediction indicating what the model believes is probably the most likely value of the goal given the features. In regression tasks, this will likely be the mean of the goal distribution conditioned on the features.
Nonetheless, as illustrated in a previous article, most machine learning models are trained on noisy datasets, and easily predicting the conditional expected value of the goal doesn’t sufficiently characterize the goal distribution. To see this, consider the next noisy regression dataset:
Despite the fact that the model prediction suits the information optimally, it doesn’t quantify uncertainty. As an example, when x is around 0.4, the road of best fit predicts that y is 3.8. While it’s true that 3.8 is the almost definitely value of y when x is near 0.4, there are many examples in the information where y is far higher or lower than 3.8. Said in a different way, the conditional distribution of the goal has high variability beyond its exception, and quantile regression helps us describe this.
In quantile regression, the loss function is modified to encourage a model to learn the specified quantile of the conditional goal distribution.
To realize a greater intuition about this loss function, suppose a model is being trained to learn the ninety fifth quantile of a goal distribution. In the primary case, consider a single training example where the goal was 45 and the model predicted 60 (i.e. the model overestimated the goal by 15). Assume also that every training example has a weight of 1. The loss function evaluated at these values looks like this:
Now, with the identical goal of 45, assume that the model predicted 30 (i.e. the model underestimated the goal by 15). The worth of the loss function looks much different:
Despite the model prediction being off by 15 in each examples, the loss function penalized the underestimation much higher than the overestimation. Since the ninety fifth quantile is being learned, the loss function penalizes any prediction below the goal by an element of 0.95, and any prediction above the goal by an element of 0.05. Hence, the model is “forced” to favor overprediction above underprediction when learning the ninety fifth quantile. That is the case when learning any quantiles above the median — the alternative is true when learning quantiles below the median. To see this higher, we will visualize the loss function for every predicted quantile:
Catboost now extends this concept by allowing the bottom decision trees to output multiple quantiles per node. This permits a single model to predict multiple quantiles by minimizing a recent loss function:
To grasp how the multi-quantile loss function works, let’s start with an easy dataset. The next python code generates an artificial linear dataset with gaussian additive noise:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostRegressor
sns.set()# Number of coaching and testing examples
n = 1000
# Generate random x values between 0 and 1
x_train = np.random.rand(n)
x_test = np.random.rand(n)
# Generate random noise for the goal
noise_train = np.random.normal(0, 0.3, n)
noise_test = np.random.normal(0, 0.3, n)
# Set the slope and y-intercept of the road
a, b = 2, 3
# Generate y values in keeping with the equation y = ax + b + noise
y_train = a * x_train + b + noise_train
y_test = a * x_test + b + noise_test
The resulting training data should look much like this:
Next, the quantiles of the goal distribution should be specified for prediction. For example the facility of the multi-quantile loss, this model will seek to predict 99 different quantile values for every statement. This may almost be regarded as taking samples of size 99 from each predicted distribution.
# Store quantiles 0.01 through 0.99 in an inventory
quantiles = [q/100 for q in range(1, 100)]# Format the quantiles as a string for Catboost
quantile_str = str(quantiles).replace('[','').replace(']','')
# Fit the multi quantile model
model = CatBoostRegressor(iterations=100,
loss_function=f'MultiQuantile:alpha={quantile_str}')
model.fit(x_train.reshape(-1,1), y_train)
# Make predictions on the test set
preds = model.predict(x_test.reshape(-1, 1))
preds = pd.DataFrame(preds, columns=[f'pred_{q}' for q in quantiles])
The resulting predictions DataFrame looks something like this:
Each row corresponds to a testing example and every column gives a predicted quantile. As an example, the tenth quantile prediction for the primary test example was 3.318624. Since there is simply one input feature, we will visualize a number of predicted quantiles overlayed on the testing data:
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x_test, y_test)for col in ['pred_0.05', 'pred_0.5', 'pred_0.95']:
ax.scatter(x_test.reshape(-1,1), preds[col], alpha=0.50, label=col)
ax.legend()
Visually, the ninety fifth and fifth quantiles appear to do an excellent job accounting for uncertainty in the information. Furthermore, the fiftieth quantile (i.e the median) roughly approximates the road of best fit. When working with predicted quantiles, one metric we’re often excited by analyzing is coverage. Coverage is the share of targets that fall between two desired quantiles. For example, coverage will be computed using the fifth and ninety fifth quantiles as follows:
coverage_90 = np.mean((y_test <= preds['pred_0.95']) & (y_test >= preds['pred_0.05']))*100
print(coverage_90)
# Output: 91.4
Using the fifth and ninety fifth quantiles, assuming perfect calibration, we might expect to have a coverage of 95–5 = 90%. In this instance, the anticipated quantiles were barely off but still close, giving a coverage value of 91.4%.
Let’s now analyze the whole predicted distribution that the model outputs. Recall, the road of best fit is y = 2x + 3. Subsequently, for any input x, the mean of the anticipated distribution must be around 2x + 3. Likewise, because random gaussian noise was added to the information with an ordinary deviation of 0.3, the usual deviation of the anticipated distribution must be around 0.3. Let’s test this:
# Give the model a recent input of x = 0.4
x = np.array([0.4])# We expect the mean of this array to be about 2*0.4 + 3 = 3.8
# We expect the usual deviation of this array to be about 0.30
y_pred = model.predict(x.reshape(-1, 1))
mu = np.mean(y_pred)
sigma = np.std(y_pred)
print(mu) # Output: 3.836147287742427
print(sigma) # Output: 0.3283984093786787
# Plot the anticipated distribution
fig, ax = plt.subplots(figsize=(10, 6))
_ = ax.hist(y_pred.reshape(-1,1), density=True)
ax.set_xlabel('$y$')
ax.set_title(f'Predicted Distribution $P(y|x=4)$, $mu$ = {round(mu, 3)}, $sigma$ = {round(sigma, 3)}')
Amazingly, the anticipated distribution seems to align with our expectations. Next, let’s try a harder example.
To see the true power of multi-quantile regression, let’s make the educational task harder. The next code creates a non-linear dataset with heterogeneous noise that will depend on specific regions of the domain:
# Create regions of the domain which have variable noise
bounds = [(-10, -8), (-5, -4), (-4, -3), (-3, -1), (-1, 1), (1, 3), (3, 4), (4, 5), (8, 10)]
scales = [18, 15, 8, 11, 1, 2, 9, 16, 19]x_train = np.array([])
x_test = np.array([])
y_train = np.array([])
y_test = np.array([])
for b, scale in zip(bounds, scales):
# Randomly select the variety of samples in each region
n = np.random.randint(low=100, high = 200)
# Generate values of the domain between b[0] and b[1]
x_curr = np.linspace(b[0], b[1], n)
# For even scales, noise comes from an exponential distribution
if scale % 2 == 0:
noise_train = np.random.exponential(scale=scale, size=n)
noise_test = np.random.exponential(scale=scale, size=n)
# For odd scales, noise comes from a traditional distribution
else:
noise_train = np.random.normal(scale=scale, size=n)
noise_test = np.random.normal(scale=scale, size=n)
# Create training and testing sets
y_curr_train = x_curr**2 + noise_train
y_curr_test = x_curr**2 + noise_test
x_train = np.concatenate([x_train, x_curr])
x_test = np.concatenate([x_test, x_curr])
y_train = np.concatenate([y_train, y_curr_train])
y_test = np.concatenate([y_test, y_curr_test])
The resulting training data looks like this:
We’ll fit the Catboost regressor as the identical way as example 1 and visualize the predictions on a test set:
model = CatBoostRegressor(iterations=300,
loss_function=f'MultiQuantile:alpha={quantile_str}')model.fit(x_train.reshape(-1,1), y_train)
preds = model.predict(x_test.reshape(-1, 1))
preds = pd.DataFrame(preds, columns=[f'pred_{q}' for q in quantiles])
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x_test, y_test)
for col in ['pred_0.05', 'pred_0.5', 'pred_0.95']:
quantile = int(float(col.split('_')[-1])*100)
label_name = f'Predicted Quantile {quantile}'
ax.scatter(x_test.reshape(-1,1), preds[col], alpha=0.50, label=label_name)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Testing Data for Example 2 with Predicted Quantiles')
ax.legend()
Upon visual inspection, the model has characterised this non-linear, heteroscedastic relationship well. Notice how, near x = 0, the three predicted quantiles converge towards a single value. It is because the region near x = 0 has almost no noise — any model that accurately predicts the conditional probability distribution on this region should predict a small variance. Conversely, when x is between 7.5 and 10.0, the anticipated quantiles are much further apart due to the inherent noise within the region. 90% coverage will be computed as before:
coverage_90 = np.mean((y_test <= preds['pred_0.95']) & (y_test >= preds['pred_0.05']))*100
print(coverage_90)
# Output: 90.506
Using the fifth and ninety fifth quantiles, assuming perfect calibration, the expected coverage is 95–5 = 90%. In this instance, the anticipated quantiles were even higher than example 1, giving a coverage of 90.506%.
Finally, let’s take a look at a number of inputs and their corresponding predicted distribution.
The distribution above does a pleasant job of capturing the goal value with a comparatively small variance. That is to be expected because the goal values within the training data when x = 0 have little noise. Contrast this with the anticipated goal distribution when x = -8.56:
This distribution is correct skewed and has a much higher variance. This is anticipated for this region of information since the noise was sampled from an exponential distribution with high variance.
This text demonstrated the facility of multi-quantile regression for learning arbitrary conditional goal distributions. We only explored two one-dimensional examples to visually inspect model performance, but I might encourage anyone interested to try the multi-quantiles loss function on higher dimensional data. It’s vital to notice that quantile regression makes no statistical or algorithmic guarantees of convergence, and the performance of those models will vary depending on the character of the educational problem. Thanks for reading!
Turn into a Member: https://harrisonfhoffman.medium.com/membership
Buy me a coffee: https://www.buymeacoffee.com/HarrisonfhU
- Catboost Loss Functions — https://catboost.ai/en/docs/concepts/loss-functions-regression#MultiQuantile
jazz instrumental music