MODEL EVALUATION & OPTIMIZATION
You’ve trained several classification models, and they all appear to be performing well with high accuracy scores. Congratulations!
But hold on: is one model truly better than the others? Accuracy alone doesn’t tell the whole story. What if one model consistently overestimates its confidence, while another underestimates it? That is where model calibration comes in.
Here, we’ll see what model calibration is and explore how to assess the reliability of your models’ predictions, using visuals and practical code examples to show you how to identify calibration issues. Get ready to go beyond accuracy and unlock the true potential of your machine learning models!
Model calibration measures how well a model’s prediction probabilities match its actual performance. A model that gives a 70% probability score should be correct 70% of the time for similar predictions. This means its probability scores should reflect the true likelihood of its predictions being correct.
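As a quick, hypothetical illustration of this idea, the sketch below simulates a perfectly calibrated model that always predicts a 70% probability; across many such predictions, the observed success rate should land near 70%. The data here is synthetic, purely for illustration.

import numpy as np

# Hypothetical sketch: simulate outcomes for a model that always predicts 70%
rng = np.random.default_rng(42)
predicted_prob = np.full(1000, 0.7)

# For a perfectly calibrated model, each prediction comes true with the stated probability
outcomes = rng.random(1000) < predicted_prob

print(f"Mean predicted probability: {predicted_prob.mean():.2f}")  # 0.70
print(f"Observed success rate:      {outcomes.mean():.2f}")        # close to 0.70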
Why Calibration Matters
While accuracy tells us how often a model is correct overall, calibration tells us whether we can trust its probability scores. Two models might both have 90% accuracy, but one might give realistic probability scores while the other gives overly confident predictions. In many real applications, having reliable probability scores is just as important as having correct predictions.
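To make this concrete, here is a toy sketch with made-up numbers: two hypothetical models classify the same ten samples with identical accuracy, but the overconfident one pays a much larger Log Loss penalty (a calibration metric covered later in this article) for its single mistake.

import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Toy example (hypothetical values): same labels, same accuracy, different confidence
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 1])

prob_realistic     = np.array([0.8, 0.7, 0.9, 0.6, 0.8, 0.3, 0.2, 0.4, 0.3, 0.4])
prob_overconfident = np.array([0.99, 0.98, 0.99, 0.97, 0.99, 0.02, 0.01, 0.03, 0.02, 0.01])

for name, prob in [('Realistic model', prob_realistic), ('Overconfident model', prob_overconfident)]:
    pred = (prob >= 0.5).astype(int)
    print(f"{name}: accuracy = {accuracy_score(y_true, pred):.2f}, "
          f"log loss = {log_loss(y_true, prob):.3f}")

Both models reach 90% accuracy, yet the overconfident one receives a far worse log loss, which is exactly the kind of difference accuracy alone cannot reveal.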
Perfect Calibration vs. Reality
A perfectly calibrated model would show a direct match between its prediction probabilities and actual success rates: when it predicts with 90% probability, it should be correct 90% of the time. The same applies at every probability level.
However, most models aren’t perfectly calibrated. They may be:
- Overconfident: giving probability scores that are too high for their actual performance
- Underconfident: giving probability scores that are too low for their actual performance
- Both: overconfident in some ranges and underconfident in others
This mismatch between predicted probabilities and actual correctness can lead to poor decision-making when using these models in real applications. This is why understanding and improving model calibration is essential for building reliable machine learning systems.
To explore model calibration, we’ll continue with the same dataset used in my previous articles on Classification Algorithms: predicting whether someone will play golf or not based on weather conditions.
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create and prepare dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,
True, False, True, True, False, False, True, False, True, True, False,
True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
# Prepare data
df = pd.DataFrame(dataset_dict)
Before training our models, we normalized numerical weather measurements through standard scaling and transformed categorical features with one-hot encoding. These preprocessing steps ensure all models can effectively use the information while maintaining fair comparisons between them.
from sklearn.preprocessing import StandardScaler
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]
# Prepare features and target
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Scale numerical features
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])
Models and Training
For this exploration, we trained four classification models that achieve similar accuracy scores:
- K-Nearest Neighbors (kNN)
- Bernoulli Naive Bayes
- Logistic Regression
- Multi-Layer Perceptron (MLP)
For those who are curious about how these algorithms make predictions and compute their probabilities, you can refer to this article:
While these models achieve the same accuracy on this simple problem, they calculate their prediction probabilities differently.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB

# Initialize the models with the found parameters
knn = KNeighborsClassifier(n_neighbors=4, weights='distance')
bnb = BernoulliNB()
lr = LogisticRegression(C=1, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)
# Train all models
models = {
'KNN': knn,
'BNB': bnb,
'LR': lr,
'MLP': mlp
}
for name, model in models.items():
model.fit(X_train, y_train)
# Create predictions and probabilities for each model
results_dict = {
'True Labels': y_test
}
for name, model in models.items():
# results_dict[f'{name} Pred'] = model.predict(X_test)
results_dict[f'{name} Prob'] = model.predict_proba(X_test)[:, 1]
# Create results dataframe
results_df = pd.DataFrame(results_dict)
# Print predictions and probabilities
print("nPredictions and Probabilities:")
print(results_df)
# Print accuracies
print("nAccuracies:")
for name, model in models.items():
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"{name}: {accuracy:.3f}")
Through these differences, we’ll explore why we need to look beyond accuracy.
To evaluate how well a model’s prediction probabilities match its actual performance, we use several methods and metrics. These measurements help us understand whether our model’s confidence levels are reliable.
Brier Score
The Brier Score measures the mean squared difference between predicted probabilities and actual outcomes. It ranges from 0 to 1, where lower scores indicate better calibration. This score is especially useful because it considers both calibration and accuracy together.
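As a quick check on this definition, the following sketch (with made-up probabilities and labels) computes the mean squared difference by hand and compares it with scikit-learn’s brier_score_loss.

import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical values, purely for illustration
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.6, 0.4])

# Brier Score = mean squared difference between predicted probabilities and outcomes
brier_manual = np.mean((y_prob - y_true) ** 2)
brier_sklearn = brier_score_loss(y_true, y_prob)

print(f"Manual:  {brier_manual:.3f}")   # 0.076
print(f"sklearn: {brier_sklearn:.3f}")  # 0.076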
Log Loss
Log Loss calculates the negative log-likelihood of the correct outcomes given the predicted probabilities. This metric is especially sensitive to confident but wrong predictions: when a model says it’s 90% sure but is wrong, it receives a much larger penalty than when it’s 60% sure and wrong. Lower values indicate better calibration.
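The asymmetry of this penalty is easy to see in a two-line sketch (hypothetical confidences): the per-sample penalty is the negative log of the probability assigned to the true class.

import numpy as np

# Penalty for predicting "Yes" with high vs. moderate confidence when the answer is "No"
confident_wrong = -np.log(1 - 0.9)   # said 90% "Yes", truth was "No"
hesitant_wrong  = -np.log(1 - 0.6)   # said 60% "Yes", truth was "No"

print(f"Penalty at 90% confidence: {confident_wrong:.2f}")  # ~2.30
print(f"Penalty at 60% confidence: {hesitant_wrong:.2f}")   # ~0.92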
Expected Calibration Error (ECE)
ECE measures the average difference between predicted probabilities and actual outcomes (taken as the average of the labels in each bin), weighted by how many predictions fall into each probability bin. This metric helps us understand whether our model has systematic biases in its probability estimates.
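Here is a small worked example (with hypothetical values) of the binning behind ECE, using just two probability bins; the calculate_ece helper used later in this article follows the same logic with five bins.

import numpy as np

# Hypothetical predictions split into two bins: [0.0, 0.5) and [0.5, 1.0]
y_true = np.array([0, 0, 1, 1, 1, 1])
y_prob = np.array([0.2, 0.4, 0.3, 0.7, 0.8, 0.9])

# Bin 1: confidences 0.2, 0.4, 0.3 (mean 0.30); actual positives 0, 0, 1 (mean 1/3)
# Bin 2: confidences 0.7, 0.8, 0.9 (mean 0.80); actual positives 1, 1, 1 (mean 1.0)
# Each bin's |confidence - accuracy| gap is weighted by its share of the predictions
ece = (3/6) * abs(0.30 - 1/3) + (3/6) * abs(0.80 - 1.0)
print(f"ECE: {ece:.3f}")  # 0.117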
Reliability Diagrams
Similar to ECE, a reliability diagram (or calibration curve) visualizes model calibration by binning predictions and comparing them to actual outcomes. While ECE gives us a single number measuring calibration error, the reliability diagram shows the same information graphically. We use the same binning approach and calculate the actual frequency of positive outcomes in each bin. When plotted, these points show us exactly where our model’s predictions deviate from perfect calibration, which would appear as a diagonal line.
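A minimal usage sketch (on hypothetical data) shows what scikit-learn’s calibration_curve returns: for each bin, the mean predicted probability and the observed fraction of positives, which are exactly the points plotted in a reliability diagram.

import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical labels and probabilities, purely for illustration
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.55, 0.6, 0.8, 0.9, 0.95])

# prob_true: observed fraction of positives per bin; prob_pred: mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5, strategy='uniform')

for pp, pt in zip(prob_pred, prob_true):
    print(f"mean predicted: {pp:.2f} | observed positives: {pt:.2f}")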
Comparing Calibration Metrics
Each of these metrics highlights a different aspect of calibration problems:
- A high Brier Score suggests overall poor probability estimates.
- High Log Loss points to overconfident wrong predictions.
- A high ECE indicates systematic bias in probability estimates.
Together, these metrics give us a complete picture of how well our model’s probability scores reflect its true performance.
Our Models
For our models, let’s calculate the calibration metrics and draw their calibration curves:
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Note: calculate_ece is the ECE helper function defined in the
# complete code listing at the end of this article

# Initialize models
models = {
'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=4, weights='distance'),
'Bernoulli Naive Bayes': BernoulliNB(),
'Logistic Regression': LogisticRegression(C=1.5, random_state=42),
'Multilayer Perceptron': MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)
}
# Get predictions and calculate metrics
metrics_dict = {}
for name, model in models.items():
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]
metrics_dict[name] = {
'Brier Score': brier_score_loss(y_test, y_prob),
'Log Loss': log_loss(y_test, y_prob),
'ECE': calculate_ece(y_test, y_prob),
'Probabilities': y_prob
}
# Plot calibration curves
fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=300)
colours = ['orangered', 'slategrey', 'gold', 'mediumorchid']
for idx, (name, metrics) in enumerate(metrics_dict.items()):
ax = axes.ravel()[idx]
prob_true, prob_pred = calibration_curve(y_test, metrics['Probabilities'],
n_bins=5, strategy='uniform')
ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
ax.plot(prob_pred, prob_true, color=colours[idx], marker='o',
label='Calibration curve', linewidth=2, markersize=8)
title = f'{name}\nBrier: {metrics["Brier Score"]:.3f} | Log Loss: {metrics["Log Loss"]:.3f} | ECE: {metrics["ECE"]:.3f}'
ax.set_title(title, fontsize=11, pad=10)
ax.grid(True, alpha=0.7)
ax.set_xlim([-0.05, 1.05])
ax.set_ylim([-0.05, 1.05])
ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)
ax.legend(fontsize=10, loc='upper left')
plt.tight_layout()
plt.show()
Now, let’s analyze the calibration performance of each model based on these metrics:
The k-Nearest Neighbors (kNN) model performs well at estimating how certain it should be about its predictions. Its curve stays close to the dotted line, which shows good performance. It has solid scores: a Brier Score of 0.148 and the best ECE of 0.090. While it sometimes shows too much confidence in the middle range, it generally makes reliable estimates about its certainty.
The Bernoulli Naive Bayes model shows an unusual stair-step pattern in its curve. This means it jumps between different levels of certainty instead of changing smoothly. While it has the same Brier Score as kNN (0.148), its higher ECE of 0.150 shows it is less accurate at estimating its certainty. The model switches between being too confident and not confident enough.
The Logistic Regression model shows clear issues with its predictions. Its curve moves far from the dotted line, meaning it often misjudges how certain it should be. It has the worst ECE (0.181) and a poor Brier Score (0.164). The model consistently shows too much confidence in its predictions, making it unreliable.
The Multilayer Perceptron shows a distinct problem. Despite having the best Brier Score (0.129), its curve reveals that it mostly makes extreme predictions, either very certain or very uncertain, with little in between. Its high ECE (0.167) and flat line in the middle ranges show it struggles to make balanced certainty estimates.
After examining all four models, the k-Nearest Neighbors clearly performs best at estimating its prediction certainty. It maintains consistent performance across different levels of certainty and shows the most reliable pattern in its predictions. While other models might score well on certain measures (like the Multilayer Perceptron’s Brier Score), their curves reveal they are not as reliable when we need to trust their certainty estimates.
When choosing between different models, we need to consider both their accuracy and their calibration quality. A model with slightly lower accuracy but better calibration may be more valuable than a highly accurate model with poor probability estimates.
By understanding calibration and its importance, we can build more reliable machine learning systems that users can trust not only for their predictions, but also for their confidence in those predictions.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Define ECE
def calculate_ece(y_true, y_prob, n_bins=5):
bins = np.linspace(0, 1, n_bins + 1)
ece = 0
for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):
mask = (y_prob >= bin_lower) & (y_prob < bin_upper)
if np.sum(mask) > 0:
bin_conf = np.mean(y_prob[mask])
bin_acc = np.mean(y_true[mask])
ece += np.abs(bin_conf - bin_acc) * np.sum(mask)
return ece / len(y_true)
# Create dataset and prepare data
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast','sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy','sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast','rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,True, False, True, True, False, False, True, False, True, True, False,True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes','Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes','Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
# Prepare and encode data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
df = df[['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']]
# Split and scale data
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])
# Train model and get predictions
model = BernoulliNB()
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]
# Calculate metrics
metrics = {
'Brier Score': brier_score_loss(y_test, y_prob),
'Log Loss': log_loss(y_test, y_prob),
'ECE': calculate_ece(y_test, y_prob)
}
# Plot calibration curve
plt.figure(figsize=(6, 6), dpi=300)
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=5, strategy='uniform')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.plot(prob_pred, prob_true, color='slategrey', marker='o',
label='Calibration curve', linewidth=2, markersize=8)
title = f'Bernoulli Naive Bayes\nBrier: {metrics["Brier Score"]:.3f} | Log Loss: {metrics["Log Loss"]:.3f} | ECE: {metrics["ECE"]:.3f}'
plt.title(title, fontsize=11, pad=10)
plt.grid(True, alpha=0.7)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.gca().spines[['top', 'right', 'left', 'bottom']].set_visible(False)
plt.legend(fontsize=10, loc='lower right')
plt.tight_layout()
plt.show()
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Define ECE
def calculate_ece(y_true, y_prob, n_bins=5):
bins = np.linspace(0, 1, n_bins + 1)
ece = 0
for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):
mask = (y_prob >= bin_lower) & (y_prob < bin_upper)
if np.sum(mask) > 0:
bin_conf = np.mean(y_prob[mask])
bin_acc = np.mean(y_true[mask])
ece += np.abs(bin_conf - bin_acc) * np.sum(mask)
return ece / len(y_true)
# Create dataset and prepare data
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast','sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy','sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast','rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,True, False, True, True, False, False, True, False, True, True, False,True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes','Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes','Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
# Prepare and encode data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
df = df[['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']]
# Split and scale data
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])
# Initialize models
models = {
'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=4, weights='distance'),
'Bernoulli Naive Bayes': BernoulliNB(),
'Logistic Regression': LogisticRegression(C=1.5, random_state=42),
'Multilayer Perceptron': MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)
}
# Get predictions and calculate metrics
metrics_dict = {}
for name, model in models.items():
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]
metrics_dict[name] = {
'Brier Score': brier_score_loss(y_test, y_prob),
'Log Loss': log_loss(y_test, y_prob),
'ECE': calculate_ece(y_test, y_prob),
'Probabilities': y_prob
}
# Plot calibration curves
fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=300)
colours = ['orangered', 'slategrey', 'gold', 'mediumorchid']
for idx, (name, metrics) in enumerate(metrics_dict.items()):
ax = axes.ravel()[idx]
prob_true, prob_pred = calibration_curve(y_test, metrics['Probabilities'],
n_bins=5, strategy='uniform')
ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
ax.plot(prob_pred, prob_true, color=colours[idx], marker='o',
label='Calibration curve', linewidth=2, markersize=8)
title = f'{name}\nBrier: {metrics["Brier Score"]:.3f} | Log Loss: {metrics["Log Loss"]:.3f} | ECE: {metrics["ECE"]:.3f}'
ax.set_title(title, fontsize=11, pad=10)
ax.grid(True, alpha=0.7)
ax.set_xlim([-0.05, 1.05])
ax.set_ylim([-0.05, 1.05])
ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)
ax.legend(fontsize=10, loc='upper left')
plt.tight_layout()
plt.show()
Technical Environment
This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.
About the Illustrations
Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.
𝙎𝙚𝙚 𝙢𝙤𝙧𝙚 𝙈𝙤𝙙𝙚𝙡 𝙀𝙫𝙖𝙡𝙪𝙖𝙩𝙞𝙤𝙣 & 𝙊𝙥𝙩𝙞𝙢𝙞𝙯𝙖𝙩𝙞𝙤𝙣 𝙢𝙚𝙩𝙝𝙤𝙙𝙨 𝙝𝙚𝙧𝙚:
Model Evaluation & Optimization
𝙔𝙤𝙪 𝙢𝙞𝙜𝙝𝙩 𝙖𝙡𝙨𝙤 𝙡𝙞𝙠𝙚: