I’ve been working as a data science consultant for the past three years, and I’ve had the chance to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with:
They rarely have a clear idea of the project objective.
This is certainly one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain.
But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer, for instance: which customers are likely to churn?
Well, now what? Easy, let’s start building some models!
Wrong!
If having a clear objective is rare, having a reliable benchmark is even rarer.
In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client.
In this blog post, I’ll explain:
- What a benchmark is,
- Why it’s important to have a benchmark,
- How I would build one using an example scenario, and
- Some potential drawbacks to keep in mind.
What is a benchmark?
A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared.
A benchmark needs two key components to be considered complete:
- A set of metrics to evaluate the performance
- A set of simple models to use as baselines
The idea at its core is simple: every time I develop a new model, I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked.
It is important to understand that this baseline shouldn’t be model- or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.
If I encounter a new dataset with the same business objective, this benchmark should still be a reliable reference point.
Why building a benchmark is important
Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a robust benchmark.
- Without a Benchmark you’re aiming for perfection — If you work without a clear reference point, any result loses meaning: is a given MAE good or bad? I don’t know! Maybe a simple mean predictor would already get you a MAE of 25.000. By comparing your model to a baseline, you can measure both performance and improvement.
- Improves Communication with Clients — Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases, benchmarks could come directly from the business in various shapes or forms.
- Helps in Model Selection — A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering.
- Model Drift Detection and Monitoring — Models can degrade over time. By having a benchmark, you may be able to catch drift early by comparing new model outputs against past benchmarks and baselines.
- Consistency Between Different Datasets — Datasets evolve. By having a fixed set of metrics and models, you ensure that performance comparisons remain valid over time.
With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven.
How I would build a benchmark
I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one.
Let’s start from the business question we presented at the very beginning of this blog post: predicting which customers are likely to churn.
For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist.
For this example, I’m using a publicly available churn dataset (CC0: Public Domain). The data contains some attributes from a company’s customer base (e.g., age, sex, number of products, …) along with their churn status.
Now that we have something to work on, let’s build the benchmark:
1. Defining the metrics
We’re dealing with a churn use case; specifically, this is a binary classification problem. Thus, the main metrics that we could use are:
- Precision — Percentage of correctly predicted churners among all predicted churners
- Recall — Percentage of actual churners correctly identified
- F1 score — Balances precision and recall
- True Positives, False Positives, True Negatives and False Negatives
These are some of the “simple” metrics that could be used to evaluate the output of a model.
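As a quick reference, here is a minimal sketch of how the count-based helpers (tp, tn, fp, fn) that I’ll pass to the benchmark class later could look; treat the exact implementation as an assumption:

# A minimal sketch of the count-based metric helpers used later in the benchmark
import numpy as np

def tp(y_true, y_pred):
    # Number of correctly predicted churners
    return int(np.sum(np.logical_and(y_pred == 1, y_true == 1)))

def tn(y_true, y_pred):
    # Number of correctly predicted non-churners
    return int(np.sum(np.logical_and(y_pred == 0, y_true == 0)))

def fp(y_true, y_pred):
    # Non-churners wrongly flagged as churners
    return int(np.sum(np.logical_and(y_pred == 1, y_true == 0)))

def fn(y_true, y_pred):
    # Churners the model missed
    return int(np.sum(np.logical_and(y_pred == 0, y_true == 1)))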
However, this is not an exhaustive list: standard metrics aren’t always enough. In many use cases, it can be useful to build custom metrics.
Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates:
- A cost ($250) when offering the discount to a non-churning customer
- A profit ($1000) when retaining a churning customer
Following this definition, we can build a custom metric that will be crucial in our scenario:
# Defining the business case-specific reference metric
def financial_gain(y_true, y_pred):
    loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250
    gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000
    return gain_from_tp - loss_from_fp
When you’re building business-driven metrics, these are usually the most relevant. Such metrics can take any shape or form: financial goals, minimum requirements, percentage of coverage and more.
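As another hypothetical example (not part of the benchmark we run below), a coverage-style metric could measure the share of actual churners the campaign would reach:

# Hypothetical coverage metric: share of actual churners reached by the campaign
def churner_coverage(y_true, y_pred):
    churners = np.sum(y_true == 1)
    reached = np.sum(np.logical_and(y_pred == 1, y_true == 1))
    return reached / churners if churners > 0 else 0.0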
2. Defining the benchmarks
Now that we’ve defined our metrics, we can define a set of baseline models to use as a reference.
In this phase, you should define a list of simple-to-implement models in their simplest possible setup. There is no reason at this stage to spend time and resources on optimizing these models; my mindset is:
If I had 15 minutes, how would I implement this model?
You can add more baseline models in later phases, as the project proceeds.
In this case, I’ll use the following models:
- Random Model — Assigns labels at random, matching the churn rate observed in the training set
- Majority Model — Always predicts the most frequent class
- Simple XGB
- Simple KNN
import numpy as np
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier

class BinaryMean():
    # Random model: assigns labels at random, matching the training churn rate
    @staticmethod
    def run_benchmark(df_train, df_test):
        np.random.seed(21)
        return np.random.choice(a=[1, 0], size=len(df_test), p=[df_train['y'].mean(), 1 - df_train['y'].mean()])

class SimpleXbg():
    # Plain XGBoost classifier on the numeric columns, no tuning
    @staticmethod
    def run_benchmark(df_train, df_test):
        model = xgb.XGBClassifier()
        model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
        return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))

class MajorityClass():
    # Always predicts the most frequent class in the training set
    @staticmethod
    def run_benchmark(df_train, df_test):
        majority_class = df_train['y'].mode()[0]
        return np.full(len(df_test), majority_class)

class SimpleKNN():
    # Plain k-nearest neighbours classifier on the numeric columns, no tuning
    @staticmethod
    def run_benchmark(df_train, df_test):
        model = KNeighborsClassifier()
        model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
        return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))
Again, as in the case of the metrics, we can build custom benchmarks.
Let’s assume that in our business case the marketing team contacts every client who is:
- Aged 50 or older, and
- No longer an active member
Following this rule, we can build this model:
# Defining the business case-specific benchmark
class BusinessBenchmark():
    # Flags clients aged 50 or older who are no longer active members
    @staticmethod
    def run_benchmark(df_train, df_test):
        df = df_test.copy()
        df.loc[:, 'y_hat'] = 0
        df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1
        return df['y_hat']
3. Running the benchmark
To run the benchmark I’ll use the following class. The entry point is the method compare_pred_with_benchmark(), which, given a prediction, runs all the models and calculates all the metrics.
import numpy as np

class ChurnBinaryBenchmark():
    # Compares a prediction against a set of benchmark models on a set of metrics
    def __init__(
        self,
        metrics = [],
        benchmark_models = [],
    ):
        self.metrics = metrics
        self.benchmark_models = benchmark_models

    def compare_pred_with_benchmark(
        self,
        df_train,
        df_test,
        my_predictions,
    ):
        # Metrics for my own predictions
        output_metrics = {
            'Prediction': self._calculate_metrics(df_test['y'], my_predictions)
        }

        # Run every benchmark model and compute the same metrics on its output
        dct_benchmarks = {}
        for model in self.benchmark_models:
            dct_benchmarks[model.__name__] = model.run_benchmark(df_train=df_train, df_test=df_test)
            output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], dct_benchmarks[model.__name__])

        return output_metrics

    def _calculate_metrics(self, y_true, y_pred):
        return {getattr(func, '__name__', 'Unknown'): func(y_true=y_true, y_pred=y_pred) for func in self.metrics}
Now all we need is a prediction. For this example, I did some quick feature engineering and some hyperparameter tuning.
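For completeness, here is a minimal sketch of how df_train, df_test and preds could be produced, assuming the churn dataset has been loaded into a DataFrame df with a binary target column y; the split parameters and model settings are illustrative assumptions, not the exact ones I used:

# Illustrative only: train/test split plus a lightly tuned XGBoost model
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, random_state=21, stratify=df['y'])

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
preds = model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))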
The last step is simply to run the benchmark:
from sklearn.metrics import f1_score, precision_score, recall_score
import pandas as pd

binary_benchmark = ChurnBinaryBenchmark(
    metrics=[f1_score, precision_score, recall_score, tp, tn, fp, fn, financial_gain],
    benchmark_models=[BinaryMean, SimpleXbg, MajorityClass, SimpleKNN, BusinessBenchmark]
)

res = binary_benchmark.compare_pred_with_benchmark(
    df_train=df_train,
    df_test=df_test,
    my_predictions=preds,
)

pd.DataFrame(res)
This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions about the model’s predictions and make informed decisions on the next steps of the process.
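If you prefer to read the table with one row per model, a quick transpose and sort on the business metric makes it easier to scan (an optional step, not required by the benchmark itself):

# Optional: one row per model, ranked by the business-driven metric
pd.DataFrame(res).T.sort_values('financial_gain', ascending=False)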
Some drawbacks
As we’ve seen, there are plenty of reasons why it is helpful to have a benchmark. However, although benchmarks are incredibly useful, there are some pitfalls to watch out for:
- Non-Informative Benchmark — When the metrics or models are poorly defined, the marginal impact of having a benchmark decreases.
- Misinterpretation by Stakeholders — Communication with the client is essential; it’s important to state clearly what the metrics are measuring.
- Overfitting to the Benchmark — You might end up crafting features that are too specific: they may beat the benchmark but don’t generalize well in prediction.
- Change of Objective — The objectives defined at the start might change, due to miscommunication or changes in plans.
Final thoughts
Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof and ensure that every iteration brings real value.
They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements.