When we work with classification algorithms in machine learning, such as Logistic Regression, K-Nearest Neighbors, or Support Vector Classifiers, we don’t use evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE).
Instead, we generate a confusion matrix and, based on it, a classification report.
In this blog, we aim to understand what a confusion matrix is, how to calculate Accuracy, Precision, Recall, and F1-Score from it, and how to choose the appropriate metric based on the characteristics of the data.
To understand the confusion matrix and classification metrics, let’s use the Breast Cancer Wisconsin Dataset.
This dataset consists of 569 rows, and each row provides various features of a tumor along with its diagnosis: malignant (cancerous) or benign (non-cancerous).
Now let’s build a classification model for this data to classify the tumors based on their features.
We will apply Logistic Regression to train a model on this dataset.
Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
column_names = [
"id", "diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean",
"compactness_mean", "concavity_mean", "concave_points_mean", "symmetry_mean", "fractal_dimension_mean",
"radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se",
"concave_points_se", "symmetry_se", "fractal_dimension_se", "radius_worst", "texture_worst",
"perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", "concavity_worst",
"concave_points_worst", "symmetry_worst", "fractal_dimension_worst"
]
df = pd.read_csv("C:/wdbc.data", header=None, names=column_names)
# Drop ID column
df = df.drop(columns=["id"])
# Encode target: M=1 (malignant), B=0 (benign)
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
# Split features and target
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Confusion Matrix and Classification Report
conf_matrix = confusion_matrix(y_test, y_pred, labels=[1, 0]) # 1 = Malignant, 0 = Benign
report = classification_report(y_test, y_pred, labels=[1, 0], target_names=["Malignant", "Benign"])
# Display results
print("Confusion Matrix:n", conf_matrix)
print("nClassification Report:n", report)
# Plot Confusion Matrix
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Purples", xticklabels=["Malignant", "Benign"], yticklabels=["Malignant", "Benign"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()
Here, after applying logistic regression to the data, we generated a confusion matrix and a classification report to evaluate the model’s performance.
First, let’s understand the confusion matrix.
From the above confusion matrix:
‘60’ represents the correctly predicted Malignant tumors, which we refer to as “True Positives”.
‘4’ represents the tumors incorrectly predicted as Benign that are actually Malignant, which we refer to as “False Negatives”.
‘1’ represents the tumors incorrectly predicted as Malignant that are actually Benign, which we refer to as “False Positives”.
‘106’ represents the correctly predicted Benign tumors, which we refer to as “True Negatives”.
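Since we passed labels=[1, 0] to confusion_matrix, the actual class runs along the rows and the predicted class along the columns, so the four values above sit in the matrix like this:

                     Predicted Malignant   Predicted Benign
Actual Malignant           60 (TP)               4 (FN)
Actual Benign               1 (FP)             106 (TN)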
Now let’s see what we can do with these values.
For that, we consider the classification report.

From the above classification report, we can say that:
For Malignant:
– Precision is 0.98, which means that when the model predicts a tumor as Malignant, it is correct 98% of the time.
– Recall is 0.94, which means the model correctly identified 94% of all Malignant tumors.
– F1-score is 0.96, which balances both precision and recall.
For Benign:
– Precision is 0.96, which means that when the model predicts a tumor as Benign, it is correct 96% of the time.
– Recall is 0.99, which means the model correctly identified 99% of all Benign tumors.
– F1-score is 0.98.
From the report, we can observe that the accuracy of the model is 97%.
We also have the Macro Average and Weighted Average; let’s see how these are calculated.
Macro Average
The Macro Average is the average of each metric (precision, recall, and f1-score) across both classes, giving equal weight to each class regardless of how many samples it contains.
We use the macro average when we want to know the model’s performance across all classes, ignoring class imbalance.
For this data:
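Using the per-class values from the report above:
Macro Avg Precision = (0.98 + 0.96) / 2 = 0.97
Macro Avg Recall = (0.94 + 0.99) / 2 = 0.965
Macro Avg F1-Score = (0.96 + 0.98) / 2 = 0.97
These agree with the macro avg row of the report up to rounding.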

Weighted Average
The Weighted Average also averages each metric, but gives more weight to the class with more samples.
In the above code, we used test_size=0.3, which means we set aside 30% of the data for testing, so the test set contains 171 of the 569 samples.
The confusion matrix and classification report are based on this test set.
Out of the 171 samples in the test set, we have 64 Malignant tumors and 107 Benign tumors.
Now let’s see how this weighted average is calculated for all metrics.
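As a minimal sketch (assuming the per-class values from the report and the class counts stated above), the weighted averages can be reproduced like this:
# Per-class metrics from the report and class counts in the test set
n_malignant, n_benign = 64, 107
per_class = {
    "precision": {"Malignant": 0.98, "Benign": 0.96},
    "recall":    {"Malignant": 0.94, "Benign": 0.99},
    "f1-score":  {"Malignant": 0.96, "Benign": 0.98},
}
total = n_malignant + n_benign  # 171 test samples
for metric, scores in per_class.items():
    # Weight each class's score by its share of the test set
    weighted = (n_malignant * scores["Malignant"] + n_benign * scores["Benign"]) / total
    print(f"Weighted {metric}: {weighted:.2f}")  # each works out to roughly 0.97
Here the weighted and macro averages come out very close because both classes score similarly; with stronger class imbalance or a bigger gap between the classes, the two averages can differ noticeably.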

The weighted average gives us a more realistic performance measure when we have class-imbalanced datasets.
We now have an idea of every term in the classification report and also of how the macro and weighted averages are calculated.
Now let’s see how the confusion matrix is used to generate the classification report.
In the classification report we have different metrics like accuracy, precision, etc., and these metrics are calculated using the values in the confusion matrix.
From the confusion matrix, we have:
True Positives (TP) = 60
False Negatives (FN) = 4
False Positives (FP) = 1
True Negatives (TN) = 106
Now let’s calculate the classification metrics using these values.
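Taking Malignant as the positive class and plugging these values into the standard formulas:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (60 + 106) / 171 ≈ 0.97
Precision = TP / (TP + FP) = 60 / (60 + 1) ≈ 0.98
Recall = TP / (TP + FN) = 60 / (60 + 4) ≈ 0.94
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) ≈ 0.96
These are exactly the values reported for the Malignant class in the classification report.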

This is how we calculate the classification metrics using a confusion matrix.
But why do we have four different classification metrics instead of a single metric like accuracy? Because different metrics reveal different strengths and weaknesses of the classifier, depending on the context of the data.
Now let’s come back to the Wisconsin Breast Cancer Dataset we used here.
When we applied a logistic regression model to this data, we got an accuracy of 97%, which is high and might make us think the model is performing well.
But let’s consider another metric, recall, which is 0.94 for this model: out of all the malignant tumors in the test set, the model was able to identify 94% of them correctly.
Here, the model missed about 6% of malignant cases.
In real-world scenarios, especially healthcare applications like cancer detection, missing a positive case can delay diagnosis and treatment.
This shows that even with an accuracy of 97%, we need to look deeper, based on the context of the data, by considering different metrics.
So what can we do now? Should we aim for a recall of 1.0, meaning every malignant tumor is identified correctly? If we push recall toward 1.0, precision drops, because the model starts classifying more benign tumors as malignant.
When the model classifies more benign tumors as malignant, it causes unnecessary anxiety and may require additional tests or treatments.
Here, we should aim to maximize recall while keeping precision reasonably high.
We can do this by changing the threshold the classifier uses to classify the samples.
Most classifiers use a threshold of 0.5; if we change it to 0.3, we are saying that even if the model is only 30% confident, it should classify the tumor as malignant.
Now let’s use a custom threshold of 0.3.
Code:
# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Predict probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# Apply custom threshold
threshold = 0.3
y_pred_custom = (y_probs >= threshold).astype(int)
# Classification Report
report = classification_report(y_test, y_pred_custom, labels=[1, 0], target_names=["Malignant", "Benign"])
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_custom, labels=[1, 0])
# Display results
print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", report)
# Plot Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(
conf_matrix,
annot=True,
fmt="d",
cmap="Purples",
xticklabels=["Malignant", "Benign"],
yticklabels=["Malignant", "Benign"]
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix (Threshold = 0.3)")
plt.tight_layout()
plt.show()
Here we applied a custom threshold of 0.3 and generated a confusion matrix and a classification report.

Classification Report:

Here, the accuracy increased to 98%, the recall for malignant increased to 97%, and the precision remained the same.
We discussed earlier that precision might decrease if we try to maximize recall, but here precision stays the same; this depends on the data (whether it is balanced or not), the preprocessing steps, and how the threshold is tuned.
For medical datasets like this, maximizing recall is commonly preferred over accuracy or precision.
For tasks like spam detection or fraud detection, we prefer precision, and just as in the method above, we try to improve precision by tuning the threshold accordingly and by balancing the tradeoff between precision and recall.
We use the f1-score when the data is imbalanced and when we care about both precision and recall, that is, when neither false positives nor false negatives can be ignored.
Dataset Source
Wisconsin Breast Cancer Dataset
Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993). Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license and is free to use for commercial or educational purposes as long as proper credit is given to the original source.
Here we discussed what a confusion matrix is and how it is used to calculate the different classification metrics like accuracy, precision, recall, and f1-score.
We also explored when to prioritize each classification metric, using the Wisconsin cancer dataset as an example, where we preferred maximizing recall.
I hope you found this blog helpful in understanding the confusion matrix and classification metrics more clearly.
Thanks for reading.