
How to Evaluate the Performance of Your ML/AI Models


An accurate evaluation is the only way to improve performance

Photo by Scott Graham on Unsplash

Learning by doing is one of the best approaches to learning anything, from tech to a new language or cooking a new dish. Once you have learned the fundamentals of a field or an application, you can build on that knowledge by doing. Building models for different applications is the best way to make your knowledge of machine learning and artificial intelligence concrete.

Though both fields (or really sub-fields, since they do overlap) have applications in a wide range of contexts, the steps for learning how to build a model are roughly the same regardless of the target application field.

AI language models such as ChatGPT and Bard are gaining popularity and interest from both tech novices and general audiences because they can be very useful in our daily lives.

Now that more models are being released and presented, one may ask: what makes a “good” AI/ML model, and how can we evaluate its performance?

That is what we are going to cover in this article. We assume you already have an AI or ML model built and now want to evaluate and, if necessary, improve its performance. Regardless of the type of model you have and your end application, you can take steps to evaluate your model and improve its performance.

To help us follow the concepts, let’s use the Wine dataset from sklearn [1], apply the support vector classifier (SVC), and then test its metrics.

So, let’s jump right in…

First, let’s import the libraries we will use (don’t worry about what each of these does now; we’ll get to that!).

import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import matplotlib.pyplot as plt

Now, we read our dataset, apply the classifier, and evaluate it.

wine_data = datasets.load_wine()
X = wine_data.data
y = wine_data.target

Depending on your stage in the learning process, you may have access to a large amount of data that you can use for training, testing, and evaluating. Also, you will need different data to train and test your model, because using the same data for both would prevent you from genuinely assessing the model’s performance.

To address that challenge, split your data into three smaller random sets and use them for training, validation, and testing.

A good rule of thumb for that split is the 60/20/20 approach: 60% of the data for training, 20% for validation, and 20% for testing. You should shuffle your data before splitting it to ensure each set is representative of the whole.

I know that may sound complicated, but luckily, scikit-learn comes to the rescue by offering a function that performs the split for you: train_test_split().

So, we can take our dataset and split it like so:

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20, train_size=0.60, random_state=1, stratify=y)
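
The call above keeps 60% of the data for training and 20% for testing, so the remaining 20% is simply left out. If you also want an explicit validation set for the 60/20/20 scheme described earlier, one common pattern is to call train_test_split twice, as in the minimal sketch below (the variable names X_tr, X_val, X_tst, and so on are just illustrative):

#First split: 60% training, 40% held out (stratified, shuffled by default)
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, train_size=0.60, random_state=1, stratify=y)
#Second split: divide the held-out 40% evenly into validation and test sets
X_val, X_tst, y_val, y_tst = train_test_split(X_hold, y_hold, test_size=0.50, random_state=1, stratify=y_hold)

The rest of this article continues with the plain train/test split shown above.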

Then use the training portion of it as input to the classifier.

#Scale the features (fit the scaler on the training set only)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#Apply the SVC model (fitted here on the unscaled features; X_train_std/X_test_std could be used instead)
svc = SVC(kernel='linear', C=10.0, random_state=1)
svc.fit(X_train, Y_train)
#Obtain predictions for the test set
Y_pred = svc.predict(X_test)

At this point, we have some results to “evaluate.”

Before starting the evaluation process, we must ask ourselves an essential question about the model we use: what would make this model good?

The answer to this question depends on the model and how you plan to use it. That being said, there are standard evaluation metrics that data scientists use when they want to test the performance of an AI/ML model, including the following (a short sketch after the list shows how these are computed from raw prediction counts):

  1. Accuracy is the percentage of correct predictions the model makes out of the total number of predictions. In other words: when I run the model, how many of its predictions are correct among all predictions? This article goes into depth about testing the accuracy of a model.
  2. Precision is the percentage of true positive predictions out of all positive predictions made by the model. Unfortunately, precision and accuracy are often confused; one way to make the difference clear is to think of accuracy as the closeness of the predictions to the actual values, while precision measures how close the correct predictions are to one another. So accuracy is an absolute measure, yet both are essential for evaluating the model’s performance.
  3. Recall is the percentage of true positive predictions out of all actual positive instances in the dataset. Recall aims to find all the relevant cases within a dataset. Mathematically, increasing recall tends to decrease the model’s precision.
  4. The F1 score is the harmonic mean of precision and recall, providing a single balanced measure of a model’s performance that uses both. This video by CodeBasics discusses the relationship between precision, recall, and the F1 score, and how to find the optimal balance among these evaluation metrics.
Video By CodeBasics
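
To make these definitions concrete, here is a minimal sketch that computes all four metrics from raw counts for a single (binary) class; the counts tp, tn, fp, and fn are made-up numbers purely for illustration:

#Hypothetical prediction counts for one class (illustration only)
tp, tn, fp, fn = 8, 20, 2, 4
accuracy = (tp + tn) / (tp + tn + fp + fn)          #correct predictions / all predictions
precision = tp / (tp + fp)                          #true positives / all positive predictions
recall = tp / (tp + fn)                             #true positives / all actual positives
f1 = 2 * precision * recall / (precision + recall)  #harmonic mean of precision and recall
print('Accuracy: %.3f, Precision: %.3f, Recall: %.3f, F1: %.3f' % (accuracy, precision, recall, f1))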

Now, let’s calculate the different metrics for the predicted data. We will do that by first displaying the confusion matrix, which is simply a table of the actual values vs. the predicted values.

conf_matrix = confusion_matrix(y_true=Y_test, y_pred=Y_pred)
#Plot the confusion matrix
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(conf_matrix, cmap=plt.cm.Oranges, alpha=0.3)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center', size='xx-large')
plt.xlabel('Predicted Values', fontsize=18)
plt.ylabel('Actual Values', fontsize=18)
plt.show()

The confusion matrix for our dataset will look something like this:

If we look at this confusion matrix, we can see that in some cases the actual value was “1” while the predicted value was “0”, which means the classifier is not 100% accurate.

We can calculate this classifier’s accuracy, precision, recall, and F1 score using the following code.

print('Precision: %.3f' % precision_score(Y_test, Y_pred, average='micro'))
print('Recall: %.3f' % recall_score(Y_test, Y_pred, average='micro'))
print('Accuracy: %.3f' % accuracy_score(Y_test, Y_pred))
print('F1 Score: %.3f' % f1_score(Y_test, Y_pred, average='micro'))

For this particular example, the results are:

  1. Precision = 0.889
  2. Recall = 0.889
  3. Accuracy = 0.889
  4. F1 score = 0.889
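
All four values are identical because, with average='micro' on a multiclass problem, precision, recall, F1, and accuracy collapse to the same number. If you want a per-class breakdown instead, one option (a small sketch continuing with the same variables) is scikit-learn’s classification_report:

from sklearn.metrics import classification_report
#Per-class precision, recall, and F1, plus overall accuracy
print(classification_report(Y_test, Y_pred, target_names=wine_data.target_names))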

Though you can use many different approaches to evaluate your models, some evaluation methods estimate a model’s performance better depending on the model type. For example, in addition to the methods above, if the model you are evaluating is (or includes) a regression model, you can also use:

– Mean Squared Error (MSE), which is the average of the squared differences between predicted and actual values.

– Mean Absolute Error (MAE), which is the average of the absolute differences between predicted and actual values.

These two metrics are closely related, but implementation-wise, MAE is simpler (at least mathematically) than MSE. However, MAE does not handle large errors well, unlike MSE, which emphasizes them (since it squares them).
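
For completeness, here is a minimal sketch of both metrics using scikit-learn’s regression helpers; the y_actual and y_predicted lists are made-up values purely for illustration, since our wine example is a classification problem:

from sklearn.metrics import mean_squared_error, mean_absolute_error
#Toy regression outputs (illustration only)
y_actual = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5, 0.0, 2.0, 8.0]
print('MSE: %.3f' % mean_squared_error(y_actual, y_predicted))   #average of squared differences
print('MAE: %.3f' % mean_absolute_error(y_actual, y_predicted))  #average of absolute differences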

Before discussing hyperparameters, let’s first differentiate between a hyperparameter and a parameter. A parameter is part of how the model itself is defined to solve a problem, and its value is learned from the data. In contrast, hyperparameters are used to test, validate, and optimize the model’s performance. Hyperparameters are often chosen by the data scientist (or the client, in some cases) to control and validate the model’s training process and, hence, its performance.

There are different kinds of hyperparameters you can use to validate your model; some are general and can be used with almost any model, such as:

  • Learning Rate: this hyperparameter controls how much the model is changed in response to the estimated error each time the model’s parameters are updated. Choosing the optimal learning rate is a trade-off against the time needed for training. If the learning rate is low, it may slow down the training process; if it is too high, training will be faster, but the model’s performance may suffer.
  • Batch Size: the size of the batches of training data you feed the model will significantly affect its training time and learning dynamics. Finding the optimal batch size is a skill that is often developed as you build more models and gain experience.
  • Number of Epochs: an epoch is one complete pass through the training data. The number of epochs to use varies from one model to another. Theoretically, more epochs lead to fewer errors in the validation process.

In addition to the above, there are model-specific hyperparameters, such as the regularization strength of an SVM or the number of hidden layers in a neural network. This 15-minute video by APMonitor explores various hyperparameters and their differences.

Video by APMonitor
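
To make hyperparameter tuning concrete for our wine example, here is a minimal sketch that uses scikit-learn’s GridSearchCV to search over the SVC’s regularization strength C and its kernel; the grid values are arbitrary choices for illustration, not a recommendation:

from sklearn.model_selection import GridSearchCV
#Candidate hyperparameter values (illustrative choices only)
param_grid = {'C': [0.1, 1.0, 10.0, 100.0], 'kernel': ['linear', 'rbf']}
#5-fold cross-validated grid search on the standardized training data
search = GridSearchCV(SVC(random_state=1), param_grid, cv=5)
search.fit(X_train_std, Y_train)
print('Best hyperparameters:', search.best_params_)
print('Best cross-validation accuracy: %.3f' % search.best_score_)

The search refits the model for every combination of values using 5-fold cross-validation, which is exactly the kind of iterative evaluate-and-adjust loop described next.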

Validating an AI/ML model is not a linear process but an iterative one. You go through the data split, hyperparameter tuning, analysis, and validation of the results, often more than once. How many times you repeat that process depends on how the results come out. For some models, you may only need to do this once; for others, you may need to do it several times.

If you need to repeat the process, you will use the insights from the previous evaluation to improve the model’s architecture, training process, or hyperparameter settings until you are satisfied with the model’s performance.

Once you start building your own ML and AI models, you will quickly realize that choosing and implementing the model is the easy part of the workflow; testing and evaluation is the part that takes up most of the development process. Evaluating an AI/ML model is an iterative and often time-consuming process, and it requires careful analysis, experimentation, and fine-tuning to achieve the desired performance.

Luckily, the more experience you gain building models, the more systematic the process of evaluating your model’s performance becomes. And it is a worthwhile skill, given how much evaluating your model matters, for example:

  1. Evaluating our models allows us to objectively measure the model’s metrics, which helps us understand its strengths and weaknesses and provides insights into its predictive or decision-making capabilities.
  2. If different models that can solve the same problem exist, evaluating them enables us to compare their performance and choose the one that best suits our application.
  3. Evaluation provides insights into the model’s weaknesses, allowing for improvements by analyzing the errors and the areas where the model underperforms.

So, have patience and keep building models; it gets easier and more efficient with every model you build. Don’t let the details of the process discourage you. It may seem like a complex process, but once you understand the steps, it will become second nature to you.

[1] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0)
