Deep Dive into Softmax Regression

With these gradients, we can use (stochastic) gradient descent to minimize the loss function on the given training set.
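To make this concrete, here is a minimal NumPy sketch of a single batch gradient descent step for softmax regression (the names softmax, gradient_descent_step, W, b, Y and lr are illustrative and not part of any library):

import numpy as np

def softmax(Z):
    # Subtract the row-wise maximum for numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    exp_Z = np.exp(Z)
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)

def gradient_descent_step(W, b, X, Y, lr=0.1):
    # X: (n, d) inputs, Y: (n, k) one-hot labels, W: (k, d) weights, b: (k,) biases
    n = X.shape[0]
    P = softmax(X @ W.T + b)       # predicted class probabilities, shape (n, k)
    error = P - Y                  # gradient of the cross-entropy loss w.r.t. the logits
    W -= lr * (error.T @ X) / n    # update the weights
    b -= lr * error.mean(axis=0)   # update the biases
    return W, b

Running this update repeatedly over the whole training set (or over random mini-batches, for stochastic gradient descent) drives the cross-entropy loss down.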

Practice Question

Suppose you are given a set of images and you need to classify them into dogs/cats and outdoor/indoor. Should you implement two logistic regression classifiers or one softmax regression classifier?

The answer can be found at the end of the article.

Softmax Regression in Scikit-Learn

The class LogisticRegression can handle both binary and multi-class classification problems. It has a parameter called multi_class, which is set to ‘auto’ by default. This means that Scikit-Learn automatically applies softmax regression whenever it detects that the problem is multi-class and the chosen solver supports optimization of the multinomial loss (all solvers support it except ‘liblinear’).
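For instance, on a multi-class data set the two classifiers below optimize the same multinomial (softmax) loss. This is only an illustrative sketch; note that in recent Scikit-Learn releases the multi_class parameter is deprecated, and the multinomial loss is simply used by default for multi-class problems:

from sklearn.linear_model import LogisticRegression

# Default behaviour: multi_class='auto' selects the multinomial (softmax) loss
# whenever there are more than two classes and the solver supports it
softmax_clf = LogisticRegression(solver='lbfgs')

# Explicit form (emits a deprecation warning on recent Scikit-Learn versions)
explicit_clf = LogisticRegression(solver='lbfgs', multi_class='multinomial')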

Example: Classifying Handwritten Digits

For instance, let’s train a softmax regression model on the MNIST data set, which is a widely used data set for image classification tasks.

The data set contains 60,000 training images and 10,000 test images of handwritten digits. Each image is 28 × 28 pixels in size and is typically represented by a vector of 784 numbers in the range [0, 255]. The task is to classify these images into one of the ten digits (0–9).

Loading the Data Set

We first fetch the MNIST data set using the fetch_openml() function:

from sklearn.datasets import fetch_openml

X, y = fetch_openml('mnist_784', return_X_y=True, as_frame=False)

Let’s examine the shape of X:

print(X.shape)
(70000, 784)

X consists of 70,000 vectors, each of which has 784 pixel values.

Let’s display the first 50 digits in the data set:

import matplotlib.pyplot as plt

# Plot the first 50 digits in a 5 x 10 grid
fig, axes = plt.subplots(5, 10, figsize=(10, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X[i].reshape(28, 28), cmap='binary')
    ax.axis('off')
The first 50 digits from the MNIST data set

Next, we scale the inputs to be within the range [0, 1] instead of [0, 255]:

X = X / 255

Feature scaling is important whenever you use an iterative optimization method such as gradient descent to train your model.

We now split the data into training and test sets. Note that the first 60,000 images in MNIST are already designated for training, so we can simply use slicing for the split:

train_size = 60000
X_train, y_train = X[:train_size], y[:train_size]
X_test, y_test = X[train_size:], y[train_size:]

Constructing the Model

We now create a LogisticRegression classifier with its default settings and fit it to the training set:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)

We get a warning message that the maximum number of iterations has been reached:

ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(

Let’s increase max_iter to 1000 (instead of the default 100):

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

This time the training converged before reaching the maximum number of iterations. We can check how many iterations were needed for convergence by inspecting the n_iter_ attribute:

print(clf.n_iter_)
[795]

It took 795 iterations for the training to converge.

Evaluating the Model

The accuracy of the model on the training and the test sets is:

import numpy as np

print('Training set accuracy: ', np.round(clf.score(X_train, y_train), 4))
print('Test set accuracy: ', np.round(clf.score(X_test, y_test), 4))
Training set accuracy: 0.9393
Test set accuracy: 0.9256

These results are good, but modern deep neural networks can achieve significantly better results on this data set (up to 99.91% accuracy on the test set!). The softmax regression model is roughly equivalent to a neural network with a single layer of perceptrons that use the softmax activation function. Therefore, it is not surprising that a deep network can achieve better results than our model.
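We can check this equivalence directly: the model’s predicted probabilities are just the softmax of a single linear layer whose weights and biases are stored in coef_ and intercept_. The following is a small sketch (not part of the original walkthrough) that compares a manual computation against predict_proba:

def softmax(Z):
    # Subtract the row-wise maximum for numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    exp_Z = np.exp(Z)
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)

# One linear layer followed by softmax reproduces the model's probabilities
logits = X_test @ clf.coef_.T + clf.intercept_   # shape (10000, 10)
manual_proba = softmax(logits)

print(np.allclose(manual_proba, clf.predict_proba(X_test)))  # expected: True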

To better understand the errors of our model, let’s display its confusion matrix:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_test_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_test_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot(cmap='Blues')

The confusion matrix on the test set

We can see that the main confusions of the model are between the digits 5⇔8 and 4⇔9. This makes sense, since these digits often resemble each other when written by hand. To help our model distinguish between them, we can add more examples of these digits (e.g., by using data augmentation) or extract additional features from the images (e.g., the number of closed loops in the digit).
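As a quick sanity check (a small sketch, not part of the original walkthrough), we can read the most frequent confusions directly off the confusion matrix by listing its largest off-diagonal entries:

# Zero out the diagonal (correct predictions) and print the five largest
# remaining entries, i.e., the most common confusions on the test set
errors = cm.copy()
np.fill_diagonal(errors, 0)
for idx in np.argsort(errors, axis=None)[::-1][:5]:
    i, j = np.unravel_index(idx, errors.shape)
    print(f'true {clf.classes_[i]} predicted as {clf.classes_[j]}: {errors[i, j]} times')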

We can also print the classification report to get the precision, recall and F1 score for each class:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96       980
           1       0.96      0.98      0.97      1135
           2       0.93      0.90      0.91      1032
           3       0.90      0.92      0.91      1010
           4       0.94      0.94      0.94       982
           5       0.90      0.87      0.88       892
           6       0.94      0.95      0.95       958
           7       0.93      0.92      0.93      1028
           8       0.88      0.88      0.88       974
           9       0.91      0.92      0.91      1009

    accuracy                           0.93     10000
   macro avg       0.92      0.92      0.92     10000
weighted avg       0.93      0.93      0.93     10000

As expected, the digits on which the model gets the lowest scores are 5 and 8.

Visualizing the Weights

One of the advantages of softmax regression is that it is highly interpretable (unlike “black box” models such as neural networks). The weight associated with each feature represents the importance of that feature.

For example, we can plot the weights associated with each pixel in each one of the digit classes (the weight vector wⱼ for each j ∈ {1, …, 10}). This will show us the important regions in the images that are used to detect each digit.

The weight matrix of the model is stored in an attribute called coef_:

print(clf.coef_.shape)
(10, 784)

Row i of this matrix contains the learned weights of the model for class i. We can display each row as a 28 × 28 pixel image in order to examine the weights associated with each pixel in each one of the classes:

fig, axes = plt.subplots(2, 5, figsize=(15, 5))

# Display the weight vector of each digit class as a 28 x 28 image
for digit, (coef, ax) in enumerate(zip(clf.coef_, axes.flat)):
    im = ax.imshow(coef.reshape(28, 28), cmap='gray')
    ax.axis('off')
    ax.set_title(str(digit))

fig.colorbar(im, ax=axes.flat)

Pixels with bright shades have a positive impact on the prediction, while pixels with dark shades have a negative impact. Pixels whose weights are around 0 (shown in medium gray) have no influence on the prediction (such as the pixels near the border of the image).

The pros and cons of softmax regression as compared to other multi-class classification models are:

Pros:

  • Provides class probability estimates
  • Highly scalable, requiring a number of parameters linear in the number of features
  • Highly interpretable (the weight associated with each feature represents its importance)
  • Can handle redundant features (by assigning them weights close to 0)

Cons:

  • Can find only linear decision boundaries between the classes
  • Usually outperformed by more complex models
  • Cannot deal with missing values
