Mastering Logistic Regression
Background: Binary Classification Problems
The Logistic Regression Model
Log Loss
Gradient Descent
Implementation in Python
The LogisticRegression Class in Scikit-Learn
Summary

The next plot shows the log loss when y = 1:

The log loss equals 0 only in the case of a perfect prediction (p = 1 and y = 1, or p = 0 and y = 0), and approaches infinity as the prediction gets worse (i.e., when y = 1 and p → 0 or y = 0 and p → 1).
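To make this behavior concrete, here is a small numerical check of the per-sample log loss for y = 1 (a sketch that is not part of the original code):

import numpy as np

# Per-sample log loss: -[y*log(p) + (1 - y)*log(1 - p)]
def log_loss_sample(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# For y = 1, the loss shrinks as p approaches 1 and blows up as p approaches 0
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f'y=1, p={p}: loss = {log_loss_sample(1, p):.3f}')
# prints approximately 0.010, 0.105, 0.693, 2.303, 4.605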

The cost function calculates the average loss over the entire data set:

J(w) = -(1/n) ∑ᵢ₌₁ⁿ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)]

This function can be written in vectorized form as follows:

J(w) = -(1/n) [yᵗ log(p) + (1 - y)ᵗ log(1 - p)]

where y = (y₁, …, yₙ) is a vector that contains the labels of all the training samples, and p = (p₁, …, pₙ) is a vector that contains the predicted probabilities of the model for all the training samples.

This cost function is convex, i.e., it has a single global minimum. However, there is no closed-form solution for finding the optimal w* (due to the non-linearities introduced by the log function). Therefore, we need to use iterative optimization methods such as gradient descent in order to find the minimum.

Gradient descent is an iterative approach for finding a minimum of a function, where we take small steps in the opposite direction of the gradient in order to get closer to the minimum:

Gradient descent

In order to use gradient descent to find the minimum of the cost J(w), we need to compute its partial derivatives with respect to each of the weights. The partial derivative of J(w) with respect to a given weight wⱼ is:

∂J(w)/∂wⱼ = (1/n) ∑ᵢ₌₁ⁿ (pᵢ - yᵢ)xᵢⱼ

where xᵢⱼ is the j-th feature of the i-th sample.


Thus, the gradient vector can be written in vectorized form as follows:

∇J(w) = (1/n) Xᵗ(p - y)

And the gradient descent update rule is:

w ← w - α∇J(w)

where α is a learning rate that controls the step size (0 < α < 1).
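To see how this update rule behaves on a simple function, here is a toy sketch (not from the article) that minimizes f(w) = (w - 3)², whose gradient is 2(w - 3):

# Toy example of the update rule w := w - alpha * grad
w, alpha = 0.0, 0.1
for _ in range(50):
    grad = 2 * (w - 3)   # gradient of (w - 3)^2
    w = w - alpha * grad
print(w)   # close to the minimum at w = 3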

Note that whenever you use gradient descent, you should make sure that your data set is normalized (otherwise gradient descent may take steps of different sizes in different directions, which will make it unstable).
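A common way to do this is standardization, i.e., rescaling each feature to zero mean and unit variance. A minimal sketch, assuming X is a NumPy feature matrix of shape (n_samples, n_features) (the Iris features used below have similar ranges, so this step is skipped in this article):

# Standardize each feature (column) to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)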

We will now implement the logistic regression model in Python from scratch, including the cost function and gradient computation, optimizing the model using gradient descent, evaluating the model, and plotting the final decision boundary.

For the demonstration we will use the Iris data set (BSD license). The original data set contains 150 samples of Iris flowers that belong to one of three species (setosa, versicolor and virginica). We will turn it into a binary classification problem by using only the first two types of flowers (setosa and versicolor). In addition, we will use only the first two features of each flower (sepal length and sepal width).

Loading the Data Set

Let's first import the required libraries and fix the random seed in order to get reproducible results:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)

Next, we load the data set:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, :2]  # Take only the first two features
y = iris.target

# Take only the setosa and versicolor flowers
X = X[(y == 0) | (y == 1)]
y = y[(y == 0) | (y == 1)]
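As a quick check (not part of the original code), we should now be left with 100 samples, 50 from each class:

print(X.shape)         # (100, 2)
print(np.bincount(y))  # [50 50]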

Let's plot the data:

def plot_data(X, y):
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris.target_names[y], style=iris.target_names[y],
                    palette=['r', 'b'], markers=('s', 'o'), edgecolor='k')
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.legend()

plot_data(X, y)
The Iris data set

As can be seen, the data set is linearly separable, so logistic regression should be able to find the boundary between the two classes.

Next, we need to add a column of ones to the feature matrix X in order to represent the bias (w₀):

# Add a column for the bias
n = X.shape[0]
X_with_bias = np.hstack((np.ones((n, 1)), X))
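As a sanity check (not in the original code), every row of X_with_bias should now start with a constant 1:

print(X_with_bias.shape)  # (100, 3)
print(X_with_bias[:3])    # the first column is all ones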

We now split the data set into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_with_bias, y, random_state=0)
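With the default test_size of 0.25, this leaves 75 samples for training and 25 for testing (a quick check that is not part of the original code):

print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)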

Model Implementation

We are now ready to implement the logistic regression model. We start by defining a helper function to compute the sigmoid function:

def sigmoid(z):
    """ Compute the sigmoid of z (z can be a scalar or a vector). """
    z = np.array(z)
    return 1 / (1 + np.exp(-z))
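As a quick check of this helper (not part of the original code), σ(0) should be 0.5, and the function should work elementwise on arrays:

print(sigmoid(0))             # 0.5
print(sigmoid([-10, 0, 10]))  # approximately [0.0000454 0.5 0.9999546]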

Next, we implement the cost function, which returns the cost of a logistic regression model with parameters w on a given data set (X, y), and also its gradient with respect to w.

def cost_function(X, y, w):
    """ J, grad = cost_function(X, y, w) computes the cost of a logistic regression model
        with parameters w and the gradient of the cost w.r.t. the parameters. """
    # Compute the cost
    p = sigmoid(X @ w)
    J = -(1/n) * (y @ np.log(p) + (1-y) @ np.log(1-p))

    # Compute the gradient
    grad = (1/n) * X.T @ (p - y)
    return J, grad

Note that we are using the vectorized forms of the cost and gradient functions that were shown previously.

To sanity check this function, let's compute the cost and gradient of the model for some random weight vector:

w = np.random.rand(X_train.shape[1])
cost, grad = cost_function(X_train, y_train, w)

print('w:', w)
print('Cost at w:', cost)
print('Gradient at w:', grad)

The output we get is:

w: [0.5488135  0.71518937 0.60276338]
Cost at w: 2.314505839067951
Gradient at w: [0.36855061 1.86634895 1.27264487]
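Another useful sanity check (not part of the article) is to compare the analytical gradient with a finite-difference approximation of the cost; the two should agree to many decimal places:

# Numerical gradient check (central differences)
eps = 1e-6
num_grad = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    num_grad[j] = (cost_function(X_train, y_train, w_plus)[0] -
                   cost_function(X_train, y_train, w_minus)[0]) / (2 * eps)

print(np.max(np.abs(num_grad - grad)))  # should be very small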

Gradient Descent Implementation

We will now implement gradient descent in order to find the optimal w* that minimizes the cost function on a given training set. The algorithm will run at most max_iter passes over the training set (defaults to 5000), unless the cost has not decreased by at least tol since the previous iteration (defaults to 0.0001), in which case the training stops immediately.

def optimize_model(X, y, alpha=0.01, max_iter=5000, tol=0.0001):
    """ Optimize the model using gradient descent.
        X, y: The training set
        alpha: The learning rate
        max_iter: The maximum number of passes over the training set (epochs)
        tol: The stopping criterion. Training will stop when (new_cost > cost - tol)
    """
    w = np.random.rand(X.shape[1])
    cost, grad = cost_function(X, y, w)

    for i in range(max_iter):
        w = w - alpha * grad
        new_cost, grad = cost_function(X, y, w)
        if new_cost > cost - tol:
            print(f'Converged after {i} iterations')
            return w, new_cost
        cost = new_cost

    print('Maximum number of iterations reached')
    return w, cost

Normally at this point you would have to normalize your data set, since gradient descent does not work well with features that have different scales. In our specific data set normalization is not necessary, since the ranges of the two features are similar.

Let’s now call this function to optimize our model:

opt_w, cost = optimize_model(X_train, y_train)

print('opt_w:', opt_w)
print('Cost at opt_w:', cost)

The algorithm converges after 1,413 iterations, and the optimal w* we get is:

Converged after 1413 iterations
opt_w: [ 0.28014029 0.80541854 -1.48367938]
Cost at opt_w: 0.28389717767222555

There are other solvers you can use for the optimization that are often faster than gradient descent, such as conjugate gradient (CG) and truncated Newton (TNC). See scipy.optimize.minimize for more details on how to use these optimizers.
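For example, here is a sketch (not part of the article's code) of how the same model could be optimized with scipy.optimize.minimize, using the fact that cost_function already returns both the cost and its gradient (which is what jac=True expects):

from scipy.optimize import minimize

# Wrap cost_function so that w is the first argument, as minimize expects
res = minimize(lambda w: cost_function(X_train, y_train, w),
               x0=np.random.rand(X_train.shape[1]),
               jac=True,       # the objective returns (cost, gradient)
               method='TNC')
print(res.x)    # optimized weights
print(res.fun)  # cost at the optimum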

Using the Model for Predictions

Now that we have found the optimal parameters of the model, we can use it for predictions.

First, let's write a function that takes a matrix of new samples X and returns their probabilities of belonging to the positive class:

def predict_prob(X, w):
    """ Return the probability that samples in X belong to the positive class
        X: the feature matrix (every row in X represents one sample)
        w: the learned logistic regression parameters
    """
    p = sigmoid(X @ w)
    return p

The function computes the predictions of the model by simply taking the sigmoid of Xw (which computes σ(wᵗx) for every row x in the matrix).

For example, let's find the probability that a sample located at (6, 2) belongs to the versicolor class:

predict_prob([[1, 6, 2]], opt_w)
array([0.89522808])

This sample has an 89.52% chance of being a versicolor flower. This makes sense, since this sample is located well within the area of the versicolor flowers, far from the border between the classes.

On the other hand, the probability that a sample located at (5.5, 3) belongs to the versicolor class is:

predict_prob([[1, 5.5, 3]], opt_w)
array([0.56436688])

This time the probability is much lower (only 56.44%), since this sample is close to the border between the classes.
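Finally, here is a sketch (not shown in this excerpt) of how the model could be evaluated on the test set by thresholding the predicted probabilities at 0.5:

# Predict class 1 (versicolor) when the predicted probability is at least 0.5
y_pred = (predict_prob(X_test, opt_w) >= 0.5).astype(int)
accuracy = np.mean(y_pred == y_test)
print(f'Test accuracy: {accuracy:.4f}')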
