Exploring New Hyperparameter Dimensions with Laplace Approximated Bayesian Optimization


Is it better than grid search?

Image by author from Canva

When I notice my model is overfitting, I often think, "It's time to regularize". But how do I decide which regularization method to use (L1, L2) and which parameters to choose? Typically, I perform hyperparameter optimization via a grid search to select the settings. However, what happens if the independent variables have different scales or varying levels of influence? Can I design a hyperparameter grid with different regularization coefficients for each variable? Is such an optimization feasible in high-dimensional spaces? And are there alternative ways to design regularization? Let's explore this with a hypothetical example.

My fictional example is a binary classification use case with 3 explanatory variables. Each of these variables is categorical and has 7 different categories. My reproducible use case is in this notebook. The function that generates the dataset is the following:

import numpy as np
import pandas as pd
from scipy.special import expit


def get_classification_dataset():
    n_samples = 200
    cats = ["a", "b", "c", "d", "e", "f"]
    X = pd.DataFrame(
        data={
            "col1": np.random.choice(cats, size=n_samples),
            "col2": np.random.choice(cats, size=n_samples),
            "col3": np.random.choice(cats, size=n_samples),
        }
    )
    X_preprocessed = pd.get_dummies(X)

    theta = np.random.multivariate_normal(
        np.zeros(len(cats) * X.shape[1]),
        np.diag(
            np.array([1e-1] * len(cats) + [1] * len(cats) + [1e1] * len(cats))
        ),
    )

    y = pd.Series(
        data=np.random.binomial(1, expit(np.dot(X_preprocessed.to_numpy(), theta))),
        index=X_preprocessed.index,
    )
    return X_preprocessed, y

For information, I deliberately chose 3 different values for the theta covariance matrix to showcase the benefit of the Laplace approximated Bayesian optimization method. If the values were somehow similar, the interest would be minor.

Along with a simple baseline model that predicts the mean observed value on the training dataset (used for comparison purposes), I opted to design a slightly more complex model. I decided to one-hot encode the three independent variables and apply a logistic regression model on top of this basic preprocessing. For regularization, I chose an L2 design and aimed to find the optimal regularization coefficient using two techniques: grid search and Laplace approximated Bayesian optimization, as you may have anticipated by now. Finally, I evaluated the model on a test dataset using two metrics (arbitrarily chosen): log loss and AUC ROC.
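For reference, the grid-search part of this setup can be sketched as follows; the parameter grid and cross-validation settings here are illustrative assumptions, not the exact choices from the notebook:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_preprocessed, y = get_classification_dataset()

# grid search over a single L2 regularization strength shared by all features
grid = GridSearchCV(
    LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="neg_log_loss",
    cv=5,
)
grid.fit(X_preprocessed, y)
print(grid.best_params_)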

Before presenting the results, let's first take a closer look at the Bayesian model and how we optimize it.

In the Bayesian framework, the parameters are no longer fixed constants, but random variables. Instead of maximizing the likelihood to estimate these unknown parameters, we now optimize the posterior distribution of the random parameters, given the observed data. This requires us to choose, often somewhat arbitrarily, the design and parameters of the prior. However, it is also possible to treat the parameters of the prior as random variables themselves, like in Inception, where the layers of uncertainty keep stacking on top of one another...

In this study, I have chosen the following model:
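In symbols, a plausible way to write this model, consistent with the description just below, is:

$$
Y_i \mid \theta \sim \mathrm{Bernoulli}\big(\mathrm{expit}(x_i^\top \theta)\big),
\qquad
\theta \mid \Sigma \sim \mathcal{N}(0, \Sigma),
\qquad
\Sigma_i^{-1} \sim \mathrm{Gamma}(\alpha, \beta),
$$

with Σ a diagonal covariance matrix, Σ = diag(σ_1², ..., σ_d²).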

I have logically chosen a Bernoulli model for Y_i | θ, a normal centered prior corresponding to an L2 regularization for θ | Σ, and finally a Gamma model for Σ_i^{-1}. I chose to model the precision matrix instead of the covariance matrix, as is traditional in the literature, like in the scikit-learn user guide for Bayesian linear regression [2].

In addition to this written model, I assumed that Y_i and Y_j are conditionally (on θ) independent, as well as Y_i and Σ.

Likelihood

Based on the model, the likelihood can consequently be written:
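A plausible reconstruction of this quantity, applying Bayes' rule twice and valid for any value of θ, is:

$$
P(\Sigma \mid Y = y) \;=\; \frac{P(Y = y \mid \theta)\, P(\theta \mid \Sigma)\, P(\Sigma)}{P(\theta \mid Y = y, \Sigma)\, P(Y = y)}.
$$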

In order to optimize, we need to evaluate nearly all of the terms, except P(Y=y). The terms in the numerator can be evaluated using the chosen model. However, the remaining term in the denominator cannot. This is where the Laplace approximation comes into play.

Laplace approximation

In order to evaluate the first term of the denominator, we can leverage the Laplace approximation. We approximate the distribution of θ | Y, Σ by:
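In a form consistent with the loss implemented further below, the approximation reads:

$$
P(\theta \mid Y = y, \Sigma) \;\approx\; \mathcal{N}\big(\theta \mid \theta^*, H^{-1}\big),
\qquad
H \;=\; -\nabla^2_{\theta} \log P(\theta \mid Y = y, \Sigma)\Big|_{\theta = \theta^*},
$$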

with θ* being the mode of the density distribution of θ | Y, Σ.

Even though we do not know the density function, we can evaluate the Hessian part thanks to the following decomposition:
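A plausible form of this decomposition, obtained from Bayes' rule, is:

$$
P(\theta \mid Y = y, \Sigma) \;=\; \frac{P(Y = y \mid \theta)\, P(\theta \mid \Sigma)}{P(Y = y \mid \Sigma)}.
$$

The denominator does not depend on θ, so it vanishes when taking the Hessian of the log density with respect to θ.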

We only need to know the first two terms of the numerator to evaluate the Hessian, which we do.

For those interested in further explanation, I suggest section 4.4, "The Laplace Approximation", from Pattern Recognition and Machine Learning by Christopher M. Bishop [1]. It helped me a lot to understand the approximation.

Laplace approximated likelihood

Finally, the Laplace approximated likelihood to optimize is:
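A plausible reconstruction, replacing the intractable denominator term with its Gaussian approximation q, is:

$$
P(\Sigma \mid Y = y) \;\approx\; \frac{P(Y = y \mid \theta)\, P(\theta \mid \Sigma)\, P(\Sigma)}{q(\theta)\, P(Y = y)},
\qquad
q(\theta) = \mathcal{N}\big(\theta \mid \theta^*, H^{-1}\big).
$$

Evaluated at θ = θ*, the Gaussian density reduces to (2π)^{-d/2} |H|^{1/2}, with d the number of features, which gives the log-determinant and log(2π) terms in the loss implemented below.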

Once we approximate the density function of θ | Y, Σ, we could finally evaluate the likelihood at whatever θ we want, if the approximation were accurate everywhere. For the sake of simplicity, and because the approximation is accurate only near the mode, we evaluate the approximated likelihood at θ*.

Hereafter is a function that evaluates this loss for a given (scalar) σ² = 1/p (along with the given observations, X and y, and design values, α and β).

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

from module.bayesian_model import BayesianLogisticRegression


def loss(p, X, y, alpha, beta):
    # computation of the loss for given values:
    # - p: the precision 1/sigma² (a scalar here)
    # - X: matrix of features
    # - y: vector of observations
    # - alpha: prior Gamma distribution alpha parameter over 1/sigma²
    # - beta: prior Gamma distribution beta parameter over 1/sigma²
    model = BayesianLogisticRegression()

    n_feat = X.shape[1]
    m_vec = np.array([0] * n_feat)
    p_vec = np.array([p] * n_feat)

    # computation of theta*, the mode of the posterior of theta given Y and Sigma
    res = minimize(
        model._loss,
        np.array([0] * n_feat),
        args=(X, y, m_vec, p_vec),
        method="BFGS",
        jac=model._jac,
    )
    theta_star = res.x

    # computation of the Hessian for the Laplace approximation
    H = model._hess(theta_star, X, y, m_vec, p_vec)

    # loss
    out = 0
    ## first two terms: the log loss and the regularization term
    out += model._loss(theta_star, X, y, m_vec, p_vec)
    ## third term: prior distribution over sigma, written p here
    out -= gamma.logpdf(p, a=alpha, scale=1 / beta)
    ## fourth term: Laplace approximated last term
    out += 0.5 * np.linalg.slogdet(H)[1] - 0.5 * n_feat * np.log(2 * np.pi)

    return out
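As a quick illustration, the loss can be evaluated on a coarse range of candidate precisions; the alpha and beta values here are arbitrary placeholders, not the ones from the notebook:

# illustrative usage: evaluate the loss for a few candidate precisions
X_preprocessed, y = get_classification_dataset()
for p in [0.01, 0.1, 1.0, 10.0]:
    print(p, loss(p, X_preprocessed.to_numpy(), y.to_numpy(), alpha=1e-2, beta=1e-2))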

In my use case, I have chosen to optimize it via the Adam optimizer, whose code has been taken from this repo.

import numpy as np
from scipy.optimize import OptimizeResult


def adam(
    fun,
    x0,
    jac,
    args=(),
    learning_rate=0.001,
    beta1=0.9,
    beta2=0.999,
    eps=1e-8,
    startiter=0,
    maxiter=1000,
    callback=None,
    **kwargs
):
    """``scipy.optimize.minimize`` compatible implementation of ADAM -
    [http://arxiv.org/pdf/1412.6980.pdf].
    Adapted from ``autograd/misc/optimizers.py``.
    """
    x = x0
    m = np.zeros_like(x)
    v = np.zeros_like(x)

    for i in range(startiter, startiter + maxiter):
        g = jac(x, *args)

        if callback and callback(x):
            break

        m = (1 - beta1) * g + beta1 * m  # first moment estimate
        v = (1 - beta2) * (g**2) + beta2 * v  # second moment estimate
        mhat = m / (1 - beta1**(i + 1))  # bias correction
        vhat = v / (1 - beta2**(i + 1))
        x = x - learning_rate * mhat / (np.sqrt(vhat) + eps)

    i += 1
    return OptimizeResult(x=x, fun=fun(x, *args), jac=g, nit=i, nfev=i, success=True)

For this optimization, we need the derivative of the previous loss. We do not have an analytical form, so I decided to use a numerical approximation of the derivative.
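A minimal sketch of such a numerical derivative, written here as a hypothetical helper using central finite differences (the step size is arbitrary, and the loss is assumed to accept its precision parameter as a NumPy array), which can be passed as the jac argument of adam:

import numpy as np


def numerical_jac(fun, eps=1e-6):
    # build a finite-difference gradient of ``fun`` with respect to its first argument
    def jac(x, *args):
        x = np.asarray(x, dtype=float)
        grad = np.zeros_like(x)
        for k in range(x.size):
            step = np.zeros_like(x)
            step[k] = eps
            # central difference on coordinate k
            grad[k] = (fun(x + step, *args) - fun(x - step, *args)) / (2 * eps)
        return grad
    return jac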

Once the model is trained on the training dataset, it is necessary to make predictions on the evaluation dataset to assess its performance and compare different models. However, it is not possible to directly calculate the actual distribution of a new point, since the computation is intractable.

It is possible to approximate the result by evaluating the prediction under an approximation of the posterior of θ.
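For instance, a minimal sketch under the simplest assumption, a point estimate of θ at its mode θ* (the actual notebook may use a richer approximation of the posterior), is:

$$
P(Y = 1 \mid X = x, \mathcal{D}) \;=\; \int \mathrm{expit}(x^\top \theta)\, P(\theta \mid \mathcal{D})\, d\theta \;\approx\; \mathrm{expit}(x^\top \theta^*),
$$

where D denotes the training data.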

I chose an uninformative prior over the precision random variable. The naive model performs poorly, with a log loss of 0.60 and an AUC ROC of 0.50. The second model performs better, with a log loss of 0.44 and an AUC ROC of 0.83, both when hyperoptimized using grid search and Bayesian optimization. This indicates that the logistic regression model, which incorporates the independent variables, outperforms the naive model. However, there is no advantage to using Bayesian optimization over grid search, so I'll stick with grid search for now. Thanks for reading.

... But wait, I am thinking. Why are my parameters regularized with the same coefficient? Shouldn't my prior depend on the underlying independent variables? Perhaps the parameters for the first independent variable could take higher values, while those for the second independent variable, with its smaller influence, should be closer to zero. Let's explore these new dimensions.

So far we have considered two techniques, grid search and Bayesian optimization. We can use these same techniques in higher dimensions.

Considering new dimensions could dramatically increase the number of nodes of my grid. This is why Bayesian optimization makes sense in higher dimensions to get the best regularization coefficients. In the considered use case, I have supposed there are 3 regularization parameters, one for each independent variable. After encoding a single variable, I assumed the generated new variables all shared the same regularization parameter. Hence a total of 3 regularization parameters, even if there are more than 3 columns as inputs to the logistic regression.
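To make the mapping concrete, here is a minimal sketch, using a hypothetical helper and assuming the column naming convention of pandas.get_dummies, of how each one-hot column can be mapped back to the index of its regularization parameter, as used by the loss below:

def build_col_to_p_id(X_columns, original_cols=("col1", "col2", "col3")):
    # pandas.get_dummies names columns "<original>_<category>", e.g. "col1_a";
    # all dummies of a given original variable share the same precision index
    return {col: original_cols.index(col.rsplit("_", 1)[0]) for col in X_columns}

# e.g. build_col_to_p_id(X_preprocessed.columns) -> {"col1_a": 0, ..., "col3_f": 2}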

I updated the previous loss function with the following code:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

from module.bayesian_model import BayesianLogisticRegression


def loss(p, X, y, alpha, beta, X_columns, col_to_p_id):
    # computation of the loss for given values:
    # - p: vector of precisions 1/sigma²
    # - X: matrix of features
    # - y: vector of observations
    # - alpha: prior Gamma distribution alpha parameter over 1/sigma²
    # - beta: prior Gamma distribution beta parameter over 1/sigma²
    # - X_columns: list of names of X columns
    # - col_to_p_id: dictionary mapping a column name to a p index,
    #   because many column names can share the same p value
    model = BayesianLogisticRegression()

    n_feat = X.shape[1]
    m_vec = np.array([0] * n_feat)
    p_list = []
    for col in X_columns:
        p_list.append(p[col_to_p_id[col]])
    p_vec = np.array(p_list)

    # computation of theta*, the mode of the posterior of theta given Y and Sigma
    res = minimize(
        model._loss,
        np.array([0] * n_feat),
        args=(X, y, m_vec, p_vec),
        method="BFGS",
        jac=model._jac,
    )
    theta_star = res.x

    # computation of the Hessian for the Laplace approximation
    H = model._hess(theta_star, X, y, m_vec, p_vec)

    # loss
    out = 0
    ## first two terms: the log loss and the regularization term
    out += model._loss(theta_star, X, y, m_vec, p_vec)
    ## third term: prior distribution over 1/sigma², written p here
    ## there is now a sum as p is now a vector
    out -= np.sum(gamma.logpdf(p, a=alpha, scale=1 / beta))
    ## fourth term: Laplace approximated last term
    out += 0.5 * np.linalg.slogdet(H)[1] - 0.5 * n_feat * np.log(2 * np.pi)

    return out
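Putting the pieces together, a hypothetical end-to-end call could look like the following; alpha, beta, the starting point and the iteration budget are illustrative, and build_col_to_p_id and numerical_jac are the sketched helpers from above:

X_preprocessed, y = get_classification_dataset()
col_to_p_id = build_col_to_p_id(X_preprocessed.columns)

# optimize the three precisions (one per original variable) with Adam
res = adam(
    loss,
    x0=np.ones(3),
    jac=numerical_jac(loss),
    args=(
        X_preprocessed.to_numpy(),
        y.to_numpy(),
        1e-2,
        1e-2,
        list(X_preprocessed.columns),
        col_to_p_id,
    ),
    maxiter=200,
)
print(res.x)  # optimized precision (1/sigma²), one per original variable

In practice, one may prefer to optimize the log-precisions instead, so that the precisions stay positive during the gradient updates.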

With this approach, the metrics evaluated on the test dataset are the following: a log loss of 0.39 and an AUC ROC of 0.88, which are better than those of the initial model optimized via a grid search and a Bayesian approach with only a single prior for all the independent variables.

Metrics achieved with the different methods on my use case.

The use case can be reproduced with this notebook.

I have created an example to illustrate the usefulness of the technique. However, I have not been able to find a suitable real-world dataset to fully demonstrate its potential. While I was working with an actual dataset, I could not derive any significant benefit from applying this technique. If you come across one, please let me know; I would be excited to see a real-world application of this regularization method.

In conclusion, using Bayesian optimization (with Laplace approximation if needed) to determine the best regularization parameters may be a good alternative to traditional hyperparameter tuning methods. By leveraging probabilistic models, Bayesian optimization not only reduces the computational cost but also enhances the likelihood of finding optimal regularization values, especially in high dimensions.

  1. Christopher M. Bishop. (2006). Pattern Recognition and Machine Learning. Springer.
  2. Bayesian Ridge Regression scikit-learn user guide: https://scikit-learn.org/1.5/modules/linear_model.html#bayesian-ridge-regression