to tune hyperparameters of deep learning models (Keras Sequential models), compared with a conventional approach: Grid Search.
Bayesian Optimization
Bayesian Optimization is a sequential design strategy for global optimization of black-box functions.
It is especially well-suited to functions that are expensive to evaluate, lack an analytical form, or have unknown derivatives.
In the context of hyperparameter optimization, the unknown function is the objective function being optimized, which can be:
- accuracy on a training or validation set,
- loss on a training or validation set,
- entropy gained or lost,
- AUC for ROC curves,
- A/B test results,
- computation cost per epoch,
- model size,
- reward amount for reinforcement learning, and more.
Unlike traditional optimization methods that depend on direct function evaluations, Bayesian Optimization builds and refines a probabilistic model of the target function, using this model to intelligently select the subsequent evaluation point.
The core idea revolves around two key components:
1. Surrogate Model (Probabilistic Model)
Bayesian Optimization approximates the unknown objective function f(x) with a surrogate model, typically a Gaussian Process (GP).
A GP is a non-parametric Bayesian model that defines a distribution over functions. It provides:
- a prediction of the function value at a given point μ(x) and
- a measure of uncertainty around that prediction σ(x), often represented as a confidence interval.
Mathematically, for a Gaussian Process, the prediction at an unobserved point x∗, given observed data (X, y), is normally distributed:

f(x∗) | X, y ∼ N(μ(x∗), σ²(x∗))

where
- μ(x∗): the mean prediction and
- σ²(x∗): the predictive variance (a minimal code sketch of these predictions follows below).
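As a concrete illustration, here is a minimal sketch of a GP surrogate producing μ(x∗) and σ(x∗), using scikit-learn's GaussianProcessRegressor on a made-up one-dimensional toy objective (purely illustrative; this is not KerasTuner's internal surrogate):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# toy black-box objective, standing in for an expensive function
def f(x):
    return np.sin(3 * x) + 0.5 * x

# a handful of observed points (X, y)
X_obs = np.array([[0.2], [0.9], [1.7], [2.5]])
y_obs = f(X_obs).ravel()

# fit the GP surrogate to the observations
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

# posterior mean mu(x*) and standard deviation sigma(x*) at unobserved points
X_star = np.linspace(0.0, 3.0, 100).reshape(-1, 1)
mu, sigma = gp.predict(X_star, return_std=True)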
2. Acquisition Function
The acquisition function determines the next point x_(t+1) to evaluate by quantifying how “promising” a candidate point is for improving the objective function, balancing:
- Exploration (High Variance): Sampling in areas with high uncertainty to discover new promising regions and
- Exploitation (High Mean): Sampling in areas where the surrogate model predicts high values close to the current best.
Common acquisition functions include:
Probability of Improvement (PI)
PI selects the point that has the highest probability of improving upon the current best observed value f(x⁺):

PI(x) = Φ((μ(x) − f(x⁺) − ξ) / σ(x))
where
- Φ: the cumulative distribution function (CDF) of the standard normal distribution, and
- ξ ≥ 0: a trade-off parameter between exploration and exploitation; a larger ξ encourages more exploration.
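As a quick reference, here is a minimal sketch of PI, assuming a maximization problem, with mu and sigma coming from a surrogate such as the GP above and f_best the current best observed value:

import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    # PI(x) = Phi((mu(x) - f(x+) - xi) / sigma(x))
    sigma = np.maximum(sigma, 1e-12)  # guard against zero variance
    z = (mu - f_best - xi) / sigma
    return norm.cdf(z)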
Expected Improvement (EI)
EI quantifies the expected amount of improvement over the current best observed value:

EI(x) = E[max(f(x) − f(x⁺), 0)]
Assuming a Gaussian Process surrogate, the analytical form of EI is:

EI(x) = (μ(x) − f(x⁺) − ξ) Φ(Z) + σ(x) ϕ(Z), with Z = (μ(x) − f(x⁺) − ξ) / σ(x) (and EI(x) = 0 when σ(x) = 0),

where ϕ is the probability density function (PDF) of the standard normal distribution.
EI is one of the most widely used acquisition functions. Unlike PI, it also accounts for the magnitude of the improvement.
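The analytical form above translates directly into a few lines of NumPy/SciPy; this sketch reuses the same mu, sigma, f_best, and xi conventions as the PI helper:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    # EI(x) = (mu - f_best - xi) * Phi(Z) + sigma * phi(Z)
    sigma = np.maximum(sigma, 1e-12)  # guard against zero variance
    improvement = mu - f_best - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)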
Upper Confidence Bound (UCB)
UCB balances exploitation (high mean) and exploration (high variance), focusing on points that have both a high predicted mean and high uncertainty:

UCB(x) = μ(x) + κ σ(x)
κ≥0 is a tuning parameter that controls the balance between exploration and exploitation.
A larger κ puts more emphasis on exploring uncertain regions.
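UCB is the simplest of the three to implement; a sketch following the same conventions:

import numpy as np

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # UCB(x) = mu(x) + kappa * sigma(x); a larger kappa favors exploration
    return mu + kappa * sigma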
Bayesian Optimization Strategy (Iterative Process)
Bayesian Optimization iteratively updates the surrogate model and optimizes the acquisition function.
It guides the search towards optimal regions while minimizing the number of expensive objective function evaluations.
Now, let us walk through the process with code snippets using KerasTuner
for a fraud detection task (binary classification where missing a fraud case, y = 1, costs us the most).
Step 1. Initialization
Initialize the process by sampling the hyperparameter space randomly or with a low-discrepancy sequence (usually 5 to 10 points) to get an initial picture of the objective function.
These initial observations are used to build the first version of the surrogate model.
As we build a Keras Sequential model, we first define and compile the model, then define the BayesianOptimization
tuner with the number of initial points to evaluate.
import keras_tuner as kt
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input

# build() method of the custom kt.HyperModel subclass
# (its instance is passed to the tuner below as `custom_hypermodel`)
def build(self, hp):
    # initialize a Keras Sequential model
    model = Sequential([
        Input(shape=(self.input_shape,)),
        Dense(
            units=hp.Int('neurons1', min_value=20, max_value=60, step=10),
            activation='relu'
        ),
        Dropout(
            hp.Float('dropout_rate1', min_value=0.0, max_value=0.5, step=0.1)
        ),
        Dense(
            units=hp.Int('neurons2', min_value=20, max_value=60, step=10),
            activation='relu'
        ),
        Dropout(
            hp.Float('dropout_rate2', min_value=0.0, max_value=0.5, step=0.1)
        ),
        Dense(
            1, activation='sigmoid',
            bias_initializer=keras.initializers.Constant(self.initial_bias_value)
        )
    ])

    # compile the model (`optimizer` is defined elsewhere, e.g. Adam with a tuned learning rate)
    model.compile(
        optimizer=optimizer,
        loss='binary_crossentropy',
        metrics=[
            'accuracy',
            keras.metrics.Precision(name='precision'),
            keras.metrics.Recall(name='recall'),
            keras.metrics.AUC(name='auc')
        ]
    )
    return model

# define a tuner with the initial points
tuner = kt.BayesianOptimization(
    hypermodel=custom_hypermodel,
    objective=kt.Objective("val_recall", direction="max"),
    max_trials=max_trials,
    executions_per_trial=executions_per_trial,
    directory=directory,
    project_name=project_name,
    num_initial_points=num_initial_points,
    overwrite=True,
)
num_initial_points
defines how many initial, randomly chosen hyperparameter configurations are evaluated before the algorithm starts to guide the search.
If not given, KerasTuner uses a default value of 3 * the number of dimensions of the hyperparameter space.
Step 2. Surrogate Model Training
Build and train the probabilistic model (surrogate model, often a Gaussian Process or a Tree-structured Parzen Estimator for Bayesian Optimization) using all available observed data points (input values and their corresponding output values) to approximate the true function.
The surrogate model provides the mean prediction μ(x) and the uncertainty σ(x) for any unobserved point.
KerasTuner uses an internal surrogate model to capture the relationship between hyperparameters and the objective metric.
After each objective function evaluation via a training run, the observed data points (hyperparameters and validation metrics) are used to update the internal surrogate model.
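KerasTuner handles this internally through its Oracle, so the following is only a conceptual sketch (not KerasTuner's API): refitting a GP surrogate on the hyperparameter vectors tried so far and their observed val_recall scores.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# hyperparameter vectors tried so far, e.g. [neurons1, neurons2, dropout_rate1, dropout_rate2]
X_obs = np.array([
    [30, 40, 0.1, 0.2],
    [50, 20, 0.3, 0.1],
    [40, 60, 0.0, 0.4],
])
# corresponding objective values (val_recall observed in each trial), illustrative numbers
y_obs = np.array([0.71, 0.78, 0.74])

# refit the surrogate on all observations gathered so far
surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
surrogate.fit(X_obs, y_obs)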
Step 3. Acquisition Function Optimization
Use an optimization algorithm (often a cheap, local optimizer like L-BFGS, or even random search) to find the next point x_(t+1) that maximizes the chosen acquisition function.
This step is crucial because it identifies the most promising next candidate for evaluation by balancing exploration (trying new, uncertain areas of the hyperparameter space) and exploitation (refining promising areas).
KerasTuner uses an acquisition function such as Expected Improvement or Upper Confidence Bound to find the next set of hyperparameters.
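Continuing the conceptual sketch from Step 2 (again, not KerasTuner internals), a cheap random-search optimizer over the acquisition function could look like this, reusing the surrogate, y_obs, and the expected_improvement helper defined earlier; the bounds mirror the search space declared in build():

import numpy as np

rng = np.random.default_rng(42)

# sample candidate hyperparameter vectors inside the search bounds
lower = np.array([20, 20, 0.0, 0.0])   # [neurons1, neurons2, dropout_rate1, dropout_rate2]
upper = np.array([60, 60, 0.5, 0.5])
candidates = rng.uniform(lower, upper, size=(1000, 4))

# score each candidate with the acquisition function (EI here)
mu, sigma = surrogate.predict(candidates, return_std=True)
scores = expected_improvement(mu, sigma, f_best=y_obs.max())

# the next point to evaluate is the acquisition maximizer
x_next = candidates[np.argmax(scores)]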
Step 4. Objective Function Evaluation
Evaluate the true, expensive objective function f(x) at the new candidate point x_(t+1).
The Keras model is trained on the provided training data and evaluated on the validation data. We use val_recall
as the result of this evaluation.
# fit() method of the same custom HyperModel: trains the model built by build()
def fit(self, hp, model=None, *args, **kwargs):
    model = self.build(hp=hp) if not model else model
    # training-time hyperparameters are tuned here as well
    batch_size = hp.Choice('batch_size', values=[16, 32, 64])
    epochs = hp.Int('epochs', min_value=50, max_value=200, step=50)
    return model.fit(
        batch_size=batch_size,
        epochs=epochs,
        class_weight=self.class_weights_dict,
        *args,
        **kwargs
    )
Step 5. Data Update
Add the newly observed data point (x_(t+1), f(x_(t+1))) to the set of observations.
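In the conceptual sketch, this step is just bookkeeping before the next surrogate refit (KerasTuner's Oracle does the equivalent internally):

import numpy as np

# y_next is the objective value (val_recall) observed when evaluating x_next; illustrative number
y_next = 0.81

# append the new observation to the data the surrogate is refitted on
X_obs = np.vstack([X_obs, x_next])
y_obs = np.append(y_obs, y_next)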
Step 6. Iteration
Repeat Steps 2 to 5 until a stopping criterion is met.
Technically, the tuner.search()
method orchestrates the entire Bayesian optimization process, covering Steps 2 to 5:
tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping_callback]
)
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
best_keras_model_from_tuner = tuner.get_best_models(num_models=1)[0]
The method repeatedly performs these steps until the max_trials
limit is reached or other stopping criteria such as early_stopping_callback
are met.
Here, we set recall
as our key metric to penalize false negatives, which cost us the most in the fraud detection case.
Results
The Bayesian Optimization process aimed to boost the model’s performance, primarily by maximizing recall.
The tuning effort yielded a trade-off across key metrics, leading to a model with significantly improved recall at the expense of some precision and overall accuracy compared with the initial state:

Best performing hyperparameter set:
Optimal Neural Network Summary:

Key Performance Metrics:
- Recall: The model demonstrated a significant improvement in recall, increasing from an initial value of roughly 0.66 (or 0.645) to 0.8400. This indicates the optimized model is notably better at identifying positive cases.
- Precision: Concurrently, precision experienced a decrease. Starting from around 0.83 (or 0.81), it settled at 0.6747 post-optimization. This means that while more positive cases are being identified, a higher proportion of those identifications may be false positives.
- Accuracy: The overall accuracy of the model also saw a decline, moving from an initial 0.7640 (or 0.7475) down to 0.7175. This is consistent with the observed trade-off between recall and precision, where optimizing for one often impacts the other.
Comparing with Grid Search
We tuned a Keras Sequential model using Grid Search with the Adam optimizer for comparison:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from sklearn.model_selection import GridSearchCV
from sklearn.utils import class_weight
from scikeras.wrappers import KerasClassifier

# hyperparameter grid (the model__ prefix routes parameters to create_model)
param_grid = {
    'model__learning_rate': [0.001, 0.0005, 0.0001],
    'model__neurons1': [20, 30, 40],
    'model__neurons2': [20, 30, 40],
    'model__dropout_rate1': [0.1, 0.15, 0.2],
    'model__dropout_rate2': [0.1, 0.15, 0.2],
    'batch_size': [16, 32, 64],
    'epochs': [50, 100],
}

input_shape = X_train.shape[1]
initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])

# class weights for the imbalanced fraud dataset
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

# wrap the Keras model builder (create_model) for scikit-learn
keras_classifier = KerasClassifier(
    model=create_model,
    model__input_shape=input_shape,
    model__initial_bias_value=initial_bias,
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc')
    ]
)

# exhaustive search over the grid with 3-fold cross-validation, scored by recall
grid_search = GridSearchCV(
    estimator=keras_classifier,
    param_grid=param_grid,
    scoring='recall',
    cv=3,
    n_jobs=-1,
    error_score='raise'
)
grid_result = grid_search.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict
)

optimal_params = grid_result.best_params_
best_keras_classifier = grid_result.best_estimator_
Results
Grid Search tuning resulted in a model with strong precision and good overall accuracy, though with a lower recall in comparison with the Bayesian Optimization approach:
- Recall: 0.8214 / 0.7735 / 0.7150 / 0.7100 (final)
- Precision: 0.7884 / 0.8331 / 0.8034 / 0.8304 (final)
- Accuracy: 0.8005 / 0.8092 / 0.7700 / 0.7825 (final)
Best performing hyperparameter set:
Optimal Neural Network Summary:




Grid Search Performance:
- Recall: Achieved a recall of 0.7100, a slight decrease from its initial range (0.7735–0.7150).
- Precision: Showed robust performance at 0.8304, an improvement over its initial range (0.8331–0.8034).
- Accuracy: Settled at 0.7825, maintaining a solid overall predictive capability, slightly lower than its initial range (0.8092–0.7700).
Comparison with Bayesian Optimization:
- Recall: Bayesian Optimization (0.8400) significantly outperformed Grid Search (0.7100) in identifying positive cases.
- Precision: Grid Search (0.8304) achieved much higher precision than Bayesian Optimization (0.6747), indicating fewer false positives.
- Accuracy: Grid Search’s accuracy (0.7825) was notably higher than Bayesian Optimization’s (0.7175).
General Comparison with Grid Search
1. Approaching the Search Space
Bayesian Optimization
- Intelligent/Adaptive: Bayesian Optimization builds a probabilistic model (often a Gaussian Process) of the objective function (e.g., model performance as a function of hyperparameters). It uses this model to predict which hyperparameter combinations are most likely to yield better results.
- Informed: It learns from previous evaluations. After each trial, the probabilistic model is updated, guiding the search towards more promising regions of the hyperparameter space. This allows it to make “intelligent” choices about where to sample next, balancing exploration (trying new, unknown regions) and exploitation (focusing on regions that have shown good results).
- Sequential: It typically operates sequentially, evaluating one point at a time and updating its model before selecting the next.
Grid Search:
- Exhaustive/Brute-force: Grid Search systematically tries every possible combination of hyperparameter values from a pre-defined set of values for each hyperparameter. You specify a “grid” of values, and it evaluates every point on that grid.
- Uninformed: It does not use the results of previous evaluations to inform the choice of the next set of hyperparameters to try. Each combination is evaluated independently.
- Deterministic: Given the same grid, it will always explore the same combinations in the same order.
2. Computational Cost
Bayesian Optimization
- More Efficient: Designed to find optimal hyperparameters with significantly fewer evaluations than Grid Search. This makes it particularly effective when evaluating the objective function (e.g., training a machine learning model) is computationally expensive or time-consuming.
- Scalability: Generally scales better to higher-dimensional hyperparameter spaces than Grid Search, though it can still be computationally intensive for very high dimensions due to the overhead of maintaining and updating the probabilistic model.
Grid Search
- Computationally Expensive: As the number of hyperparameters and the range of values for each hyperparameter increase, the number of combinations grows exponentially. This results in very long run times and high computational cost, making it impractical for large search spaces. This is often referred to as the “curse of dimensionality.”
- Scalability: Doesn’t scale well with high-dimensional hyperparameter spaces.
3. Guarantees and Exploration
Bayesian Optimization
- Probabilistic guarantee: It aims to find the global optimum efficiently, but it does not offer the hard guarantee that Grid Search provides of finding the best configuration within a discrete set. Instead, it converges probabilistically towards the optimum.
- Smarter exploration: Its balance of exploration and exploitation helps it avoid getting stuck in local optima and discover optimal values more effectively.
Grid Search
- Guaranteed to find the best in the grid: If the optimal hyperparameters are within the defined grid, Grid Search is guaranteed to find them because it tries every combination.
- Limited exploration: It can miss optimal values if they fall between the discrete points defined in the grid.
4. When to Use Which
Bayesian Optimization:
- Large, high-dimensional hyperparameter spaces: When evaluating models is expensive and you have many hyperparameters to tune.
- When efficiency is paramount: To find good hyperparameters quickly, especially with limited computational resources or time.
- Black-box optimization problems: When the objective function is complex, non-linear, and does not have a known analytical form.
Grid Search
- Small, low-dimensional hyperparameter spaces: When you have only a few hyperparameters and a limited number of values for each, Grid Search can be a simple and effective choice.
- When exhaustiveness is critical: If you absolutely must evaluate every defined combination.
Conclusion
The experiment effectively demonstrated the distinct strengths of Bayesian Optimization and Grid Search in hyperparameter tuning.
Bayesian Optimization, by design, proved highly effective at intelligently navigating the search space and prioritizing a specific objective, in this case maximizing recall.
It achieved a higher recall (0.8400) than Grid Search, indicating its ability to find more positive instances.
This capability comes with an inherent trade-off, resulting in reduced precision and overall accuracy.
Such an outcome is especially valuable in applications where minimizing false negatives is critical (e.g., medical diagnosis, fraud detection).
Its efficiency, stemming from probabilistic modeling that guides the search towards promising areas, makes it a preferred method for optimizing costly experiments or simulations where each evaluation is expensive.
In contrast, Grid Search, while exhaustive, yielded a more balanced model with superior precision (0.8304) and overall accuracy (0.7825).
This means Grid Search was more conservative in its predictions, leading to fewer false positives.
In summary, while Grid Search offers a straightforward and exhaustive approach, Bayesian Optimization stands out as a more sophisticated and efficient method capable of finding superior results with fewer evaluations, particularly when optimizing for a specific, often complex, objective like maximizing recall in a high-dimensional space.
The optimal choice of tuning method ultimately depends on the specific performance priorities and resource constraints of the application.
Author: Kuriko IWAI
Portfolio / LinkedIn / Github
May 26, 2025
All images, unless otherwise noted, are by the author.
The article uses synthetic data, licensed under Apache 2.0 for commercial use.