
Mastering Monte Carlo: Simulate Your Way to Better Machine Learning Models


In the Monte Carlo method, the pi estimate is based on the proportion of “darts” that land inside the circle relative to the total number of darts thrown. Because the unit circle has area π and the enclosing 2×2 square has area 4, that proportion approximates π/4, so multiplying it by 4 gives our estimate of pi. The resulting estimated pi value is then used to draw a circle. If the Monte Carlo estimate is inaccurate, the circle will again be the wrong size. The width of the gap between this estimated circle and the unit circle gives an indication of the accuracy of the Monte Carlo estimate.

However, since the Monte Carlo method generates more accurate estimates as the number of “darts” increases, the estimated circle should converge towards the unit circle as more “darts” are thrown. Therefore, while both methods show a gap when the estimate is inaccurate, this gap should shrink more consistently with the Monte Carlo method as the number of “darts” increases.

What makes Monte Carlo simulations so powerful is their ability to harness randomness to solve deterministic problems. By generating a large number of random scenarios and analyzing the results, we can estimate the probability of different outcomes, even for complex problems that would be difficult to solve analytically.

In the case of estimating pi, the Monte Carlo method allows us to make a very accurate estimate, even though we’re just throwing darts randomly. As discussed, the more darts we throw, the more accurate our estimate becomes. This is an illustration of the law of large numbers, a fundamental concept in probability theory which states that the average of the results obtained from a large number of trials should be close to the expected value, and will tend to get closer as more trials are performed. Let’s see whether this holds for our six examples shown in Figures 2a-2f by plotting the number of darts thrown against the difference between the Monte Carlo-estimated pi and the true pi. In general, our graph (Figure 2g) should trend downward. Here’s the code to do this:

# Calculate the differences between the actual pi and the estimated pi
# (This continues the earlier simulation code, which already imported math, plotly, and pio
# and defined num_darts_thrown and pi_estimates)
diff_pi = [abs(estimate - math.pi) for estimate in pi_estimates]

# Create the figure for the number of darts vs difference in pi plot (Figure 2g)
fig2g = go.Figure(data=go.Scatter(x=num_darts_thrown, y=diff_pi, mode='lines'))

# Add title and labels to the plot
fig2g.update_layout(
    title="Fig2g: Darts Thrown vs Difference in Estimated Pi",
    xaxis_title="Number of Darts Thrown",
    yaxis_title="Difference in Pi",
)

# Display the plot
fig2g.show()

# Save the plot as a png
pio.write_image(fig2g, "fig2g.png")

Note that, even with only 6 examples, the overall pattern is as expected: more darts thrown (more scenarios) means a smaller difference between the estimated and true value, and thus a better prediction.

Let’s say we throw 1,000,000 total darts and allow ourselves 500 predictions. In other words, we’ll record the difference between the estimated and actual values of pi at 500 evenly spaced intervals throughout the simulation of 1,000,000 thrown darts. Rather than generating 500 more figures, let’s just skip to what we’re trying to verify: whether it’s indeed true that, as more darts are thrown, the difference between our predicted value of pi and the real pi gets smaller. We’ll use a scatter plot (Figure 2h):

# 500 Monte Carlo scenarios; 1,000,000 darts thrown
import random
import math
import plotly.graph_objects as go
import plotly.io as pio

# Total number of darts to throw (1M)
num_darts = 1000000
darts_in_circle = 0

# Number of scenarios to record (500)
num_scenarios = 500
darts_per_scenario = num_darts // num_scenarios

# Lists to store the data for each scenario
darts_thrown_list = []
pi_diff_list = []

# We'll throw a number of darts
for i in range(num_darts):
    # Generate random x, y coordinates between -1 and 1
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)

    # Check if the dart is inside the circle
    # A dart is inside the circle if its distance from the origin (0,0) is less than or equal to 1
    if math.sqrt(x**2 + y**2) <= 1:
        darts_in_circle += 1

    # If it's time to record a scenario
    if (i + 1) % darts_per_scenario == 0:
        # Estimate pi with the Monte Carlo method
        # The estimate is 4 times the number of darts inside the circle divided by the total number of darts
        pi_estimate = 4 * darts_in_circle / (i + 1)

        # Record the number of darts thrown and the difference between the estimated and actual values of pi
        darts_thrown_list.append((i + 1) / 1000)  # Dividing by 1000 to display in thousands
        pi_diff_list.append(abs(pi_estimate - math.pi))

# Create a scatter plot of the data
fig2h = go.Figure(data=go.Scattergl(x=darts_thrown_list, y=pi_diff_list, mode='markers'))

# Update the layout of the plot
fig2h.update_layout(
    title="Fig2h: Difference between Estimated and Actual Pi vs. Number of Darts Thrown (in Thousands)",
    xaxis_title="Number of Darts Thrown (in Thousands)",
    yaxis_title="Difference between Estimated and Actual Pi",
)

# Display the plot
fig2h.show()

# Save the plot as a png
pio.write_image(fig2h, "fig2h.png")

You may be thinking to yourself at this point, “Monte Carlo is an interesting statistical tool, but how does it apply to machine learning?” The short answer is: in many ways. One of the many applications of Monte Carlo simulations in machine learning is in the realm of hyperparameter tuning.

Hyperparameters are the knobs and dials that we (the humans) adjust when setting up machine learning algorithms. They control aspects of the algorithm’s behavior that, crucially, aren’t learned from the data. For example, in a decision tree, the maximum depth of the tree is a hyperparameter. In a neural network, the learning rate and the number of hidden layers are hyperparameters, as the short sketch below illustrates.
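
To make this concrete, here’s a minimal sketch (using scikit-learn, purely for illustration; the specific values are arbitrary) of how such hyperparameters are set by hand before training, rather than learned from the data:

# Illustration only: hyperparameters are chosen by us before training ever starts
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# max_depth is a hyperparameter of a decision tree
tree = DecisionTreeClassifier(max_depth=3)

# hidden_layer_sizes and learning_rate_init are hyperparameters of a neural network
net = MLPClassifier(hidden_layer_sizes=(32, 16), learning_rate_init=0.001)

# The tree's splits and the network's weights are only learned later, inside .fit()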

Selecting the right hyperparameters can make the difference between a model that performs poorly and one that performs excellently. But how do we know which hyperparameters to choose? This is where Monte Carlo simulations come in.

Traditionally, machine learning practitioners have used methods like grid search to tune hyperparameters. Grid search involves specifying a set of possible values for each hyperparameter, then training and evaluating a model for every possible combination of those values. This can be computationally expensive and time-consuming, especially when there are many hyperparameters to tune or a wide range of possible values each can take.

Monte Carlo simulations offer a more efficient alternative. Instead of exhaustively searching through all possible combinations of hyperparameters, we can randomly sample from the hyperparameter space according to some probability distribution. This allows us to explore the hyperparameter space more efficiently and find good combinations of hyperparameters faster, as the sketch below suggests.
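
As a rough sketch of the idea (the ranges below are made up for illustration and aren’t tied to the dataset we use later), each candidate is simply drawn at random from a distribution over the hyperparameter space:

# Sketch: drawing hyperparameter candidates at random instead of exhaustively
# The ranges here are hypothetical
import numpy as np

rng = np.random.default_rng(42)

n_candidates = 5
for _ in range(n_candidates):
    C = 10 ** rng.uniform(-3, 3)        # log-uniform draw between 0.001 and 1000
    penalty = rng.choice(['l1', 'l2'])  # uniform draw over the two penalty types
    print(f"candidate: C={C:.4g}, penalty={penalty}")

Each candidate would then be trained and scored, and we’d keep the best one; we do exactly that with a real model later in the article.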

In the next section, we’ll use a real dataset to demonstrate how to use Monte Carlo simulations for hyperparameter tuning in practice. Let’s get started!

The Heartbeat of Our Experiment: The Heart Disease Dataset

In the world of machine learning, data is the lifeblood that powers our models. For our exploration of Monte Carlo simulations in hyperparameter tuning, let’s look at a dataset that’s close to the heart, quite literally. The Heart Disease dataset (CC BY 4.0) from the UCI Machine Learning Repository is a collection of medical records from patients, some of whom have heart disease.

The dataset contains 14 attributes, including age, sex, chest pain type, resting blood pressure, cholesterol level, fasting blood sugar, and others. The target variable is the presence of heart disease, making this a binary classification task. With a mix of categorical and numerical features, it’s an interesting dataset for demonstrating hyperparameter tuning.

First, let’s take a look at our dataset to get a sense of what we’ll be working with, always a good place to start.


#Load and examine first few rows of dataset

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import numpy as np
import plotly.graph_objects as go

# Load the dataset
# The dataset is available on the UCI Machine Learning Repository
# It is a dataset about heart disease and includes various patient measurements
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# Define the column names for the dataframe
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Load the dataset into a pandas dataframe
# We specify the column names and also tell pandas to treat '?' as NaN
df = pd.read_csv(url, names=column_names, na_values="?")

# Print the first few rows of the dataframe
# This gives us a quick overview of the data
print(df.head())

This shows us the first few rows of our dataset across all columns. If you’ve loaded the right csv and named your columns as I have, your output will look like Figure 3.

Figure 3: First 4 rows of data from our dataset

Before we can use the Heart Disease dataset for hyperparameter tuning, we need to preprocess the data. This involves several steps:

  1. Handling missing values: Some records in the dataset have missing values. We’ll need to decide how to handle these, whether by deleting the records, filling in the missing values, or some other method.
  2. Encoding categorical variables: Many machine learning algorithms require input data to be numerical. We’ll have to convert categorical variables into a numerical format.
  3. Normalizing numerical features: Machine learning algorithms often perform better when numerical features are on a similar scale. We’ll apply normalization to adjust the scale of these features.

Let’s start by handling missing values. In our Heart Disease dataset, we have a few missing values in the ‘ca’ and ‘thal’ columns. We’ll fill these missing values with the median of the respective column. This is a common strategy for dealing with missing data, as it doesn’t drastically affect the distribution of the data.

Next, we’ll encode the categorical variables. In our dataset, the ‘cp’, ‘restecg’, ‘slope’, ‘ca’, and ‘thal’ columns are categorical. We’ll use label encoding to convert these categorical variables into numerical ones. Label encoding assigns each unique category in a column a different integer.

Finally, we’ll normalize the numerical features. Normalization adjusts the scale of numerical features so that they all fall within a similar range. This can help improve the performance of many machine learning algorithms. We’ll use standard scaling for normalization, which transforms the data to have a mean of 0 and a standard deviation of 1.

Here’s the Python code that performs all of these preprocessing steps:

# Preprocess

# Import necessary libraries
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Identify missing values in the dataset
# This will print the number of missing values in each column
print(df.isnull().sum())

# Fill missing values with the median of the column
# The SimpleImputer class from sklearn provides basic strategies for imputing missing values
# We're using the 'median' strategy, which replaces missing values with the median of each column
imputer = SimpleImputer(strategy='median')

# Apply the imputer to the dataframe
# The result is a new dataframe where missing values have been filled in
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Print the first few rows of the filled dataframe
# This gives us a quick check to make sure the imputation worked correctly
print(df_filled.head())

# Identify categorical variables in the dataset
# These are variables that contain non-numerical data
# (Note: in this dataset the categorical columns are already integer-coded, so this list may be empty)
categorical_vars = df_filled.select_dtypes(include='object').columns

# Encode categorical variables
# The LabelEncoder class from sklearn converts each unique string into a unique integer
encoder = LabelEncoder()
for var in categorical_vars:
    df_filled[var] = encoder.fit_transform(df_filled[var])

# Normalize numerical features
# The StandardScaler class from sklearn standardizes features by removing the mean and scaling to unit variance
scaler = StandardScaler()

# Apply the scaler to the dataframe
# The result is a new dataframe where numerical features have been normalized
df_normalized = pd.DataFrame(scaler.fit_transform(df_filled), columns=df_filled.columns)

# Print the first few rows of the normalized dataframe
# This gives us a quick check to make sure the normalization worked correctly
print(df_normalized.head())

The first print statement shows us the number of missing values in each column of the original dataset. In our case, the ‘ca’ and ‘thal’ columns had a few missing values.

The second print statement shows us the first few rows of the dataset after filling in the missing values. As discussed, we used the median of each column to fill in the missing values.

The third print statement shows us the first few rows of the dataset after encoding the categorical variables. After this step, all of the variables in our dataset are numerical.

The final print statement shows us the first few rows of the dataset after normalizing the numerical features, after which the data will have a mean of 0 and a standard deviation of 1. After this step, all of the numerical features in our dataset are on a similar scale. Check that your output resembles Figure 4:

Figure 4: Preprocessing Print Statements Output

After running this code, we have a preprocessed dataset that’s ready for modeling.

Now that we’ve preprocessed our data, we’re ready to implement a basic machine learning model. This will serve as our baseline model, which we’ll later try to improve through hyperparameter tuning.

We’ll use a simple logistic regression model for this task. Note that while it’s called “regression,” it’s actually one of the most popular algorithms for binary classification problems, like the one we’re dealing with in the Heart Disease dataset. It’s a linear model that predicts the probability of the positive class, as the short sketch below illustrates.
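
Concretely, logistic regression computes a weighted sum of the features and passes it through the sigmoid function to turn it into a probability. Here’s a tiny sketch of that mechanic; the weights and feature values are invented purely for illustration:

# Sketch of what logistic regression computes for a single example
# The weights and inputs below are made up for illustration
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 0.3])   # one (already scaled) feature vector
w = np.array([0.8, -0.4, 1.1])   # coefficients the model would learn from data
b = -0.2                         # intercept the model would learn from data

p_positive = sigmoid(np.dot(w, x) + b)  # predicted probability of the positive class
print(f"Predicted probability of the positive class: {p_positive:.3f}")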

After training our model, we’ll evaluate its performance using two common metrics: accuracy and ROC-AUC. Accuracy is the proportion of correct predictions out of all predictions, while ROC-AUC (Receiver Operating Characteristic, Area Under the Curve) measures how well the model trades off the true positive rate against the false positive rate across classification thresholds.
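
Here’s a quick toy example (the labels and scores are invented) of how the two metrics are computed with scikit-learn:

# Toy illustration of accuracy vs. ROC-AUC; the labels and scores are made up
from sklearn.metrics import accuracy_score, roc_auc_score

y_true   = [0, 0, 1, 1, 1, 0]
y_pred   = [0, 1, 1, 1, 0, 0]                # hard class predictions
y_scores = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]    # predicted probabilities of class 1

print("Accuracy:", accuracy_score(y_true, y_pred))    # fraction of correct predictions
print("ROC-AUC:", roc_auc_score(y_true, y_scores))    # ranking quality across all thresholds

Note that ROC-AUC is usually computed from predicted probabilities or scores; passing hard 0/1 predictions, as the code in this article does, effectively evaluates it at a single threshold.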

But what does this have to do with Monte Carlo simulations? Well, machine learning models like logistic regression have several hyperparameters that can be tuned to improve performance. However, finding the best set of hyperparameters can be like searching for a needle in a haystack. This is where Monte Carlo simulations come in. By randomly sampling different sets of hyperparameters and evaluating their performance, we can estimate the distribution of good hyperparameters and make an informed guess about the best ones to use, much like how we homed in on better estimates of pi in our dart-throwing exercise.

Here’s the Python code that implements and evaluates a basic logistic regression model on our newly preprocessed data:

# Logistic Regression Model - Baseline

# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Replace the 'target' column in the normalized DataFrame with the original 'target' column
# This is done because the 'target' column was also normalized, which is not what we want
df_normalized['target'] = df['target']

# Binarize the 'target' column
# This is done because the original 'target' column contains values from 0 to 4
# We want to simplify the problem to a binary classification problem: heart disease or no heart disease
df_normalized['target'] = df_normalized['target'].apply(lambda x: 1 if x > 0 else 0)

# Split the data into training and test sets
# The 'target' column is our label, so we drop it from our features (X)
# We use a test size of 20%, meaning 80% of the data will be used for training and 20% for testing
X = df_normalized.drop('target', axis=1)
y = df_normalized['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Implement a basic logistic regression model
# Logistic regression is a simple yet powerful linear model for binary classification problems
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
# The model has been trained, so we can now use it to make predictions on unseen data
y_pred = model.predict(X_test)

# Evaluate the model
# We use accuracy (the proportion of correct predictions) and ROC-AUC (a measure of how well the model distinguishes between classes) as our metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Print the performance metrics
# These give us an indication of how well our model is performing
print("Baseline Model " + f'Accuracy: {accuracy}')
print("Baseline Model " + f'ROC-AUC: {roc_auc}')

With an accuracy of 0.885 and an ROC-AUC score of 0.884, our basic logistic regression model has set a solid baseline for us to improve upon. These metrics indicate that our model is performing quite well at distinguishing between patients with and without heart disease. Let’s see if we can make it better.

In machine learning, a model’s performance can often be improved by tuning its hyperparameters. Hyperparameters are parameters that aren’t learned from the data but are set prior to the start of the learning process. For example, in logistic regression, the regularization strength ‘C’ and the type of penalty (‘l1’ or ‘l2’) are hyperparameters.

Let’s perform hyperparameter tuning on our logistic regression model using grid search. We’ll tune the ‘C’ and ‘penalty’ hyperparameters, and we’ll use ROC-AUC as our scoring metric. Let’s see if we can beat our baseline model’s performance.

Now, let’s start with the Python code for this section.

# Grid Search

# Import necessary libraries
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters and their values
# 'C' is the inverse of regularization strength (smaller values specify stronger regularization)
# 'penalty' specifies the norm used in the penalization (l1 or l2)
hyperparameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                   'penalty': ['l1', 'l2']}

# Implement grid search
# GridSearchCV is a technique used to tune our model's hyperparameters
# We pass our model, the hyperparameters to tune, and the number of folds for cross-validation
# We're using ROC-AUC as our scoring metric
# Note: LogisticRegression's default solver only supports the 'l2' penalty;
# pass solver='liblinear' if you want the 'l1' candidates to be fit as well
grid_search = GridSearchCV(LogisticRegression(), hyperparameters, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
# GridSearchCV has found the best hyperparameters for our model, so we print them out
best_params = grid_search.best_params_
print(f'Best hyperparameters: {best_params}')

# Evaluate the best model
# GridSearchCV also gives us the best model, so we can use it to make predictions and evaluate its performance
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
roc_auc_best = roc_auc_score(y_test, y_pred_best)

# Print the performance metrics of the best model
# These give us an indication of how well our model is performing after hyperparameter tuning
print("Grid Search Method " + f'Accuracy of the best model: {accuracy_best}')
print("Grid Search Method " + f'ROC-AUC of the best model: {roc_auc_best}')

With the best hyperparameters found to be {‘C’: 0.1, ‘penalty’: ‘l2’}, our grid search reports an accuracy of 0.852 and an ROC-AUC score of 0.853 for the best model. Interestingly, this performance is slightly lower than our baseline model’s. This could be because our baseline model’s hyperparameters were already well-suited to this particular dataset, or it could be a result of the randomness inherent in the train-test split. Regardless, it’s a valuable reminder that more complex models and techniques aren’t always better.

However, you may have also noticed that our grid search only explored a relatively small number of possible hyperparameter combinations. In practice, the number of hyperparameters and their potential values can be much larger, making grid search computationally expensive or even infeasible, as the quick back-of-the-envelope sketch below shows.
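
To see how quickly exhaustive search blows up, here’s a small back-of-the-envelope sketch; the hyperparameter counts beyond ‘C’ and ‘penalty’ are hypothetical:

# Rough illustration: the number of models grid search must train grows
# multiplicatively with every hyperparameter added (counts below are hypothetical)
from math import prod

grid_sizes = {
    'C': 7,             # 7 candidate values, as in our grid above
    'penalty': 2,       # 'l1' and 'l2'
    'solver': 3,        # a hypothetical third hyperparameter
    'class_weight': 2,  # a hypothetical fourth hyperparameter
}
cv_folds = 5

n_fits = prod(grid_sizes.values()) * cv_folds
print(f"Grid search would train {n_fits} models")  # 7 * 2 * 3 * 2 * 5 = 420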

This is where the Monte Carlo method comes in. Let’s see if this approach improves on either the original baseline or the grid search-based model’s performance:

# Monte Carlo

# Import necessary libraries
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the range of hyperparameters
# 'C' is the inverse of regularization strength (smaller values specify stronger regularization)
# 'penalty' specifies the norm used in the penalization (l1 or l2)
C_range = np.logspace(-3, 3, 7)
penalty_options = ['l1', 'l2']

# Initialize variables to store the best score and hyperparameters
best_score = 0
best_hyperparams = None

# Perform the Monte Carlo simulation
# We'll perform 1000 iterations. You can play with this number to see how the performance changes.
# Remember the Law of Large Numbers!
for _ in range(1000):

    # Randomly select hyperparameters from the defined range
    C = np.random.choice(C_range)
    penalty = np.random.choice(penalty_options)

    # Create and evaluate the model with these hyperparameters
    # We're using the 'liblinear' solver as it supports both L1 and L2 regularization
    model = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Calculate the accuracy and ROC-AUC
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred)

    # If this model's ROC-AUC is the best so far, store its score and hyperparameters
    if roc_auc > best_score:
        best_score = roc_auc
        best_hyperparams = {'C': C, 'penalty': penalty}

# Print the best score and hyperparameters
print("Monte Carlo Method " + f'Best ROC-AUC: {best_score}')
print("Monte Carlo Method " + f'Best hyperparameters: {best_hyperparams}')

# Train the model with the best hyperparameters
best_model = LogisticRegression(**best_hyperparams, solver='liblinear')
best_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Calculate and print the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print("Monte Carlo Method " + f'Accuracy of the best model: {accuracy}')

With the Monte Carlo method, we found that the best ROC-AUC score was 0.9014, with the best hyperparameters being {‘C’: 0.1, ‘penalty’: ‘l1’}. The accuracy of the best model was 0.9016.

Looks like Monte Carlo just pulled an ace from the deck: this is an improvement over both the baseline model and the model tuned using grid search. I encourage you to tweak the Python code to see how it impacts performance, keeping in mind the principles discussed. See if you can improve the grid search method by increasing the hyperparameter space, or compare its computation time to the Monte Carlo method’s (a simple timing sketch follows below). Increase and decrease the number of iterations for our Monte Carlo method to see how that impacts performance.
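
If you do want to compare wall-clock time, a minimal pattern (wrap whichever search you’re timing) might look like this:

# Minimal timing sketch; place grid_search.fit(...) or the Monte Carlo loop inside
import time

start = time.perf_counter()
# ... run the search you want to time here ...
elapsed = time.perf_counter() - start
print(f"Search took {elapsed:.2f} seconds")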

The Monte Carlo method, born from a game of solitaire, has undoubtedly reshaped the landscape of computational mathematics and data science. Its power lies in its simplicity and flexibility, allowing us to tackle complex, high-dimensional problems with relative ease. From estimating the value of pi with a game of darts to tuning hyperparameters in machine learning models, Monte Carlo simulations have proven to be a valuable tool in our data science arsenal.

In this article, we’ve journeyed from the origins of the Monte Carlo method, through its theoretical underpinnings, and into its practical applications in machine learning. We’ve seen how it can be used to optimize machine learning models, with a hands-on exploration of hyperparameter tuning using a real-world dataset. We’ve also compared it with other methods, demonstrating its efficiency and effectiveness.

But the story of Monte Carlo is far from over. As we continue to push the boundaries of machine learning and data science, the Monte Carlo method will undoubtedly continue to play an important role. Whether we’re developing sophisticated AI applications, making sense of complex data, or simply playing a game of solitaire, the Monte Carlo method is a testament to the power of simulation and approximation in solving complex problems.

As we move forward, let’s take a moment to appreciate the beauty of this method: a technique that has its roots in a simple card game, yet has the power to drive some of the most advanced computations in the world. The Monte Carlo method truly is a high-stakes game of chance and complexity, and so far, it seems, the house always wins. So keep shuffling the deck, keep playing your cards, and remember: in the game of data science, Monte Carlo could just be your ace in the hole.

Congratulations on making it to the end! We’ve journeyed through the world of probabilities, wrestled with complex models, and emerged with a newfound appreciation for the power of Monte Carlo simulations. We’ve seen them in action, simplifying intricate problems into manageable components, and even optimizing hyperparameters for machine learning tasks.

If you enjoy diving into the intricacies of ML problem-solving as much as I do, follow me on Medium and LinkedIn. Together, let’s navigate the AI labyrinth, one clever solution at a time.

Until our next statistical adventure, keep exploring, keep learning, and keep simulating! And on your data science and ML journey, may the odds be ever in your favor.

Note: All images, unless otherwise noted, are by the author.
