Bernoulli Naive Bayes, Explained: A Visual Guide with Code Examples for Beginners


Unlocking Predictive Power Through Binary Simplicity

All illustrations in this article were created by the author, incorporating licensed design elements from Canva Pro.

Unlike the baseline approach of dummy classifiers or the similarity-based reasoning of KNN, Naive Bayes leverages probability theory. It combines the individual probabilities of each “clue” (or feature) to make a final prediction. This straightforward yet powerful method has proven invaluable in various machine learning applications.

Naive Bayes is a machine learning algorithm that uses probability to classify data. It’s based on Bayes’ Theorem, a formula for calculating conditional probabilities. The “naive” part refers to its key assumption: it treats all features as independent of one another, even when they aren’t in reality. This simplification, while often unrealistic, greatly reduces computational complexity and works well in many practical scenarios.
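To make that independence assumption concrete, here is a minimal sketch of how a class score is built: the class probability multiplied by each feature’s conditional probability. All numbers below are hypothetical, purely for illustration.

# A MINIMAL SKETCH OF THE "NAIVE" SCORING RULE (hypothetical numbers) #
# Because features are treated as independent, the score for a class is just
# the class probability multiplied by each feature's conditional probability.
p_class = 0.6                             # P(class = "Yes")
p_features_given_class = [0.7, 0.4, 0.9]  # P(feature_i = observed value | class = "Yes")

score = p_class
for p in p_features_given_class:
    score *= p

print(score)  # ≈ 0.1512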

Naive Bayes methods are simple machine learning algorithms that use probability as their foundation.

There are three main types of Naive Bayes classifiers. The key difference between them lies in the assumption they make about the distribution of features:

  1. Bernoulli Naive Bayes: Suited to binary/boolean features. It assumes each feature is a binary-valued (0/1) variable.
  2. Multinomial Naive Bayes: Typically used for discrete counts. It’s often used in text classification, where features might be word counts.
  3. Gaussian Naive Bayes: Assumes that continuous features follow a normal distribution.
Bernoulli NB assumes binary data, Multinomial NB works with discrete counts, and Gaussian NB handles continuous data assuming a normal distribution.
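In scikit-learn, each of these variants is available as its own classifier. The short snippet below simply shows where they live; it assumes you already have binary, count, or continuous feature matrices to feed them.

from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

bernoulli_nb = BernoulliNB()      # binary/boolean features
multinomial_nb = MultinomialNB()  # discrete counts (e.g., word counts)
gaussian_nb = GaussianNB()        # continuous features, assumed normally distributed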

Let’s begin by focusing on the simplest one: Bernoulli NB. The “Bernoulli” in its name comes from the assumption that each feature is binary-valued.

Throughout this article, we’ll use this artificial golf dataset (inspired by [1]) as an example. This dataset predicts whether a person will play golf based on weather conditions.

Columns: ‘Outlook’, ‘Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Wind’ and ‘Play’ (target feature)
# IMPORTING DATASET #
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# ONE-HOT ENCODE 'Outlook' COLUMN
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# CONVERT 'Wind' (bool) and 'Play' (binary) COLUMNS TO BINARY INDICATORS
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Set feature matrix X and target vector y
X, y = df.drop(columns='Play'), df['Play']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

print(pd.concat([X_train, y_train], axis=1), end='\n\n')
print(pd.concat([X_test, y_test], axis=1))

We’ll adapt it slightly for Bernoulli Naive Bayes by converting our features to binary.

As all of the data needs to be in 0/1 format, ‘Outlook’ is one-hot encoded, while Temperature is split into ≤ 80 and > 80. Similarly, Humidity is split into ≤ 75 and > 75.
# One-hot encode the categorized columns and drop them after, but do it individually for training and test sets
# Define categories for 'Temperature' and 'Humidity' for training set
X_train['Temperature'] = pd.cut(X_train['Temperature'], bins=[0, 80, 100], labels=['Warm', 'Hot'])
X_train['Humidity'] = pd.cut(X_train['Humidity'], bins=[0, 75, 100], labels=['Dry', 'Humid'])

# Similarly, define for the test set
X_test['Temperature'] = pd.cut(X_test['Temperature'], bins=[0, 80, 100], labels=['Warm', 'Hot'])
X_test['Humidity'] = pd.cut(X_test['Humidity'], bins=[0, 75, 100], labels=['Dry', 'Humid'])

# One-hot encode the categorized columns
one_hot_columns_train = pd.get_dummies(X_train[['Temperature', 'Humidity']], drop_first=True, dtype=int)
one_hot_columns_test = pd.get_dummies(X_test[['Temperature', 'Humidity']], drop_first=True, dtype=int)

# Drop the categorized columns from training and test sets
X_train = X_train.drop(['Temperature', 'Humidity'], axis=1)
X_test = X_test.drop(['Temperature', 'Humidity'], axis=1)

# Concatenate the one-hot encoded columns with the original DataFrames
X_train = pd.concat([one_hot_columns_train, X_train], axis=1)
X_test = pd.concat([one_hot_columns_test, X_test], axis=1)

print(pd.concat([X_train, y_train], axis=1), '\n')
print(pd.concat([X_test, y_test], axis=1))

Bernoulli Naive Bayes operates on data where each feature is either 0 or 1.

  1. Calculate the probability of each class in the training data.
  2. For each feature and class, calculate the probability of the feature being 1 and 0 given the class.
  3. For a new instance: for each class, multiply its class probability by the probability of each feature value (0 or 1) for that class.
  4. Predict the class with the highest resulting probability.
For our golf dataset, a Bernoulli NB classifier looks at the probability of each feature occurring for each class (YES and NO), then makes a decision based on which class has the higher likelihood.
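As a rough illustration of these four steps, here is a minimal from-scratch sketch (no smoothing, 0/1 NumPy arrays only, with hypothetical toy data); the sections that follow walk through the same steps on the golf dataset.

import numpy as np

def bernoulli_nb_predict(X_train, y_train, x_new):
    # Minimal Bernoulli NB sketch: no smoothing, binary features only
    scores = {}
    for c in np.unique(y_train):
        X_c = X_train[y_train == c]
        class_prob = len(X_c) / len(X_train)        # step 1: class probability
        p1 = X_c.mean(axis=0)                       # step 2: P(feature = 1 | class)
        # step 3: multiply class probability by P(feature = observed value | class)
        feature_probs = np.where(x_new == 1, p1, 1 - p1)
        scores[c] = class_prob * feature_probs.prod()
    return max(scores, key=scores.get)              # step 4: highest score wins

# Hypothetical toy data: 4 instances, 3 binary features, labels 0/1
X_toy = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y_toy = np.array([1, 1, 0, 0])
print(bernoulli_nb_predict(X_toy, y_toy, np.array([1, 0, 1])))  # prints 1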

The training process for Bernoulli Naive Bayes involves calculating probabilities from the training data:

  1. Class Probability Calculation: For each class, calculate its probability: (Number of instances in this class) / (Total number of instances)
In our golf example, the algorithm would calculate how often golf is played overall.
from fractions import Fraction

def calc_target_prob(attr):
    total_counts = attr.value_counts().sum()
    prob_series = attr.value_counts().apply(lambda x: Fraction(x, total_counts).limit_denominator())
    return prob_series

print(calc_target_prob(y_train))

2. Feature Probability Calculation: For each feature and each class, calculate:

  • (Number of instances where the feature is 0 in this class) / (Number of instances in this class)
  • (Number of instances where the feature is 1 in this class) / (Number of instances in this class)
For each weather condition (e.g., sunny), this gives how often golf is played when it’s sunny and how often it’s not played when it’s sunny.
from fractions import Fraction

def sort_attr_label(attr, lbl):
    return (pd.concat([attr, lbl], axis=1)
            .sort_values([attr.name, lbl.name])
            .reset_index()
            .rename(columns={'index': 'ID'})
            .set_index('ID'))

def calc_feature_prob(attr, lbl):
    total_classes = lbl.value_counts()
    counts = pd.crosstab(attr, lbl)
    prob_df = counts.apply(lambda x: [Fraction(c, total_classes[x.name]).limit_denominator() for c in x])
    return prob_df

print(sort_attr_label(y_train, X_train['sunny']))
print(calc_feature_prob(X_train['sunny'], y_train))

The same process is applied to all of the other features.
for col in X_train.columns:
    print(calc_feature_prob(X_train[col], y_train), "\n")

3. Smoothing (Optional): Add a small value (often 1) to the numerator and denominator of each probability calculation to avoid zero probabilities.

We add 1 to all numerators and 2 to all denominators, so that the probabilities of a feature being 0 and 1 still sum to 1.
# In scikit-learn, all of the steps above are performed by this 'fit' method:
from sklearn.naive_bayes import BernoulliNB
nb_clf = BernoulliNB(alpha=1)
nb_clf.fit(X_train, y_train)
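If you want to inspect what fit has learned, scikit-learn stores the smoothed probabilities in log form; here is a quick way to view them, reusing the pandas and NumPy imports from earlier and the nb_clf just fitted above.

# Smoothed P(feature = 1 | class), recovered from the stored log-probabilities
print(pd.DataFrame(np.exp(nb_clf.feature_log_prob_),
                   columns=X_train.columns,
                   index=nb_clf.classes_))

# Class priors, P(class)
print(np.exp(nb_clf.class_log_prior_))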

4. Store Results: Save all calculated probabilities for use during classification.

Smoothing is already applied to all feature probabilities. We’ll use these tables to make predictions.

Given a new instance with features that are either 0 or 1:

  1. Probability Collection: For each possible class:
  • Start with the probability of this class occurring (class probability).
  • For each feature in the new instance, collect the probability of that feature being 0/1 for this class.
For ID 14, we select the probabilities of each of its feature values (either 0 or 1) occurring.

2. Score Calculation & Prediction: For each class:

  • Multiply all of the collected probabilities together.
  • The result is the score for this class.
  • The class with the highest score is the prediction.
After multiplying the class probability and all of the feature probabilities, we select the class that has the higher score.
y_pred = nb_clf.predict(X_test)
print(y_pred)
This straightforward probabilistic model gives perfect accuracy on this simple dataset.
# Evaluate the classifier
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Bernoulli Naive Bayes has a few important parameters:

  1. Alpha (α): This is the smoothing parameter. It adds a small count to each feature to prevent zero probabilities. The default is usually 1.0 (Laplace smoothing), as shown before.
  2. Binarize: If your features aren’t already binary, this threshold converts them. Any value above the threshold becomes 1, and any value at or below it becomes 0.
For BernoulliNB in scikit-learn, numerical features are often standardized rather than manually binarized. The model then internally converts these standardized values to binary, typically using 0 (the mean) as the threshold.

3. Fit Prior: Whether to learn class prior probabilities or assume uniform priors (50/50).

For our golf dataset, we might start with the default α=1.0, no binarization (since we’ve already made our features binary), and fit_prior=True.
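Spelled out as code, that configuration looks like the sketch below; binarize=None tells scikit-learn the features are already 0/1, so no extra thresholding is applied.

from sklearn.naive_bayes import BernoulliNB

# alpha=1.0 -> Laplace smoothing, binarize=None -> features used as-is, fit_prior=True -> learn class priors
nb_clf = BernoulliNB(alpha=1.0, binarize=None, fit_prior=True)
nb_clf.fit(X_train, y_train)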

Like any algorithm in machine learning, Bernoulli Naive Bayes has its strengths and limitations.

Pros:

  1. Simplicity: Easy to implement and understand.
  2. Efficiency: Fast to train and predict, and works well with large feature spaces.
  3. Performance with Small Datasets: Can perform well even with limited training data.
  4. Handles High-Dimensional Data: Works well with many features, especially in text classification.

Cons:

  1. Independence Assumption: Assumes all features are independent, which is often not true in real-world data.
  2. Limited to Binary Features: In its pure form, only works with binary data.
  3. Sensitivity to Input Data: Can be sensitive to how the features are binarized.
  4. Zero Frequency Problem: Without smoothing, zero probabilities can strongly affect predictions (see the arithmetic sketch below).
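The zero-frequency problem comes down to plain arithmetic: a single probability of 0 wipes out the whole product, no matter how strong the other features are. A tiny hypothetical example:

import numpy as np

# Hypothetical per-feature probabilities for one class; one feature value was never seen in training
probs_without_smoothing = [0.9, 0.8, 0.0]  # the 0.0 forces the whole score to zero
probs_with_smoothing = [0.9, 0.8, 0.1]     # smoothing replaces the 0 with a small value

print(np.prod(probs_without_smoothing))  # 0.0
print(np.prod(probs_with_smoothing))     # ≈ 0.072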

The Bernoulli Naive Bayes classifier is a simple yet powerful machine learning algorithm for binary classification. It excels in text analysis and spam detection, where features are typically binary. Known for its speed and efficiency, this probabilistic model performs well with small datasets and high-dimensional spaces.

Despite its naive assumption of feature independence, it often rivals more complex models in accuracy. Bernoulli Naive Bayes serves as an excellent baseline and real-time classification tool.

# Import needed libraries
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data for model
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split data into training and testing sets
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features (for automatic binarization)
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Train the model
nb_clf = BernoulliNB()
nb_clf.fit(X_train, y_train)

# Make predictions
y_pred = nb_clf.predict(X_test)

# Check accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
