Naive Bayes Classifiers in Scikit-Learn

The event models described above can also be combined in case we have a heterogeneous data set, i.e., a data set that contains several types of features (for example, both categorical and continuous features).

The module sklearn.naive_bayes provides implementations for all four of the Naive Bayes classifiers mentioned above:

  1. BernoulliNB implements the Bernoulli Naive Bayes model.
  2. CategoricalNB implements the categorical Naive Bayes model.
  3. MultinomialNB implements the multinomial Naive Bayes model.
  4. GaussianNB implements the Gaussian Naive Bayes model.

The first three classes accept a parameter called alpha that defines the smoothing parameter (by default it is set to 1.0).
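As a minimal sketch (not part of the original walkthrough), here is how these classes are instantiated; note that GaussianNB takes no alpha, exposing a var_smoothing parameter instead:

from sklearn.naive_bayes import BernoulliNB, CategoricalNB, MultinomialNB, GaussianNB

# The three discrete event models accept the smoothing parameter alpha
bernoulli_clf = BernoulliNB(alpha=1.0)      # binary features
categorical_clf = CategoricalNB(alpha=1.0)  # categorical features
multinomial_clf = MultinomialNB(alpha=1.0)  # count features

# GaussianNB takes no alpha; its var_smoothing parameter (default 1e-9)
# instead stabilizes the estimated variances
gaussian_clf = GaussianNB()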

In the following demonstration we will use MultinomialNB to solve a document classification task. The data set we are going to use is the 20 newsgroups dataset, which consists of 18,846 newsgroup posts, partitioned (almost) evenly across 20 different topics. This data set has been widely used in research of text applications in machine learning, including document classification and clustering.

Loading the Data Set

You can use the function fetch_20newsgroups() in Scikit-Learn to download the text documents with their labels. You can either download all the documents as one group, or download the training set and the test set separately (using the subset parameter). The split between the training and the test sets is based upon messages posted before or after a specific date.

By default, the text documents contain some metadata such as headers (e.g., the date of the post), footers (signatures) and quotes to other posts. Since these features are not relevant for the text classification task, we will strip them out using the remove parameter:

from sklearn.datasets import fetch_20newsgroups

train_set = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test_set = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

Note that the first time you call this function it may take a few minutes to download all the documents, after which they will be cached locally in the folder ~/scikit_learn_data.

The output of the function is a dictionary that contains the following attributes:

  • data — the set of documents
  • target — the target labels
  • target_names — the names of the document categories

Let’s store the documents and their labels in proper variables:

X_train, y_train = train_set.data, train_set.target
X_test, y_test = test_set.data, test_set.target

Data Exploration

Let’s do some basic exploration of the data. The number of documents we have in the training and the test sets is:

print('Documents in training set:', len(X_train))
print('Documents in test set:', len(X_test))
Documents in training set: 11314
Documents in test set: 7532

A straightforward calculation shows that 60% of the documents belong to the training set, and 40% to the test set.
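For example, a quick one-off check (not in the original article):

train_fraction = len(X_train) / (len(X_train) + len(X_test))
print(f'Training fraction: {train_fraction:.2%}')  # Training fraction: 60.03%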

Let’s print the list of categories:

categories = train_set.target_names
categories
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']

As is evident, some of the categories are closely related to one another (e.g., comp.sys.mac.hardware and comp.sys.ibm.pc.hardware), while others are highly uncorrelated (e.g., sci.electronics and soc.religion.christian).

Finally, let’s examine one of the documents in the training set (e.g., the first one):

print(X_train[0])
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Unsurprisingly, the label of this document is:

categories[y_train[0]]
'rec.autos'

Converting Text to Vectors

In order to feed text documents into machine learning models, we first need to convert them into vectors of numerical values (i.e., vectorize the text). This process typically involves preprocessing and cleaning of the text, and then choosing a suitable numerical representation for the words in the text.

Text preprocessing consists of various steps, among which the most common ones are:

  1. Cleaning and normalizing the text. This includes removing punctuation marks and special characters, and converting the text into lower-case.
  2. Text tokenization, i.e., splitting the text into individual words or terms.
  3. Removal of stop words. Stop words are a set of commonly used words in a given language. For example, stop words in English include words like “the”, “a”, “is”, “and”. These words are usually filtered out since they do not carry useful information.
  4. Stemming or lemmatization. Stemming reduces the word to its lexical root by removing or replacing its suffix, while lemmatization reduces the word to its canonical form (lemma) and also takes into account the context of the word (its part-of-speech). For example, the word computers has the lemma computer, but its lexical root is comput.

The following example demonstrates these steps on a given sentence:

Text preprocessing example
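Since the image is not reproduced here, the following minimal sketch performs the same steps using NLTK (it assumes the nltk package is installed; newer NLTK versions may require downloading punkt_tab instead of punkt):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # lemmatizer dictionary

sentence = 'The Computers were connected to the network.'

# 1. Clean and normalize: lowercase, drop punctuation and special characters
text = re.sub(r'[^a-z\s]', '', sentence.lower())

# 2. Tokenize the text into individual words
tokens = nltk.word_tokenize(text)

# 3. Remove stop words
tokens = [t for t in tokens if t not in stopwords.words('english')]

# 4. Stemming vs. lemmatization
print([PorterStemmer().stem(t) for t in tokens])           # ['comput', 'connect', 'network']
print([WordNetLemmatizer().lemmatize(t) for t in tokens])  # ['computer', 'connected', 'network']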

After cleaning the text, we need to choose how to vectorize it into a numerical vector. The most common approaches are:

  1. Bag-of-words. In this model, each document is represented by a word counts vector (similar to the one we used in the spam filter example).
  2. TF-IDF (Term Frequency times Inverse Document Frequency) measures how relevant a word is to a document by multiplying two metrics:
    (a) TF (Term Frequency) — how many times the word appears in the document.
    (b) IDF (Inverse Document Frequency) — the inverse of the frequency with which the word appears in documents across the entire corpus.
    The idea is to decrease the weight of words that occur frequently in the corpus, while increasing the weight of words that occur rarely (and thus are more indicative of the document’s category). A toy computation is sketched right after this list.
  3. Word embeddings. In this approach, words are mapped into real-valued vectors in such a way that words with similar meaning have close representations in the vector space. This model is typically used in deep learning and will be discussed in a future post.
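To make the TF-IDF idea concrete, here is a toy computation using the plain textbook formulas (note that scikit-learn’s TfidfVectorizer uses a smoothed variant, idf = ln((1 + n) / (1 + df)) + 1, and normalizes the resulting vectors):

import numpy as np

# Toy corpus of three tokenized documents
docs = [['cat', 'sat', 'mat'],
        ['cat', 'cat', 'hat'],
        ['dog', 'sat']]

word, doc = 'cat', docs[1]
tf = doc.count(word) / len(doc)    # term frequency of 'cat' in doc 1: 2/3
df = sum(word in d for d in docs)  # number of documents containing 'cat': 2
idf = np.log(len(docs) / df)       # inverse document frequency: ln(3/2)
print(tf * idf)                    # tf-idf weight of 'cat' in doc 1: ~0.27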

Scikit-Learn provides the following two transformers, which support both text preprocessing and vectorization:

  1. CountVectorizer uses the bag-of-words model.
  2. TfidfVectorizer uses the TF-IDF representation.

Important hyperparameters of these transformers include the following (an illustrative configuration is sketched after the list):

  • lowercase — whether to convert all the characters to lowercase before tokenizing (defaults to True).
  • token_pattern — the regular expression used to define what constitutes a token (the default selects tokens of two or more alphanumeric characters).
  • stop_words — if ‘english’, uses a built-in stop word list for English. If None (the default), no stop words will be used. You can also provide your own custom stop words list.
  • max_features — if not None, build a vocabulary that includes only the top max_features with the highest term frequency across the training corpus. Otherwise, all the features are used (this is the default).
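For illustration, here is a hypothetical configuration (the specific values are arbitrary, not recommendations from the article) that touches each of these hyperparameters:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    lowercase=True,                   # the default
    token_pattern=r'(?u)\b\w\w+\b',   # the default: 2+ alphanumeric characters
    stop_words='english',             # use the built-in English stop word list
    max_features=20000,               # keep only the 20,000 most frequent terms
)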

Note that these transformers do not provide advanced preprocessing techniques such as stemming or lemmatization. To apply these techniques, you will have to use other libraries such as NLTK (Natural Language Toolkit) or spaCy.

Since Naive Bayes models are known to work better with TF-IDF representations, we will use the TfidfVectorizer to convert the documents in the training set into TF-IDF vectors:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)

The shape of the extracted TF-IDF matrix is:

print(X_train_vec.shape)
(11314, 101322)

That is, there are 101,322 unique tokens in the vocabulary of the corpus. We can examine these tokens by calling the method get_feature_names_out() of the vectorizer:

vocab = vectorizer.get_feature_names_out()
print(vocab[50000:50010]) # pick a subset of the tokens
['innacurate' 'innappropriate' 'innards' 'innate' 'innately' 'inneficient'
'inner' 'innermost' 'innertubes' 'innervation']

Evidently, there was no automatic spell checker back in the 90s 🙂

The TF-IDF vectors are very sparse, with an average of 67 non-zero components out of more than 100,000:

print(X_train_vec.nnz / X_train_vec.shape[0])
66.802987449178

Let’s also vectorize the documents in the test set (note that on the test set we call the transform method instead of fit_transform):

X_test_vec = vectorizer.transform(X_test)

Constructing the Model

Let’s now construct a multinomial Naive Bayes classifier and fit it to the training set:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0.01)
clf.fit(X_train_vec, y_train)

Note that we need to set the smoothing parameter α to a very small number, since the TF-IDF values are scaled to be between 0 and 1, so the default α = 1 would cause a dramatic shift of the values.
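If you want to verify this choice rather than take it on faith, one option (a sketch, not part of the original walkthrough) is to cross-validate over a few candidate values of alpha:

from sklearn.model_selection import GridSearchCV

# Hypothetical sweep over a few candidate smoothing values
param_grid = {'alpha': [1e-3, 1e-2, 1e-1, 1.0]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring='f1_macro')
search.fit(X_train_vec, y_train)
print(search.best_params_)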

Evaluating the Model

Next, let’s evaluate the model on both the training and the test sets.

The accuracy and F1 score of the model on the training set are:

from sklearn.metrics import f1_score

accuracy_train = clf.score(X_train_vec, y_train)
y_train_pred = clf.predict(X_train_vec)
f1_train = f1_score(y_train, y_train_pred, average='macro')

print(f'Accuracy (train): {accuracy_train:.4f}')
print(f'F1 score (train): {f1_train:.4f}')

Accuracy (train): 0.9595
F1 score (train): 0.9622

And the accuracy and F1 score on the test set are:

accuracy_test = clf.score(X_test_vec, y_test)
y_test_pred = clf.predict(X_test_vec)
f1_test = f1_score(y_test, y_test_pred, average='macro')

print(f'Accuracy (test): {accuracy_test:.4f}')
print(f'F1 score (test): {f1_test:.4f}')

Accuracy (test): 0.7010
F1 score (test): 0.6844

The scores on the test set are relatively low compared to the training set. To investigate where the errors come from, let’s plot the confusion matrix of the test documents:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_test_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
fig, ax = plt.subplots(figsize=(10, 8))
disp.plot(ax=ax, cmap='Blues')

The confusion matrix on the test set

As we can see, most of the confusions occur between highly correlated topics, for example:

  • 74 confusions between topic 0 (alt.atheism) and topic 15 (soc.religion.christian)
  • 92 confusions between topic 18 (talk.politics.misc) and topic 16 (talk.politics.guns)
  • 89 confusions between topic 19 (talk.religion.misc) and topic 15 (soc.religion.christian)

In light of these findings, it seems that the Naive Bayes classifier did a fairly good job. Let’s examine how it compares to other standard classification algorithms.

Benchmarking

We will benchmark the Naive Bayes model against four other classifiers: logistic regression, KNN, random forest, and AdaBoost.

Let’s first write a function that takes a set of classifiers, evaluates them on the given data set, and also measures their training time:

import time

def benchmark(classifiers, names, X_train, y_train, X_test, y_test, verbose=True):
    evaluations = []

    for clf, name in zip(classifiers, names):
        evaluation = {}
        evaluation['classifier'] = name

        # Measure the training time
        start_time = time.time()
        clf.fit(X_train, y_train)
        evaluation['training_time'] = time.time() - start_time

        # Evaluate accuracy and macro-averaged F1 on the test set
        evaluation['accuracy'] = clf.score(X_test, y_test)
        y_test_pred = clf.predict(X_test)
        evaluation['f1_score'] = f1_score(y_test, y_test_pred, average='macro')

        if verbose:
            print(evaluation)
        evaluations.append(evaluation)
    return evaluations

We will now call this function with our five classifiers:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

classifiers = [clf, LogisticRegression(), KNeighborsClassifier(), RandomForestClassifier(), AdaBoostClassifier()]
names = ['Multinomial NB', 'Logistic Regression', 'KNN', 'Random Forest', 'AdaBoost']

evaluations = benchmark(classifiers, names, X_train_vec, y_train, X_test_vec, y_test)

The output we get is:

{'classifier': 'Multinomial NB', 'training_time': 0.06482672691345215, 'accuracy': 0.7010090281465746, 'f1_score': 0.6844389919212164}
{'classifier': 'Logistic Regression', 'training_time': 39.38498568534851, 'accuracy': 0.6909187466808284, 'f1_score': 0.6778246092753284}
{'classifier': 'KNN', 'training_time': 0.003989696502685547, 'accuracy': 0.08218268720127456, 'f1_score': 0.07567337211476842}
{'classifier': 'Random Forest', 'training_time': 43.847145318984985, 'accuracy': 0.6233404142326076, 'f1_score': 0.6062667217793061}
{'classifier': 'AdaBoost', 'training_time': 6.09197473526001, 'accuracy': 0.36563993627190655, 'f1_score': 0.40123307742451064}

Let’s plot the accuracy and F1 scores of the classifiers:

import pandas as pd

df = pd.DataFrame(evaluations).set_index('classifier')

df['accuracy'].plot.barh()
plt.xlabel('Accuracy (test)')
plt.ylabel('Classifier')

Accuracy scores on the test set

df['f1_score'].plot.barh(color='purple')
plt.xlabel('F1 score (test)')
plt.ylabel('Classifier')

F1 scores on the test set

Multinomial NB achieves both the highest accuracy and the highest F1 score. Notice that the classifiers were used with their default parameters without any tuning. For a fairer comparison, the algorithms should be compared after fine-tuning their hyperparameters. In addition, some algorithms such as KNN suffer from the curse of dimensionality, and dimensionality reduction is required in order to make them work well.
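As a sketch of the latter point (hypothetical, with an arbitrary choice of 100 components), KNN could be re-run on a low-dimensional LSA projection of the TF-IDF matrix:

from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Project the sparse TF-IDF matrix to 100 latent components (LSA) before KNN
knn_lsa = make_pipeline(TruncatedSVD(n_components=100), KNeighborsClassifier())
knn_lsa.fit(X_train_vec, y_train)
print(knn_lsa.score(X_test_vec, y_test))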

Let’s also plot the training times of the classifiers:

df['training_time'].plot.barh(color='green')
plt.xlabel('Training time (sec)')
plt.ylabel('Classifier')
Training time of the various classifiers

The training of Multinomial NB is so fast that we cannot even see its time in the graph! By examining the function’s output from above, we can see that its training time is only 0.064 seconds. Note that the training of KNN is also very fast (since no model is actually built), but its prediction time (not shown) is very slow.
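To verify this claim yourself, here is a quick sketch (not part of the original benchmark) that times only the prediction step:

import time
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier().fit(X_train_vec, y_train)  # "training" just stores the data

start_time = time.time()
knn.predict(X_test_vec)
print(f'KNN prediction time: {time.time() - start_time:.2f} sec')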

In conclusion, Multinomial NB has shown superiority over the other classifiers in all the examined criteria.

Finding the Most Informative Features

The Naive Bayes model also allows us to get the most informative features of each class, i.e., the features with the highest likelihood P(xⱼ|y).

The MultinomialNB class has an attribute named feature_log_prob_, which provides the log probability of the features for each class in a matrix of shape (n_classes, n_features).

Using this attribute, let’s write a function to find the ten most informative features (tokens) in each category:

import numpy as np

def show_top_n_features(clf, vectorizer, categories, n=10):
    feature_names = vectorizer.get_feature_names_out()

    for i, category in enumerate(categories):
        # Indices of the n tokens with the highest log probability in this class
        top_n = np.argsort(clf.feature_log_prob_[i])[-n:]
        print(f"{category}: {' '.join(feature_names[top_n])}")

show_top_n_features(clf, vectorizer, categories)

The output we get is:

alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: looking format 3d know program file files thanks image graphics
comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac
comp.windows.x: using windows x11r5 use application thanks widget server motif window
misc.forsale: asking email sell price condition new shipping offer 00 sale
rec.autos: don ford new good dealer just engine like cars car
rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike
rec.sport.baseball: braves players pitching hit runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: people use escrow nsa keys government chip clipper encryption key
sci.electronics: don thanks voltage used know does like circuit power use
sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg
sci.space: just lunar earth shuttle like moon launch orbit nasa space
soc.religion.christian: consider faith christian christ bible people christians church jesus god
talk.politics.guns: just law firearms government fbi don weapons people guns gun
talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel
talk.politics.misc: know state clinton president just think tax don government people
talk.religion.misc: think don koresh objective christians bible people christian jesus god

Most of the words appear to be strongly correlated with their corresponding category. However, there are a few generic words such as “just” and “does” that do not provide helpful information. This suggests that our model may be improved by using a better stop-words list. Indeed, Scikit-Learn recommends not to use its own default list, quoting from its documentation: “There are several known issues with ‘english’ and you should consider an alternative”. 😲
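One simple remedy (a sketch; the extra tokens listed here are just guesses based on the output above) is to extend the built-in list with corpus-specific generic words:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical extension of the built-in list with generic tokens seen above
custom_stop_words = list(ENGLISH_STOP_WORDS.union({'just', 'does', 'don', 'know', 'like', 'thanks'}))
vectorizer = TfidfVectorizer(stop_words=custom_stop_words)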

Summary

Let’s summarize the pros and cons of Naive Bayes as compared to other classification models:

Pros:

  • Extremely fast both in training and prediction
  • Provides class probability estimates
  • Can be used both for binary and multi-class classification problems
  • Requires a small amount of training data to estimate its parameters
  • Highly interpretable
  • Highly scalable (the number of parameters is linear in the number of features)
  • Works well with high-dimensional data
  • Robust to noise (the noisy samples are averaged out when estimating the conditional probabilities)
  • Can deal with missing values (the missing values are ignored when computing the likelihoods of the features)
  • No hyperparameters to tune (other than the smoothing parameter, which is rarely modified)

Cons:

  • Relies on the Naive Bayes assumption, which does not hold in many real-world domains
  • Correlation between the features can degrade the performance of the model
  • Generally outperformed by more complex models
  • The zero frequency problem: if a categorical feature has a category that was not observed in the training set, the model will assign a zero probability to its occurrence. Smoothing alleviates this problem but does not solve it completely.
  • Cannot handle continuous attributes without discretization or making assumptions on their distribution
  • Can be used only for classification tasks

This is the longest article I have written on Medium so far. I hope you enjoyed reading it at least as much as I enjoyed writing it. Let me know in the comments if something was not clear.

You can find the code examples of this article on my github: https://github.com/roiyeho/medium/tree/main/naive_bayes

All images unless otherwise noted are by the author.

The 20 newsgroups data set info:
