Bernoulli Naive Bayes is a specialized version of the Naive Bayes algorithm designed for binary features, i.e., features that take only two values, typically 0 and 1. In text classification, this means tracking the presence or absence of a word in a document rather than how many times the word appears.
How it Differs from Multinomial Naive Bayes
This is the most crucial distinction to understand. Let's compare the two algorithms on an example sentence: "A great, great movie."
- MultinomialNB (which you used in the other artifact) cares about word counts. It would see the feature great as having a value of 2, which makes it a good choice when word frequency carries signal.
- BernoulliNB, as used in the bernoulli_nb_deep_dive artifact, cares only about word presence. It would see the feature great as having a value of 1 (present); the fact that the word appears twice is irrelevant. This is why the code uses CountVectorizer(binary=True): it forces the vectorizer to output only 0s and 1s.
This makes BernoulliNB particularly useful for tasks where the simple presence of a word is a strong signal, such as in short texts like tweets or product reviews.
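To make the distinction concrete, here is a minimal sketch that vectorizes the example sentence both ways. It assumes only that scikit-learn is installed; the expected outputs are noted in the comments.

from sklearn.feature_extraction.text import CountVectorizer

sentence = ["A great, great movie"]

# Default CountVectorizer counts occurrences (MultinomialNB's view).
# Note: the default tokenizer drops single-character tokens like 'a'.
count_vec = CountVectorizer()
print(count_vec.fit_transform(sentence).toarray())   # [[2 1]] -> 'great' counted twice
print(count_vec.get_feature_names_out())             # ['great' 'movie']

# binary=True records presence/absence only (BernoulliNB's view).
binary_vec = CountVectorizer(binary=True)
print(binary_vec.fit_transform(sentence).toarray())  # [[1 1]] -> 'great' merely present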
How it Works: A Step-by-Step Explanation
The algorithm works by calculating probabilities based on the presence or absence of words.
1. Feature Binarization: The first step, as seen in your code, is to convert the text documents into binary feature vectors. The CountVectorizer(binary=True) creates a matrix where each row is a review and each column is a unique word from the vocabulary. A cell contains a 1 if the word is present in that review and a 0 if it is not.
2. Probability Calculation (Training): During the .fit() step, the model learns two key probabilities from the training data:
   - The prior probability of each class (e.g., the overall chance of a review being positive vs. negative).
   - The conditional probability of a word being present given a class. For example, it calculates P(word 'great' is present | class = 'positive').
3. Prediction: When classifying a new review, the model applies Bayes' theorem. For each class (positive and negative), it starts from the class prior and multiplies in a term for every word in the vocabulary: P(word present | class) for each word that appears in the review, and 1 − P(word present | class) for each word that does not. This explicit penalty for absent words is what sets BernoulliNB apart from MultinomialNB, which simply ignores words that do not occur. The class with the highest resulting score is chosen as the prediction; the small numeric sketch below walks through the arithmetic.
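Before the full example, here is a tiny numeric sketch of that scoring step. The priors and conditional probabilities are made up for illustration (they are not learned from the dataset below), but the arithmetic mirrors what BernoulliNB does in log space:

import math

# Hypothetical values, purely illustrative
prior = {'positive': 0.5, 'negative': 0.5}
p_present = {  # P(word present | class)
    'positive': {'great': 0.8, 'boring': 0.1},
    'negative': {'great': 0.2, 'boring': 0.7},
}

def bernoulli_score(present_words, cls):
    # Start from the class prior, then add a term for EVERY vocabulary word:
    # log p if the word is present, log (1 - p) if it is absent.
    score = math.log(prior[cls])
    for word, p in p_present[cls].items():
        score += math.log(p) if word in present_words else math.log(1 - p)
    return score

# A review containing only 'great': the absence of 'boring' also counts as evidence.
for cls in ('positive', 'negative'):
    print(cls, round(bernoulli_score({'great'}, cls), 3))
# positive: log 0.5 + log 0.8 + log 0.9 ≈ -1.021
# negative: log 0.5 + log 0.2 + log 0.3 ≈ -3.506  -> 'positive' wins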
# --- 1. Import Necessary Libraries ---
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
# --- 2. Prepare Sample Data: Sentiment Analysis ---
# Let's classify short reviews as 'positive' or 'negative'.
data = {
    'review': [
        'This movie was great and amazing',
        'I really enjoyed this film',
        'A fantastic and wonderful experience',
        'I loved the acting and the plot',
        'This was a terrible and awful movie',
        'I hated this film, it was boring',
        'A completely dreadful experience',
        'The plot was bad and the acting was poor'
    ],
    'sentiment': [
        'positive', 'positive', 'positive', 'positive',
        'negative', 'negative', 'negative', 'negative'
    ]
}
df = pd.DataFrame(data)
# --- 3. Feature Extraction: Binarizing Text Features ---
# We use CountVectorizer but set `binary=True`. This is the key step for BernoulliNB.
# It will output 1 if a word is present and 0 otherwise, ignoring frequency.
vectorizer = CountVectorizer(binary=True, stop_words='english')
X = vectorizer.fit_transform(df['review'])
y = df['sentiment']
# Split the data for training and testing; with only 8 reviews, stratify=y keeps one review of each class in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
# --- 4. Create and Train the Bernoulli Naive Bayes Model ---
# Initialize the classifier
model = BernoulliNB()
# Train the model on the binary feature vectors
model.fit(X_train, y_train)
print("--- Model Training Complete ---")
# --- 5. Make Predictions and Evaluate the Model ---
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred, labels=['positive', 'negative'])
class_report = classification_report(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
print("\n--- Classification Report ---")
print(class_report)
# Visualize the Confusion Matrix
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens',
            xticklabels=['positive', 'negative'], yticklabels=['positive', 'negative'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for BernoulliNB')
plt.show()
# --- 6. Predict on New, Unseen Reviews ---
new_reviews = [
    "The movie was wonderful and I loved it",
    "A bad and boring film"
]
# Transform the new reviews using the same binarizing vectorizer
new_reviews_transformed = vectorizer.transform(new_reviews)
# Make predictions
new_predictions = model.predict(new_reviews_transformed)
print("\n--- Predictions for New Reviews ---")
for review, prediction in zip(new_reviews, new_predictions):
    print(f"Review: '{review}' ==> Predicted Sentiment: '{prediction.upper()}'")