Bernoulli Naive Bayes is a specialized version of the Naive Bayes algorithm designed for binary features, i.e., features that take only two values, typically 0 and 1. In text classification, this means tracking the presence or absence of a word in a document rather than how many times the word appears.
How it Differs from Multinomial Naive Bayes
This is the most crucial distinction to understand. Let's compare the two algorithms on an example sentence: "A great, great movie."
- MultinomialNB (which you used in the other artifact) cares about word counts. It would see the feature great as having a value of 2, which makes it a good choice when word frequency carries signal.
- BernoulliNB, as used in the bernoulli_nb_deep_dive artifact, cares only about word presence. It would see the feature great as having a value of 1 (present); the fact that the word appears twice is irrelevant. This is why the code uses CountVectorizer(binary=True): it forces the vectorizer to output only 0s and 1s.
This makes BernoulliNB particularly useful for tasks where the simple presence of a word is a strong signal, such as in short texts like tweets or product reviews.
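To make the distinction concrete, here is a minimal sketch that vectorizes the example sentence both ways. It assumes only that scikit-learn is installed; the expected outputs are noted in the comments.

from sklearn.feature_extraction.text import CountVectorizer

sentence = ["A great, great movie"]

# Default CountVectorizer counts occurrences (MultinomialNB's view).
# Note: the default tokenizer drops single-character tokens like 'a'.
count_vec = CountVectorizer()
print(count_vec.fit_transform(sentence).toarray())   # [[2 1]] -> 'great' counted twice
print(count_vec.get_feature_names_out())             # ['great' 'movie']

# binary=True records presence/absence only (BernoulliNB's view).
binary_vec = CountVectorizer(binary=True)
print(binary_vec.fit_transform(sentence).toarray())  # [[1 1]] -> 'great' merely present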
How it Works: A Step-by-Step Explanation
The algorithm works by calculating probabilities based on the presence or absence of words.
1. Feature Binarization: The first step, as seen in your code, is to convert the text documents into binary feature vectors. The CountVectorizer(binary=True) creates a matrix where each row is a review and each column is a unique word from the vocabulary. A cell contains a 1 if the word is present in that review and a 0 if it is not.
2. Probability Calculation (Training): During the .fit() step, the model learns two key probabilities from the training data:
   - The prior probability of each class (e.g., the overall chance of a review being positive vs. negative).
   - The conditional probability of a word being present given a class. For example, it calculates P(word 'great' is present | class = 'positive').
3. Prediction: When classifying a new review, the model applies Bayes' theorem. For each class (positive and negative), it starts from the class prior and multiplies in a term for every word in the vocabulary: P(word present | class) for each word that appears in the review, and 1 − P(word present | class) for each word that does not. This explicit penalty for absent words is what sets BernoulliNB apart from MultinomialNB, which simply ignores words that do not occur. The class with the highest resulting score is chosen as the prediction; the small numeric sketch below walks through the arithmetic.
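Before the full example, here is a tiny numeric sketch of that scoring step. The priors and conditional probabilities are made up for illustration (they are not learned from the dataset below), but the arithmetic mirrors what BernoulliNB does in log space:

import math

# Hypothetical values, purely illustrative
prior = {'positive': 0.5, 'negative': 0.5}
p_present = {  # P(word present | class)
    'positive': {'great': 0.8, 'boring': 0.1},
    'negative': {'great': 0.2, 'boring': 0.7},
}

def bernoulli_score(present_words, cls):
    # Start from the class prior, then add a term for EVERY vocabulary word:
    # log p if the word is present, log (1 - p) if it is absent.
    score = math.log(prior[cls])
    for word, p in p_present[cls].items():
        score += math.log(p) if word in present_words else math.log(1 - p)
    return score

# A review containing only 'great': the absence of 'boring' also counts as evidence.
for cls in ('positive', 'negative'):
    print(cls, round(bernoulli_score({'great'}, cls), 3))
# positive: log 0.5 + log 0.8 + log 0.9 ≈ -1.021
# negative: log 0.5 + log 0.2 + log 0.3 ≈ -3.506  -> 'positive' wins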
# --- 1. Import Necessary Libraries ---
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
# --- 2. Prepare Sample Data: Sentiment Analysis ---
# Let's classify short reviews as 'positive' or 'negative'.
data = {
    'review': [
        'This movie was great and amazing',
        'I really enjoyed this film',
        'A fantastic and wonderful experience',
        'I loved the acting and the plot',
        'This was a terrible and awful movie',
        'I hated this film, it was boring',
        'A completely dreadful experience',
        'The plot was bad and the acting was poor'
    ],
    'sentiment': [
        'positive', 'positive', 'positive', 'positive',
        'negative', 'negative', 'negative', 'negative'
    ]
}
df = pd.DataFrame(data)
# --- 3. Feature Extraction: Binarizing Text Features ---
# We use CountVectorizer but set `binary=True`. This is the key step for BernoulliNB.
# It will output 1 if a word is present and 0 otherwise, ignoring frequency.
vectorizer = CountVectorizer(binary=True, stop_words='english')
X = vectorizer.fit_transform(df['review'])
y = df['sentiment']
# Split the data for training and testing; with only 8 reviews, stratify=y keeps one review of each class in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
# --- 4. Create and Train the Bernoulli Naive Bayes Model ---
# Initialize the classifier
model = BernoulliNB()
# Train the model on the binary feature vectors
model.fit(X_train, y_train)
print("--- Model Training Complete ---")
# --- 5. Make Predictions and Evaluate the Model ---
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred, labels=['positive', 'negative'])
class_report = classification_report(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
print("\n--- Classification Report ---")
print(class_report)
# Visualize the Confusion Matrix
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens',
            xticklabels=['positive', 'negative'], yticklabels=['positive', 'negative'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for BernoulliNB')
plt.show()
# --- 6. Predict on New, Unseen Reviews ---
new_reviews = [
    "The movie was wonderful and I loved it",
    "A bad and boring film"
]
# Transform the new reviews using the same binarizing vectorizer
new_reviews_transformed = vectorizer.transform(new_reviews)
# Make predictions
new_predictions = model.predict(new_reviews_transformed)
print("\n--- Predictions for New Reviews ---")
for review, prediction in zip(new_reviews, new_predictions):
    print(f"Review: '{review}' ==> Predicted Sentiment: '{prediction.upper()}'")