Multinomial Naive Bayes (MultinomialNB)

Multinomial Naive Bayes (MultinomialNB) is a variant of the Naive Bayes algorithm designed to work with discrete data, particularly data that represents counts or frequencies. This makes it an excellent and highly popular choice for text classification tasks, such as the spam detection example walked through below.

The "Multinomial" part refers to the assumption that the features (in this case, the word counts) are generated from a multinomial distribution. In simple terms, it's a model that is very good at handling features that represent the number of times an event has occurred (e.g., the number of times the word "prize" appears in an email).


How it Works in the Code Below

Let's break down how MultinomialNB is used in the spam classification script:

1. Feature Extraction (CountVectorizer):

   - The first crucial step is to convert the raw text of the emails into numerical features that MultinomialNB can understand. CountVectorizer does this by creating a matrix where each row is an email and each column represents a unique word from the entire vocabulary. The value in each cell is the count of how many times that word appeared in that email.

   - This count-based feature representation is exactly what MultinomialNB is designed to work with (a short sketch of this count matrix follows this list).

2. Model Initialization (model = MultinomialNB()):

   - This line creates an instance of the Multinomial Naive Bayes classifier. At this point, the model is empty and has not learned anything.

3. Training (model.fit(X_train, y_train)):

   - This is the learning step. The .fit() method takes the word count matrix (X_train) and the corresponding labels (y_train, i.e., 'spam' or 'ham').

   - During this process, the model calculates the probabilities it needs to make future predictions. It calculates:

     - The prior probability of each class (the overall likelihood of an email being 'spam' or 'ham' in the training data).

     - The likelihood of each word appearing, given a class (e.g., the probability of the word "money" appearing in an email, given that the email is 'spam').

4. Prediction (model.predict(X_test)):

   - When making a prediction on a new email, the model uses these learned probabilities. It applies Bayes' Theorem to calculate the overall probability of the email belonging to each class ('spam' and 'ham'), based on the words it contains. (A sketch that inspects these learned probabilities appears after the full script.)

   - The class with the highest calculated probability is chosen as the final prediction.
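
Before the full script, it may help to see exactly what CountVectorizer produces for step 1. A minimal sketch (the two sentences are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['free prize free money', 'meeting about the project']
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())
# ['about' 'free' 'meeting' 'money' 'prize' 'project' 'the']
print(X.toarray())
# [[0 2 0 1 1 0 0]
#  [1 0 1 0 0 1 1]]

Each row is a document and each column a vocabulary word, which is exactly the kind of count matrix the classifier trains on. Here is the full script: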

# --- 1. Import the Required Libraries ---
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# --- 2. Create a Simple Dataset ---
# Let's create a simple dataset of emails and their labels.
data = {
    'text': [
        'Get free money now!', 'Limited time offer, win a prize', 'Meeting schedule for tomorrow',
        'Project update and discussion', 'Congratulations you won a lottery prize',
        'Please review the attached document', 'Claim your exclusive prize now',
        'Team lunch meeting tomorrow at 12', 'URGENT: Your account needs attention',
        'Can we reschedule our meeting?', 'Win a free vacation to paradise', 'Confirm your subscription to win'
    ],
    'label': [
        'spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'spam'
    ]
}
df = pd.DataFrame(data)

# --- 3. Feature Extraction: Converting Text to Numbers ---
# Machine learning models can't understand text directly. We need to convert the text
# into numerical features. `CountVectorizer` is a simple way to do this. It creates a
# vocabulary of all the words in the text and counts the frequency of each word in each document.
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])
y = df['label']

# Split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- 4. Create and Train the Naive Bayes Model ---
# We use MultinomialNB because our features are word counts.
model = MultinomialNB()

# Train the model using the training data.
model.fit(X_train, y_train)
print("--- Model Training Complete ---")

# --- 5. Make Predictions and Evaluate the Model ---
# Make predictions on the test data.
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred, labels=['ham', 'spam'])
class_report = classification_report(y_test, y_pred)

print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
print("\n--- Classification Report ---")
print(class_report)

# Visualize the Confusion Matrix
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# --- 6. Predict on New, Unseen Emails ---
new_emails = [
    "Let's have a meeting about the project tomorrow",
    "You won a free prize, claim your money now!",
    "Your exclusive offer is waiting",
    "Document for your review"
]
# We must use the same vectorizer that was fitted on the training data.
new_emails_transformed = vectorizer.transform(new_emails)

# Make predictions
new_predictions = model.predict(new_emails_transformed)
new_probabilities = model.predict_proba(new_emails_transformed)

print("\n--- Predictions for New Emails ---")
for email, prediction, probs in zip(new_emails, new_predictions, new_probabilities):
    # model.classes_ is sorted alphabetically, so index 0 is 'ham' and index 1 is 'spam'.
    ham_prob = probs[0]
    spam_prob = probs[1]
    print(f"Email: '{email}'")
    print(f"==> Predicted Label: '{prediction.upper()}'")
    print(f"    (Confidence: Ham={ham_prob:.2%}, Spam={spam_prob:.2%})\n")

 
