Naive Bayes

Unit 2: Guide to Machine Learning Algorithms

Naive Bayes is a simple yet powerful probabilistic classifier based on Bayes' Theorem. It's particularly effective for text classification tasks like spam detection and sentiment analysis. The "naive" part of its name comes from a strong, simplifying assumption it makes about the data.

The "Naive" Assumption: Feature Independence

The core assumption of Naive Bayes is that all the features of a data point are independent of one another, given the class. In the context of text classification, this means the algorithm assumes that, once the class of a document is known, the presence of one word tells you nothing about the presence of any other word.

For example, within spam emails it treats the appearance of the word "free" and the appearance of the word "money" as independent events. This is often not true in the real world (these words frequently appear together in spam), but the algorithm still works remarkably well in practice and is computationally very efficient.
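
Stated as a formula, the assumption says that the joint likelihood of the words factorizes into one term per word once the class is fixed. The symbols below ($w_i$ for the i-th word, $c$ for the class, $n$ for the number of words) are notation introduced here for illustration:

```latex
P(w_1, w_2, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c)
\qquad \text{e.g.} \qquad
P(\text{free}, \text{money} \mid \text{Spam})
  = P(\text{free} \mid \text{Spam}) \, P(\text{money} \mid \text{Spam})
```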

How it Works for Text Classification (Spam Detection)

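1. Calculate Prior Probabilities: The algorithm first calculates the overall probability of each class in the training data.

   - P(Spam) = (Number of Spam Emails) / (Total Emails)

   - P(Ham) = (Number of Ham Emails) / (Total Emails)

2. Calculate Likelihoods: It then calculates the probability of each word appearing, given a class.

   - P("free" | Spam) = (Number of times "free" appears in spam emails) / (Total words in spam emails)

   - P("meeting" | Ham) = (Number of times "meeting" appears in ham emails) / (Total words in ham emails)

   In practice a small constant is added to every count (Laplace smoothing; scikit-learn's MultinomialNB uses alpha=1.0 by default) so that a word never seen in a class does not force the whole product to zero.

3. Make a Prediction: When a new email comes in, the algorithm applies Bayes' theorem to score each class: it multiplies the class prior by the likelihoods of all the words in the new email. The class with the highest score is the prediction. The denominator of Bayes' theorem, P(email), is the same for every class, so it can be ignored when comparing classes. Implementations usually add log-probabilities instead of multiplying raw probabilities, which avoids numerical underflow on long documents.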
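
The three steps above can be sketched in plain Python. This is a minimal illustration and not how scikit-learn implements it: the four-email training set, the helper names `likelihood` and `predict`, and the choice of add-one smoothing with log-probabilities are all assumptions made for the example.

```python
import math
from collections import Counter

# Tiny hypothetical training set (illustrative data, not from a real corpus).
train = [
    ("free money now", "spam"),
    ("win free prize", "spam"),
    ("meeting tomorrow", "ham"),
    ("project meeting update", "ham"),
]

# Step 1: prior probabilities, P(class) = emails in class / total emails.
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Step 2: per-class word counts, the raw material for P(word | class).
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}


def likelihood(word, c, alpha=1.0):
    """P(word | c) with add-alpha (Laplace) smoothing."""
    total = sum(word_counts[c].values())
    return (word_counts[c][word] + alpha) / (total + alpha * len(vocab))


def predict(text):
    """Step 3: return the class with the highest log-posterior score."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for word in text.split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log(likelihood(word, c))
        scores[c] = score
    return max(scores, key=scores.get)


print(predict("free prize"))
print(predict("project meeting"))
```

Summing logarithms gives the same ranking of classes as multiplying the probabilities, because the logarithm is monotonic.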

Types of Naive Bayes

MultinomialNB: Used for features that represent counts or frequencies (like word counts in a document).
GaussianNB: Used for continuous features that are assumed to follow a normal (Gaussian) distribution.
BernoulliNB: Used for binary features (e.g., a word is either present or not present).
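
As a rough sketch of how the three variants map onto feature types, the snippet below fits each scikit-learn class on a matching toy array. All the numbers are made up for illustration, and the binary features are simply the count features thresholded at zero.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array(["ham", "ham", "spam", "spam"])

# Word-count features (rows = documents, columns = vocabulary words).
X_counts = np.array([[2, 0, 0], [1, 1, 0], [0, 0, 3], [0, 1, 2]])
multinomial = MultinomialNB().fit(X_counts, y)

# Binary presence/absence features, derived here by thresholding the counts.
X_binary = (X_counts > 0).astype(int)
bernoulli = BernoulliNB().fit(X_binary, y)

# Continuous features: two hypothetical real-valued measurements per sample.
X_continuous = np.array([[0.1, 5.2], [0.3, 4.8], [2.9, 1.1], [3.2, 0.7]])
gaussian = GaussianNB().fit(X_continuous, y)

print(multinomial.predict([[3, 0, 0]]))
print(bernoulli.predict([[0, 0, 1]]))
print(gaussian.predict([[3.0, 1.0]]))
```

All three share the same `fit` / `predict` interface; only the assumed distribution of each feature within a class differs.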

Naive Bayes: A Complete, Commented Code Example

# --- 1. Import Libraries ---
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# --- 2. Create a Sample Dataset ---
# Let's create a simple dataset of emails and their labels.
data = {
    'text': [
        'Get free money now!', 'Limited time offer, win a prize', 'Meeting schedule for tomorrow',
        'Project update and discussion', 'Congratulations you won a lottery prize',
        'Please review the attached document', 'Claim your exclusive prize now',
        'Team lunch meeting tomorrow at 12', 'URGENT: Your account needs attention',
        'Can we reschedule our meeting?', 'Win a free vacation to paradise', 'Confirm your subscription to win'
    ],
    'label': [
        'spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'spam'
    ]
}
df = pd.DataFrame(data)

# --- 3. Feature Extraction: Converting Text to Numbers ---
# Machine learning models can't work with raw text directly, so we convert the text
# into numerical features. `CountVectorizer` is a simple way to do this: it builds a
# vocabulary of the words in the text and counts how often each word occurs in each document.

# Split the raw text first, so the vocabulary is learned from the training emails only
# and nothing from the test emails leaks into the features.
text_train, text_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.3, random_state=42
)

vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(text_train)  # learn the vocabulary and count words
X_test = vectorizer.transform(text_test)        # reuse the training vocabulary

# --- 4. Create and Train the Naive Bayes Model ---
# We use MultinomialNB because our features are word counts.
model = MultinomialNB()

# Train the model using the training data.
model.fit(X_train, y_train)
print("--- Model Training Complete ---")

# --- 5. Make Predictions and Evaluate the Model ---
# Make predictions on the test data.
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred, labels=['ham', 'spam'])
class_report = classification_report(y_test, y_pred)

print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
print("\n--- Classification Report ---")
print(class_report)

# Visualize the Confusion Matrix
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# --- 6. Predict on New, Unseen Emails ---
new_emails = [
    "Let's have a meeting about the project tomorrow",
    "You won a free prize, claim your money now!",
    "Your exclusive offer is waiting",
    "Document for your review"
]

# We must use the same vectorizer that was fitted on the training data.
new_emails_transformed = vectorizer.transform(new_emails)

# Make predictions
new_predictions = model.predict(new_emails_transformed)
new_probabilities = model.predict_proba(new_emails_transformed)

print("\n--- Predictions for New Emails ---")
for email, prediction, probs in zip(new_emails, new_predictions, new_probabilities):
    # predict_proba columns follow model.classes_, which is sorted: ['ham', 'spam'].
    ham_prob = probs[0]
    spam_prob = probs[1]
    print(f"Email: '{email}'")
    print(f"==> Predicted Label: '{prediction.upper()}'")
    print(f" (Confidence: Ham={ham_prob:.2%}, Spam={spam_prob:.2%})\n")
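
The script above reads the ham and spam probabilities by position (`probs[0]`, `probs[1]`) and has to reuse the fitted vectorizer by hand. One possible way to make both steps harder to get wrong is sketched below, using scikit-learn's `make_pipeline` and the `classes_` attribute. The four-email dataset and the names `spam_filter` and `by_class` are hypothetical and exist only to keep the snippet self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature dataset, used only to keep the sketch self-contained.
texts = ["win a free prize now", "claim your free money",
         "meeting schedule for tomorrow", "project update and discussion"]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier together, so the
# vocabulary is always learned from the same data the model is trained on.
spam_filter = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
spam_filter.fit(texts, labels)

# classes_ gives the column order of predict_proba (sorted label values),
# so probabilities can be looked up by label instead of by position.
classes = list(spam_filter.classes_)
probs = spam_filter.predict_proba(["free prize for you"])[0]
by_class = dict(zip(classes, probs))

print(classes)
print(by_class)
```

Because the pipeline accepts raw strings, new emails can be passed straight to `predict` or `predict_proba` without a separate `transform` call.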
