Despite its name, Logistic Regression is a fundamental classification algorithm, not a regression one. It's used to predict a discrete, categorical outcome. It is most commonly used for binary classification problems, where the outcome is one of two classes (e.g., Yes/No, True/False, 1/0).
The core idea is to take a linear equation (similar to Linear Regression) and pass the output through a special function called the sigmoid or logistic function.
The Sigmoid Function
The sigmoid function, σ(z) = 1 / (1 + e⁻ᶻ), is an "S"-shaped curve that can take any real-valued number and map it into a value between 0 and 1. This is crucial because the output can be interpreted as a probability.
- If the output of the sigmoid function is greater than 0.5, the model predicts the data point belongs to Class 1.
- If the output is less than 0.5, it predicts Class 0.
So, while Linear Regression predicts a continuous value that can range over any real number, Logistic Regression predicts a probability that is always between 0 and 1.
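To make this concrete, here is a minimal NumPy sketch of the sigmoid function (the output values in the comments are approximate):
import numpy as np
def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z)) maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))
print(sigmoid(-4.0))  # ~0.018 -> confident Class 0
print(sigmoid(0.0))   # 0.5    -> exactly on the decision boundary
print(sigmoid(4.0))   # ~0.982 -> confident Class 1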
How it Works
1. Linear Combination: Just like in Linear Regression, the algorithm starts by calculating a weighted sum of the input features. The equation is z = b₀ + b₁x₁ + b₂x₂ + ..., where the b's are the model's coefficients and b₀ is the intercept.
2. Applying the Sigmoid Function: The result z is then passed into the sigmoid function, which squashes the output to a probability between 0 and 1.
3. Making a Prediction: The model uses a decision boundary (usually 0.5) to convert this probability into a class label. For example, if the calculated probability is 0.8 (which is > 0.5), the prediction is "Class 1". If the probability is 0.2 (which is < 0.5), the prediction is "Class 0".
4. Training: The model learns the optimal coefficients (b's) by using a cost function (like Log Loss) that measures how far the predicted probabilities are from the actual class labels in the training data. It then uses an optimization algorithm (like Gradient Descent) to minimize this cost. A minimal worked sketch of these steps follows this list.
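Here is that sketch, covering steps 1–3 plus the Log Loss from step 4. The coefficient values (b0, b1, b2) and the data point below are made up purely for illustration, not learned from any data:
import numpy as np
b0, b1, b2 = -1.5, 0.8, 0.05  # illustrative intercept and feature weights
x1, x2 = 3.0, 40.0            # one data point's two feature values
z = b0 + b1 * x1 + b2 * x2    # Step 1: linear combination
p = 1.0 / (1.0 + np.exp(-z))  # Step 2: sigmoid squashes z to a probability
label = 1 if p > 0.5 else 0   # Step 3: apply the 0.5 decision boundary
print(f"z = {z:.2f}, probability = {p:.3f}, predicted class = {label}")
# Step 4 intuition: Log Loss is small when a confident prediction is right
# and very large when a confident prediction is wrong.
y_actual = 1
log_loss = -(y_actual * np.log(p) + (1 - y_actual) * np.log(1 - p))
print(f"Log Loss for this point: {log_loss:.3f}")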
Advantages
- Interpretability: It's a "white box" model. The coefficients of the trained model can be interpreted to understand the influence of each feature on the prediction.
- Efficiency: It's computationally inexpensive and fast to train, making it a great baseline model.
- Probabilistic Output: It provides not just a class prediction but also the probability of that prediction, which can be useful for understanding the model's confidence.
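The complete example below walks through all of this end to end with scikit-learn: preparing data, scaling features, training the model, evaluating it, and predicting on new data. Note that the 10-sample tumor dataset is purely illustrative; a real medical classifier would need far more data.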
# --- 1. Import Necessary Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# --- 2. Prepare Sample Data ---
# Create a dataset to classify whether a tumor is malignant or benign.
# Features: Tumor Size and Patient Age.
data = {
    'tumor_size': [1.2, 5.5, 2.1, 8.0, 3.5, 7.2, 2.8, 6.5, 4.1, 1.8],
    'patient_age': [25, 60, 35, 70, 40, 65, 30, 68, 45, 28],
    'is_malignant': [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]  # 1 = Yes, 0 = No
}
df = pd.DataFrame(data)
X = df[['tumor_size', 'patient_age']]
y = df['is_malignant']
# Split the data for training and testing. `stratify=y` keeps the class
# balance similar in both splits, which matters with so few samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# --- 3. Feature Scaling ---
# Scaling isn't strictly required for Logistic Regression to work, but
# scikit-learn applies L2 regularization by default (which is sensitive to
# feature scale), and standardized features help the optimizer converge faster.
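# StandardScaler standardizes each feature: x_scaled = (x - mean) / std,
# with the mean and std estimated from the training set only. Reusing those
# training statistics on the test set (transform, not fit) avoids data leakage.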
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# --- 4. Create and Train the Logistic Regression Model ---
# Initialize the classifier. `random_state=42` makes results reproducible for
# solvers that involve randomness (the default 'lbfgs' solver is deterministic,
# so it has no effect here, but it does no harm).
model = LogisticRegression(random_state=42)
# Train the model on the scaled training data.
model.fit(X_train_scaled, y_train)
print("--- Model Training Complete ---")
# --- 5. Make Predictions and Evaluate the Model ---
# Use the trained model to make predictions on the unseen test data.
y_pred = model.predict(X_test_scaled)
# Calculate performance metrics.
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
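# scikit-learn lays out the confusion matrix with actual classes as rows and
# predicted classes as columns:
# [[true negatives,  false positives],
#  [false negatives, true positives]]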
class_report = classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'])
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
print("\n--- Confusion Matrix ---")
print(conf_matrix)
print("\n--- Classification Report ---")
print(class_report)
# --- 6. Visualize the Decision Boundary ---
# This helps us see how the model separates the two classes.
plt.figure(figsize=(10, 6))
# Create a meshgrid to plot the decision boundary
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
# Make predictions on the meshgrid
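# np.c_ column-stacks the flattened grid axes into an (n_points, 2) array,
# so every grid cell is scored as if it were a sample with two features.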
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the decision boundary
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
# Plot the training data points
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, edgecolors='k', cmap=plt.cm.coolwarm)
plt.title("Logistic Regression Decision Boundary")
plt.xlabel("Tumor Size (Scaled)")
plt.ylabel("Patient Age (Scaled)")
plt.show()
# --- 7. Predict on New, Unseen Data ---
# Create a new DataFrame with new data points to predict.
new_patients = pd.DataFrame({
    'tumor_size': [2.5, 7.0],
    'patient_age': [32, 67]
})
# Scale the new data using the same scaler.
new_patients_scaled = scaler.transform(new_patients)
# Make predictions and get probabilities.
new_predictions = model.predict(new_patients_scaled)
new_probabilities = model.predict_proba(new_patients_scaled)
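# predict_proba returns one row per sample with class probabilities in the
# order of model.classes_, i.e. [P(benign = 0), P(malignant = 1)].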
predicted_labels = ['Malignant' if pred == 1 else 'Benign' for pred in new_predictions]
print("\n--- Predictions for New Patients ---")
for i, (_, patient) in enumerate(new_patients.iterrows()):
    print(f"Patient Details: {patient.to_dict()}")
    print(f"==> Predicted Status: '{predicted_labels[i].upper()}'")
    print(f"    (Confidence: Benign={new_probabilities[i][0]:.2%}, Malignant={new_probabilities[i][1]:.2%})\n")