Despite its name, Logistic Regression is a fundamental classification algorithm, not a regression one. It's used to predict a discrete, categorical outcome. It is most commonly used for binary classification problems, where the outcome is one of two classes (e.g., Yes/No, True/False, 1/0).
The core idea is to take a linear equation (similar to Linear Regression) and pass the output through a special function called the sigmoid or logistic function.
The Sigmoid Function
The sigmoid function, σ(z) = 1 / (1 + e⁻ᶻ), is an "S"-shaped curve that can take any real-valued number and map it into a value between 0 and 1. This is crucial because the output can be interpreted as a probability.
- If the output of the sigmoid function is greater than 0.5, the model predicts the data point belongs to Class 1.
- If the output is less than 0.5, it predicts Class 0.
So, while Linear Regression predicts a continuous value that can range over any real number, Logistic Regression predicts a probability that is always between 0 and 1.
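To make this concrete, here is a minimal NumPy sketch of the sigmoid function (the output values in the comments are approximate):
import numpy as np
def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z)) maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))
print(sigmoid(-4.0))  # ~0.018 -> confident Class 0
print(sigmoid(0.0))   # 0.5    -> exactly on the decision boundary
print(sigmoid(4.0))   # ~0.982 -> confident Class 1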
How it Works
1. Linear Combination: Just like in Linear Regression, the algorithm starts by calculating a weighted sum of the input features. The equation is z = b₀ + b₁x₁ + b₂x₂ + ..., where the b's are the model's coefficients and b₀ is the intercept.
2. Applying the Sigmoid Function: The result z is then passed into the sigmoid function, which squashes the output to a probability between 0 and 1.
3. Making a Prediction: The model uses a decision boundary (usually 0.5) to convert this probability into a class label. For example, if the calculated probability is 0.8 (which is > 0.5), the prediction is "Class 1". If the probability is 0.2 (which is < 0.5), the prediction is "Class 0".
4. Training: The model learns the optimal coefficients (b's) by using a cost function (like Log Loss) that measures how far the predicted probabilities are from the actual class labels in the training data. It then uses an optimization algorithm (like Gradient Descent) to minimize this cost. A minimal worked sketch of these steps follows this list.
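Here is that sketch, covering steps 1–3 plus the Log Loss from step 4. The coefficient values (b0, b1, b2) and the data point below are made up purely for illustration, not learned from any data:
import numpy as np
b0, b1, b2 = -1.5, 0.8, 0.05  # illustrative intercept and feature weights
x1, x2 = 3.0, 40.0            # one data point's two feature values
z = b0 + b1 * x1 + b2 * x2    # Step 1: linear combination
p = 1.0 / (1.0 + np.exp(-z))  # Step 2: sigmoid squashes z to a probability
label = 1 if p > 0.5 else 0   # Step 3: apply the 0.5 decision boundary
print(f"z = {z:.2f}, probability = {p:.3f}, predicted class = {label}")
# Step 4 intuition: Log Loss is small when a confident prediction is right
# and very large when a confident prediction is wrong.
y_actual = 1
log_loss = -(y_actual * np.log(p) + (1 - y_actual) * np.log(1 - p))
print(f"Log Loss for this point: {log_loss:.3f}")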
Advantages
- Interpretability: It's a "white box" model. The coefficients of the trained model can be interpreted to understand the influence of each feature on the prediction.
- Efficiency: It's computationally inexpensive and fast to train, making it a great baseline model.
- Probabilistic Output: It provides not just a class prediction but also the probability of that prediction, which can be useful for understanding the model's confidence.
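The complete example below walks through all of this end to end with scikit-learn: preparing data, scaling features, training the model, evaluating it, and predicting on new data. Note that the 10-sample tumor dataset is purely illustrative; a real medical classifier would need far more data.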
# --- 1. Import Necessary Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# --- 2. Prepare Sample Data ---
# Create a dataset to classify whether a tumor is malignant or benign.
# Features: Tumor Size and Patient Age.
data = {
    'tumor_size': [1.2, 5.5, 2.1, 8.0, 3.5, 7.2, 2.8, 6.5, 4.1, 1.8],
    'patient_age': [25, 60, 35, 70, 40, 65, 30, 68, 45, 28],
    'is_malignant': [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]  # 1 = Yes, 0 = No
}
df = pd.DataFrame(data)
X = df[['tumor_size', 'patient_age']]
y = df['is_malignant']
# Split the data for training and testing. `stratify=y` keeps the class
# balance similar in both splits, which matters with so few samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# --- 3. Feature Scaling ---
# Scaling isn't strictly required for Logistic Regression to work, but
# scikit-learn applies L2 regularization by default (which is sensitive to
# feature scale), and standardized features help the optimizer converge faster.
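# StandardScaler standardizes each feature: x_scaled = (x - mean) / std,
# with the mean and std estimated from the training set only. Reusing those
# training statistics on the test set (transform, not fit) avoids data leakage.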
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# --- 4. Create and Train the Logistic Regression Model ---
# Initialize the classifier. `random_state=42` makes results reproducible for
# solvers that involve randomness (the default 'lbfgs' solver is deterministic,
# so it has no effect here, but it does no harm).
model = LogisticRegression(random_state=42)
# Train the model on the scaled training data.
model.fit(X_train_scaled, y_train)
print("--- Model Training Complete ---")
# --- 5. Make Predictions and Evaluate the Model ---
# Use the trained model to make predictions on the unseen test data.
y_pred = model.predict(X_test_scaled)
# Calculate performance metrics.
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
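# scikit-learn lays out the confusion matrix with actual classes as rows and
# predicted classes as columns:
# [[true negatives,  false positives],
#  [false negatives, true positives]]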
class_report = classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'])
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
print("\n--- Confusion Matrix ---")
print(conf_matrix)
print("\n--- Classification Report ---")
print(class_report)
# --- 6. Visualize the Decision Boundary ---
# This helps us see how the model separates the two classes.
plt.figure(figsize=(10, 6))
# Create a meshgrid to plot the decision boundary
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
# Make predictions on the meshgrid
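# np.c_ column-stacks the flattened grid axes into an (n_points, 2) array,
# so every grid cell is scored as if it were a sample with two features.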
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the decision boundary
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
# Plot the training data points
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, edgecolors='k', cmap=plt.cm.coolwarm)
plt.title("Logistic Regression Decision Boundary")
plt.xlabel("Tumor Size (Scaled)")
plt.ylabel("Patient Age (Scaled)")
plt.show()
# --- 7. Predict on New, Unseen Data ---
# Create a new DataFrame with new data points to predict.
new_patients = pd.DataFrame({
    'tumor_size': [2.5, 7.0],
    'patient_age': [32, 67]
})
# Scale the new data using the same scaler.
new_patients_scaled = scaler.transform(new_patients)
# Make predictions and get probabilities.
new_predictions = model.predict(new_patients_scaled)
new_probabilities = model.predict_proba(new_patients_scaled)
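# predict_proba returns one row per sample with class probabilities in the
# order of model.classes_, i.e. [P(benign = 0), P(malignant = 1)].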
predicted_labels = ['Malignant' if pred == 1 else 'Benign' for pred in new_predictions]
print("\n--- Predictions for New Patients ---")
for i, (_, patient) in enumerate(new_patients.iterrows()):
    print(f"Patient Details: {patient.to_dict()}")
    print(f"==> Predicted Status: '{predicted_labels[i].upper()}'")
    print(f"    (Confidence: Benign={new_probabilities[i][0]:.2%}, Malignant={new_probabilities[i][1]:.2%})\n")