The fundamental difference between Supervised and Unsupervised Learning lies in the type of data used for training: Supervised Learning uses labeled data, while Unsupervised Learning uses unlabeled data.
Think of it like this:
- Supervised Learning is like learning with a teacher who gives you a workbook with questions and the correct answers. You learn by comparing your results to the known answers.
- Unsupervised Learning is like being given a box of mixed LEGO bricks and being asked to sort them into groups. No one tells you what the groups should be; you have to discover the patterns (color, size, shape) on your own.
Supervised Learning: Learning with a Teacher
In supervised learning, the algorithm learns from a dataset where each data point is tagged with a correct output or "label." The goal is to learn a mapping function that can predict the output label for new, unseen data.
Key Characteristics:
- Goal: To predict an outcome or classify data.
- Input Data: Labeled data (features + correct answers).
- Process: The model is "trained" by comparing its predictions to the correct labels and adjusting its internal parameters to minimize errors.
- Main Types:
  - Classification: The output is a category (e.g., "Spam" or "Not Spam," "Cat" or "Dog").
  - Regression: The output is a continuous value (e.g., the price of a house, the temperature tomorrow).
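To make the "Process" step concrete, here is a minimal sketch of a regression model trained by gradient descent in plain Python. The data, the function name `train_linear`, the learning rate, and the number of epochs are all illustrative assumptions. Real libraries use more robust solvers.

```python
# Toy labeled data (hypothetical): size in units of 100 m^2, price in units of $100k.
sizes = [0.5, 1.0, 1.5, 2.0, 2.5]
prices = [1.0, 2.0, 3.0, 4.0, 5.0]  # constructed so that price = 2 * size


def train_linear(xs, ys, lr=0.1, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Compare the current predictions with the known labels.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Adjust the parameters in the direction that reduces the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = train_linear(sizes, prices)
predicted = w * 3.0 + b  # prediction for an unseen size of 3.0
print(f"Learned w={w:.3f}, b={b:.3f}, predicted price for size 3.0: {predicted:.2f}")
```

Because the labels are continuous values, this is a regression task. The loop is the "compare and adjust" cycle described above: each pass measures how far the predictions are from the labels and nudges `w` and `b` to shrink that gap.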
Real-World Example: Email Spam Detection
An algorithm is fed thousands of emails that have already been labeled by humans as either "spam" or "not spam." The model learns the features (words, sender, etc.) associated with spam and uses this knowledge to classify new, incoming emails.
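As a rough sketch of how word features can drive such a classifier, the snippet below counts words per class and scores a new email with a naive Bayes-style rule (Laplace smoothing, class priors ignored because the two classes are the same size here). The four emails and the helper names `train` and `classify` are made up for illustration. Production spam filters use far more data and features.

```python
import math
from collections import Counter

# Hypothetical labeled emails: (text, label).
emails = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday with the team", "not spam"),
]


def train(examples):
    """Count how often each word appears in each class."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts


def classify(text, counts):
    """Return the class whose word counts make the text most likely."""
    vocab = {word for class_counts in counts.values() for word in class_counts}
    best_label, best_score = None, -math.inf
    for label, class_counts in counts.items():
        total = sum(class_counts.values())
        # Sum of log-probabilities with add-one (Laplace) smoothing.
        score = sum(
            math.log((class_counts[word] + 1) / (total + len(vocab)))
            for word in text.split()
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts = train(emails)
print(classify("free prize now", word_counts))
print(classify("team meeting monday", word_counts))
```

The labels supplied by humans are what let the model connect words such as "free" with the spam class.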
Code Example: Predicting Diabetes (Classification)
Here, we'll train a simple model to predict whether a person has diabetes based on their age and blood glucose level. The data is labeled because we know the outcome for each patient in our training set.
# Import the K-Nearest Neighbors classifier
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# --- Step 1: Prepare the Labeled Data ---
# Features: [Age, Blood Glucose Level]
X_train = np.array([
    [25, 110], [30, 95], [45, 160], [50, 180],
    [22, 85], [60, 190], [35, 140]
])
# Labels: 0 = No Diabetes, 1 = Diabetes
# This is the "teacher" providing the correct answers for the training data.
y_train = np.array([0, 0, 1, 1, 0, 1, 1])
# --- Step 2: Create and Train the Model ---
# Create a classifier object
# It predicts by majority vote among the 3 nearest training points
model = KNeighborsClassifier(n_neighbors=3)
# Train the model using our labeled data
model.fit(X_train, y_train)
# --- Step 3: Make a Prediction on New, Unseen Data ---
# Let's predict the outcome for a new patient: Age 48, Glucose 175
new_patient = np.array([[48, 175]])
prediction = model.predict(new_patient)
# --- Step 4: Interpret the Result ---
print(f"New Patient Data: {new_patient[0]}")
if prediction[0] == 1:
    print("Prediction: The model predicts this patient has Diabetes. 🩺")
else:
    print("Prediction: The model predicts this patient does not have Diabetes. ✅")
# Expected Output: The model predicts this patient has Diabetes.
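To see why the classifier reaches that answer, the following sketch redoes the nearest-neighbor vote by hand in plain Python, using the same toy data. The helper name `knn_predict` is hypothetical, and plain Euclidean distance on unscaled features is assumed for simplicity.

```python
import math
from collections import Counter

# Same toy training data as the example above: ((age, glucose), label).
train_data = [
    ((25, 110), 0), ((30, 95), 0), ((45, 160), 1), ((50, 180), 1),
    ((22, 85), 0), ((60, 190), 1), ((35, 140), 1),
]


def knn_predict(point, data, k=3):
    """Return (majority label, k nearest examples) using Euclidean distance."""
    by_distance = sorted(data, key=lambda item: math.dist(point, item[0]))
    nearest = by_distance[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest


label, nearest = knn_predict((48, 175), train_data)
print(f"Nearest neighbors: {nearest}")
print(f"Majority label: {label}")
```

The three closest patients are (50, 180), (45, 160), and (60, 190), all labeled 1, so the vote is unanimous. In practice, features measured on different scales are usually standardized first so that no single feature dominates the distance.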
Unsupervised Learning: Finding Hidden Patterns
In unsupervised learning, the algorithm is given a dataset without any labels. The goal is to explore the data and find some inherent structure or patterns within it on its own.
Key Characteristics:
- Goal: To discover hidden patterns or group similar data points.
- Input Data: Unlabeled data (features only, no answers).
- Process: The model tries to learn the relationships between the data points by grouping them or identifying outliers.
- Main Types:
  - Clustering: Grouping similar data points together (e.g., customer segmentation).
  - Association: Discovering rules that describe large portions of your data (e.g., "customers who buy bread also tend to buy milk").
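The bread-and-milk rule can be checked with two standard measures, support and confidence. The sketch below computes both in plain Python. The baskets are invented for illustration and the function names are hypothetical.

```python
# Hypothetical shopping baskets; each set is one transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


bread_milk_support = support({"bread", "milk"}, baskets)
bread_to_milk_conf = confidence({"bread"}, {"milk"}, baskets)
print(f"support(bread, milk) = {bread_milk_support:.2f}")
print(f"confidence(bread -> milk) = {bread_to_milk_conf:.2f}")
```

Here 3 of the 5 baskets contain both items (support 0.60), and 3 of the 4 baskets with bread also contain milk (confidence 0.75). No labels are involved; the rule comes purely from co-occurrence in the data.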
Real-World Example: Customer Segmentation
A retail company has data on the purchasing habits of its customers but doesn't have pre-defined customer "types." An unsupervised learning algorithm can process this data and automatically group customers into segments (e.g., "budget shoppers," "brand loyalists," "weekend shoppers") based on their shared behaviors.
Code Example: Grouping Customers (Clustering)
Here, we have data on customer spending habits, but we don't have labels. We want the model to discover natural groupings (clusters) within the data.
# Import the KMeans clustering algorithm
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
# --- Step 1: Prepare the Unlabeled Data ---
# Features: [Annual Income (in thousands), Spending Score (1-100)]
# Notice there are no 'y_train' labels. The model sees only the features, with no correct answers attached.
X = np.array([
    [25, 75], [30, 80], [28, 60],
    [60, 30], [55, 25], [65, 20],
    [45, 50], [50, 55]
])
# --- Step 2: Create and Train the Model ---
# Create a KMeans object. We'll ask it to find 2 clusters.
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
# Train the model. It iteratively moves the 2 cluster centers to reduce the distance between each point and its assigned center.
kmeans.fit(X)
# The model has now assigned a cluster (0 or 1) to each data point. The numbers are arbitrary IDs, not meaningful categories.
cluster_labels = kmeans.labels_
print(f"Cluster assignments for each customer: {cluster_labels}")
# --- Step 3: Make a Prediction for a New Customer ---
# Let's see which cluster a new customer belongs to: Income 58k, Spending Score 28
new_customer = np.array([[58, 28]])
prediction = kmeans.predict(new_customer)
print(f"\nNew Customer Data: {new_customer[0]}")
print(f"Prediction: This new customer belongs to Cluster {prediction[0]}.")
# --- Optional: Visualize the results ---
plt.figure(figsize=(8, 6))
# Plot the data points, coloring them by their assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis', marker='o', s=100, label='Existing Customers')
# Plot the new customer
plt.scatter(new_customer[:, 0], new_customer[:, 1], c='red', marker='*', s=200, label='New Customer')
# Plot the cluster centers
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=250, c='blue', marker='X', label='Cluster Centers')
plt.title('Customer Segmentation')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid(True)
plt.show()
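For intuition about what the clustering step does internally, here is a minimal plain-Python sketch of the classic assign-then-update loop (Lloyd's algorithm) on the same eight customers. It is not scikit-learn's implementation: the starting centers are chosen by hand as an assumption, and the helper names are hypothetical.

```python
import math

# Same unlabeled customers as above: (annual income in k$, spending score).
points = [(25, 75), (30, 80), (28, 60), (60, 30), (55, 25), (65, 20), (45, 50), (50, 55)]


def nearest_center(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def simple_kmeans(points, centers, max_iter=100):
    """Alternate between assigning points and recomputing centers until stable."""
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [nearest_center(p, centers) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center in place
        centers = new_centers
    return labels, centers


# Assumed starting centers: the first and fourth customers.
labels, centers = simple_kmeans(points, [points[0], points[3]])
new_customer_cluster = nearest_center((58, 28), centers)
print(f"Labels: {labels}")
print(f"Centers: {centers}")
print(f"New customer (58, 28) joins cluster {new_customer_cluster}")
```

With these starting centers, the three low-income, high-spending customers form one cluster and the other five form the second. A different start can number the clusters differently or settle on another grouping, which is why library implementations typically try several initializations.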