Clustering is a core technique in unsupervised machine learning used to discover natural groupings in data. The goal is to partition a dataset into several groups, or clusters, such that data points in the same cluster are very similar to each other, while data points in different clusters are very dissimilar. The algorithm is not told what the groups are beforehand; it must find these structures on its own.
1. K-Means Clustering (Centroid-Based)
K-Means is the most popular and straightforward clustering algorithm. It partitions the data into a pre-specified number of clusters ('K'). It's a centroid-based algorithm, meaning it works by identifying a center point, or centroid, for each cluster.
- How it Works:
1. Initialize: Randomly place 'K' centroids in the feature space.
2. Assign: Assign each data point to its nearest centroid, forming 'K' clusters.
3. Update: Recalculate the position of each centroid by taking the mean of all data points assigned to its cluster.
4. Repeat: The assignment and update steps are repeated iteratively until the centroids no longer move significantly.
- Best For: Datasets where the clusters are spherical, well-separated, and roughly equal in size.
- Key Parameter: You must specify the number of clusters, 'K', in advance.
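To make these four steps concrete, here is a minimal NumPy sketch of the loop (the function name, the fixed iteration cap, and the lack of empty-cluster handling are simplifications for illustration; the scikit-learn version later in this section is what you would use in practice):
import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick K existing points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign: label each point with the index of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # 4. Repeat until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids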
2. Hierarchical Clustering (Connectivity-Based)
Hierarchical clustering creates a tree-like hierarchy of clusters called a dendrogram. It doesn't require you to specify the number of clusters beforehand. Instead, you can choose the number of clusters by looking at the dendrogram after the algorithm has run.
- How it Works (Agglomerative): This is the most common, "bottom-up" approach.
1. Initialize: Start by treating each data point as its own individual cluster.
2. Merge: Find the two closest clusters and merge them into a single new cluster.
3. Repeat: Repeat the merging process until only one cluster (containing all the data points) remains. The result is a dendrogram that shows the sequence of merges. You can then "cut" the dendrogram at a certain height to get a desired number of clusters.
- Best For: When you're not sure how many clusters exist in your data and you want to understand the hierarchy of relationships between data points.
- Key Parameter: The linkage criterion (e.g., 'ward', 'complete', 'average'), which defines how the distance between clusters is measured.
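To turn the dendrogram into flat cluster labels, a common SciPy workflow is sketched below (it assumes a scaled feature matrix `X_scaled` like the one built in the full example later in this section; the cut values are illustrative):
from scipy.cluster.hierarchy import linkage, fcluster

# Build the merge hierarchy with Ward linkage, then "cut" it two different ways.
linked = linkage(X_scaled, method='ward')
labels_by_count = fcluster(linked, t=4, criterion='maxclust')      # ask for 4 flat clusters
labels_by_height = fcluster(linked, t=5.0, criterion='distance')   # or cut at a distance threshold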
3. Density-Based Clustering (DBSCAN)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a powerful algorithm that groups points that are closely packed together and marks points that lie alone in low-density regions as outliers.
- How it Works: DBSCAN defines clusters as contiguous regions of high density. The algorithm starts from an arbitrary point and expands a cluster by finding all reachable neighbors within the eps radius. A key advantage is that it can find arbitrarily shaped clusters and automatically identify noise points (which are assigned a label of -1). It relies on two key parameters:
- eps (epsilon): The maximum distance between two samples for one to be considered in the neighborhood of the other.
- min_samples: The number of samples in a neighborhood required for a point to be considered a core point.
- Best For: Datasets with non-spherical clusters, varying cluster sizes and shapes, and the presence of noise or outliers.
- Key Parameters: eps and min_samples, which define the density required to form a cluster.
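Because results are sensitive to eps, a common heuristic (sketched below, assuming a scaled feature matrix `X_scaled` as in the full example that follows) is to plot each point's distance to its k-th nearest neighbor, with k set to min_samples, and read eps off the "knee" of the curve:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 5  # typically set equal to min_samples
distances, _ = NearestNeighbors(n_neighbors=k).fit(X_scaled).kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])  # sorted distance to each point's k-th neighbor
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}-th nearest neighbor')
plt.show()  # the "knee" of this curve is a reasonable starting value for eps
The full example below then runs all three algorithms on the same synthetic dataset so their behavior can be compared side by side.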
# --- 1. Import Necessary Libraries ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage
# --- 2. Prepare Sample Data ---
# We'll use scikit-learn's `make_blobs` to create a dataset with clear clusters.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
# Scale the features, which is important for distance-based algorithms.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# --- 3. K-Means Clustering ---
# Initialize and train the K-Means model.
kmeans = KMeans(n_clusters=4, n_init='auto', random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)
kmeans_centroids = kmeans.cluster_centers_
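# (Optional sketch) K-Means needs K up front. A quick sanity check is the elbow method:
# fit the model for a range of K values and look for the "elbow" in the inertia
# (within-cluster sum of squares). The range 1-8 here is illustrative.
inertias = [KMeans(n_clusters=k, n_init='auto', random_state=42).fit(X_scaled).inertia_
            for k in range(1, 9)]
# Plotting range(1, 9) against `inertias` typically shows a clear bend at a sensible K.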
# --- 4. Hierarchical Clustering (Agglomerative) ---
# Initialize and train the Hierarchical Clustering model.
# We specify n_clusters=4 to get the same number of groups as K-Means.
hierarchical = AgglomerativeClustering(n_clusters=4)
hierarchical_labels = hierarchical.fit_predict(X_scaled)
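# (Optional sketch) The `linkage` argument controls how inter-cluster distance is measured.
# The default is 'ward' (minimizes within-cluster variance); 'complete' and 'average' are
# alternatives that can group elongated or unevenly sized clusters differently.
hierarchical_complete = AgglomerativeClustering(n_clusters=4, linkage='complete')
hierarchical_complete_labels = hierarchical_complete.fit_predict(X_scaled)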
# --- 5. Density-Based Clustering (DBSCAN) ---
# Initialize and train the DBSCAN model.
# `eps` and `min_samples` are key parameters that need to be tuned.
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
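# (Optional sketch) Summarize the DBSCAN result: the cluster count should exclude
# the noise label (-1), and the number of noise points is worth reporting.
n_dbscan_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = int(np.sum(dbscan_labels == -1))
print(f"DBSCAN found {n_dbscan_clusters} clusters and {n_noise} noise points")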
# --- 6. Visualize the Clustering Results ---
# Create a figure with subplots to compare the algorithms.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Comparison of Clustering Algorithms', fontsize=16)
# K-Means Plot
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans_labels, s=50, cmap='viridis')
axes[0].scatter(kmeans_centroids[:, 0], kmeans_centroids[:, 1], c='red', s=200, marker='X')
axes[0].set_title("K-Means Clustering")
axes[0].set_xlabel("Feature 1 (Scaled)")
axes[0].set_ylabel("Feature 2 (Scaled)")
# Hierarchical Clustering Plot
axes[1].scatter(X_scaled[:, 0], X_scaled[:, 1], c=hierarchical_labels, s=50, cmap='viridis')
axes[1].set_title("Hierarchical Clustering")
axes[1].set_xlabel("Feature 1 (Scaled)")
# DBSCAN Plot
# Note: Noise points are labeled -1 and will be plotted in a different color.
unique_labels = set(dbscan_labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (dbscan_labels == k)
    xy = X_scaled[class_member_mask]
    axes[2].plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
axes[2].set_title(f'DBSCAN (Estimated clusters: {len(unique_labels) - (1 if -1 in dbscan_labels else 0)})')
axes[2].set_xlabel("Feature 1 (Scaled)")
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
# --- 7. Visualize the Dendrogram for Hierarchical Clustering ---
# The dendrogram shows the hierarchy of merges.
plt.figure(figsize=(15, 7))
# 'ward' is a common linkage method that minimizes the variance of the clusters being merged.
linked = linkage(X_scaled, method='ward')
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
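# --- 8. (Optional) Compare the Results with a Silhouette Score ---
# A rough, illustrative comparison only: the silhouette score needs at least two clusters,
# and DBSCAN's noise points (-1) are treated here as their own group, so read the numbers
# as a sketch rather than a definitive ranking.
from sklearn.metrics import silhouette_score
for name, labels in [("K-Means", kmeans_labels),
                     ("Hierarchical", hierarchical_labels),
                     ("DBSCAN", dbscan_labels)]:
    if len(set(labels)) > 1:
        print(f"{name}: silhouette score = {silhouette_score(X_scaled, labels):.3f}")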