Hierarchical Clustering

Hierarchical Clustering is an unsupervised learning algorithm that, unlike K-Means, does not require you to pre-specify the number of clusters. Instead, it builds a hierarchy of clusters, which is often visualized as a tree-like diagram called a dendrogram.

This method is particularly useful when you're not sure how many clusters are in your data or when you want to understand the relationships and nested structure between different groups.


How it Works: The Agglomerative Approach

The most common type of hierarchical clustering is agglomerative, which is a "bottom-up" approach. It works as follows (a minimal code sketch follows these steps):

1.   Initialize: The algorithm starts by treating every single data point as its own cluster. So, if you have 300 data points, you start with 300 clusters.

2.   Merge the Closest Pair: It then finds the two closest clusters in the entire dataset and merges them into a single new cluster.

3.   Repeat: This merging process is repeated iteratively. At each step, the two closest clusters are merged until only one large cluster, containing all the data points, remains.
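
To make the three steps concrete, here is a minimal from-scratch sketch of the bottom-up merging loop. It uses naive single linkage (the distance between the closest pair of points) purely for brevity; the function name naive_agglomerative and the sample points are illustrative only, and real implementations such as SciPy's are far more efficient.

import numpy as np

def naive_agglomerative(X, n_clusters):
    # Step 1: every point starts as its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(X))]
    # Step 3: keep merging until the desired number of clusters remains.
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        # Step 2: find the two closest clusters (single linkage: closest pair of points).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Merge the closest pair into a single cluster and repeat.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
print(naive_agglomerative(points, n_clusters=2))  # -> [[0, 1], [2, 3, 4]]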


The Dendrogram: Visualizing the Hierarchy

The key output of hierarchical clustering is the dendrogram. This diagram shows the sequence of merges that the algorithm performed.

  • The y-axis represents the distance or dissimilarity between clusters.

  • The x-axis represents the individual data points.

  • Each merge is drawn as a link joining two branches, and the height of that link on the y-axis indicates the distance at which the two clusters were merged.

By looking at the dendrogram, you can decide on the number of clusters. You "cut" the tree horizontally at a certain distance. The number of vertical lines your horizontal cut intersects is the number of clusters you will have.
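
In code, this cut can be made with SciPy's fcluster function applied to a linkage matrix. The sketch below is a self-contained, hedged example on made-up two-blob data; the threshold of 5.0 is an illustrative value that you would normally read off your own dendrogram.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
demo = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
Z = linkage(demo, method='ward')
# "Cut" the tree at height 5.0: every merge above that distance is undone.
labels = fcluster(Z, t=5.0, criterion='distance')
print(np.unique(labels))  # cluster ids for the two blobs, e.g. [1 2]
# Alternatively, ask directly for a fixed number of clusters:
labels_k = fcluster(Z, t=2, criterion='maxclust')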


Linkage Criterion: How to Measure Cluster Distance

A crucial part of the algorithm is how it measures the distance between two clusters (not just two points). This is determined by the linkage criterion (a short comparison sketch follows this list):

  • Ward (most common): Merges the pair of clusters that leads to the minimum increase in the total within-cluster variance. It's a good default choice.

  • Complete Linkage: The distance between two clusters is the maximum distance between any two points in the two clusters.

  • Average Linkage: The distance between two clusters is the average distance between all pairs of points in the two clusters.
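
As a quick illustration of how the choice matters in practice, the hedged sketch below runs scikit-learn's AgglomerativeClustering on the same toy data with each criterion; the demo dataset is made up for this comparison and is separate from the example further down the page.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.5, random_state=0)
for link in ['ward', 'complete', 'average']:
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X_demo)
    # The cluster sizes can differ depending on the linkage criterion.
    print(link, np.bincount(labels))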


Advantages and Disadvantages

  • Advantages:

    • You don't need to specify the number of clusters beforehand.

    • The dendrogram provides a rich visualization of the relationships in the data.

    • It can work with any distance metric (see the sketch after this list).

  • Disadvantages:

    • It can be computationally expensive, especially for large datasets, with a time complexity of at least O(n²).

    • The decisions to merge clusters are final and cannot be undone, which can lead to suboptimal clusters.

    • It can be sensitive to noise and outliers.
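
As a small, hedged sketch of the "any distance metric" advantage above, the snippet below clusters a toy array with Manhattan distance and average linkage. Note that Ward linkage only supports Euclidean distance, and that in scikit-learn versions before 1.2 this parameter was called affinity rather than metric.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
# Manhattan (L1) distance with average linkage; Ward would require Euclidean.
manhattan_model = AgglomerativeClustering(n_clusters=2, metric='manhattan', linkage='average')
print(manhattan_model.fit_predict(pts))  # two groups of three points each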

 

# --- 1. Import Necessary Libraries ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

# --- 2. Prepare Sample Data ---
# We'll use scikit-learn's `make_blobs` to create a dataset with clear clusters.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# --- 3. Feature Scaling ---
# Hierarchical clustering is distance-based, so it's good practice to scale the features.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 4. Visualize the Dendrogram to Determine Cluster Count ---
# The dendrogram helps us visualize the hierarchy and decide on the number of clusters.
plt.figure(figsize=(15, 7))

# 'ward' is a common linkage method that minimizes the variance of the clusters being merged.
linked = linkage(X_scaled, method='ward')

# Create the dendrogram plot.
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)

plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance (Ward Linkage)')
plt.show()

# From the dendrogram, we can see that cutting the tree where there are 4 distinct
# vertical lines seems like a good choice, confirming our initial data generation.

# --- 5. Create and Train the Hierarchical Clustering Model ---
# Initialize the model.
# `n_clusters=4`: We specify the number of clusters based on our dendrogram analysis.
# `linkage='ward'`: Specifies the linkage criterion.
model = AgglomerativeClustering(n_clusters=4, linkage='ward')

# Train the model and get the cluster labels for each data point.
# For this algorithm, `.fit_predict()` performs the clustering and returns the labels.
cluster_labels = model.fit_predict(X_scaled)
print("--- Model Training Complete ---")
print(f"\nFirst 10 cluster assignments: {cluster_labels[:10]}")

# --- 6. Visualize the Clustering Results ---
# This helps us see the final groups the algorithm has found.
plt.figure(figsize=(10, 6))

# Plot the data points, colored by their assigned cluster label.
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=cluster_labels, s=50, cmap='viridis', alpha=0.7)

plt.title("Hierarchical Clustering Results (n_clusters=4)")
plt.xlabel("Feature 1 (Scaled)")
plt.ylabel("Feature 2 (Scaled)")
plt.grid(True)

plt.show()
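
As an optional extra step that is not part of the original walkthrough: because make_blobs also returned the true labels (y_true), we can sanity-check the result against them, for example with the adjusted Rand index.

from sklearn.metrics import adjusted_rand_score

# Runs after the script above; y_true and cluster_labels come from steps 2 and 5.
print(f"Adjusted Rand Index vs. true labels: {adjusted_rand_score(y_true, cluster_labels):.3f}")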
