Density-Based Clustering is a type of unsupervised learning that groups points that are closely packed, marking as outliers the points that lie alone in low-density regions. The most popular and important density-based algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
Unlike K-Means, which tends to produce compact, roughly spherical clusters, DBSCAN can find arbitrarily shaped clusters and, crucially, can automatically flag points that don't belong to any cluster as noise (outliers).
How DBSCAN Works: The Core Concepts
DBSCAN defines clusters based on the density of data points in the feature space. It has two key parameters that you must define:
1. eps (epsilon): This is a distance measure. It defines the radius of a neighborhood around a data point.
2. min_samples: This is the minimum number of data points required to be within a point's eps-radius (including the point itself) for that point to be considered a core point.
Based on these parameters, DBSCAN categorizes every data point into one of three types (a short sketch after this list shows how each type can be computed):
- Core Point: A point that has at least min_samples points (including itself) within its eps-radius. These points are the heart of a cluster.
- Border Point: A point that is within the eps-radius of a core point but does not have enough neighbors to be a core point itself. These points are on the edge of a cluster.
- Noise Point (Outlier): A point that is neither a core point nor a border point. These points are in low-density regions.
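The following minimal sketch illustrates these three definitions using brute-force pairwise distances. The helper name classify_points is ours for illustration and is not part of scikit-learn:
import numpy as np
from sklearn.metrics import pairwise_distances

def classify_points(X, eps, min_samples):
    """Label each row of X as core, border, or noise (illustrative only)."""
    dist = pairwise_distances(X)                   # brute-force distance matrix
    n_neighbors = (dist <= eps).sum(axis=1)        # counts the point itself
    is_core = n_neighbors >= min_samples
    # Border points: not core, but within eps of at least one core point.
    near_core = ((dist <= eps) & is_core[np.newaxis, :]).any(axis=1)
    is_border = ~is_core & near_core
    is_noise = ~is_core & ~is_border
    return is_core, is_border, is_noise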
The algorithm starts from an arbitrary unvisited point and finds all of its neighbors within the eps-radius. If the point is a core point, a new cluster is started and expanded by visiting the neighbors of its neighbors, and so on, until no more reachable points can be added. The process then repeats from the next unvisited point until every point has been assigned to a cluster or marked as noise.
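The sketch below mirrors that expansion step as a naive O(n²) illustration; scikit-learn's DBSCAN implements the same idea with spatial indexing and is what you should use in practice:
import numpy as np
from sklearn.metrics import pairwise_distances

def dbscan_sketch(X, eps, min_samples):
    """Naive DBSCAN for illustration; scikit-learn's version is far faster."""
    dist = pairwise_distances(X)
    neighbors = [np.flatnonzero(row <= eps) for row in dist]
    is_core = np.array([len(nbrs) >= min_samples for nbrs in neighbors])
    labels = np.full(len(X), -1)       # -1 doubles as "unvisited" and "noise"
    cluster_id = 0
    for i in range(len(X)):
        if labels[i] != -1 or not is_core[i]:
            continue                   # already clustered, or cannot seed one
        labels[i] = cluster_id
        frontier = list(neighbors[i])  # expand outward from this core point
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                if is_core[j]:         # only core points extend the frontier;
                    frontier.extend(neighbors[j])  # border points just join
        cluster_id += 1
    return labels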
Advantages and Disadvantages
- Advantages:
  - Does not require you to specify the number of clusters beforehand.
  - Can find arbitrarily shaped clusters (e.g., rings, spirals, half-moons).
  - Is robust to outliers and explicitly labels them as noise.
- Disadvantages:
  - Results are highly sensitive to the choice of eps and min_samples, which can be difficult to determine; a common heuristic, the k-distance plot, is sketched after this list.
  - It struggles with clusters of varying densities, because a single eps and min_samples apply to the entire dataset.
  - It can be sensitive to the distance metric used, especially in high-dimensional spaces where distances become less meaningful.
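As promised above, the k-distance plot is a common heuristic for picking eps: sort every point's distance to its k-th nearest neighbor (k is often set to min_samples) and choose a value near the "elbow" where the curve bends sharply. A minimal sketch, assuming the same moon-shaped data used in the full example below:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

k = 5  # a common choice is k = min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)   # column 0 is the point itself
k_distances = np.sort(distances[:, -1])  # distance to k-th point, incl. self
plt.plot(k_distances)
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}th nearest point (incl. self)")
plt.title("k-distance plot: look for the elbow to pick eps")
plt.show()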
# --- 1. Import Necessary Libraries ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons
# --- 2. Prepare Sample Data ---
# We'll use `make_moons` to create a dataset with non-spherical clusters,
# a shape DBSCAN handles well but that K-Means cannot separate cleanly.
X, y_true = make_moons(n_samples=300, noise=0.1, random_state=42)
# --- 3. Feature Scaling ---
# As a distance-based algorithm, it's good practice to scale the features.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# --- 4. Create and Train the DBSCAN Model ---
# Initialize the DBSCAN algorithm.
# `eps=0.3`: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
# `min_samples=5`: The number of samples in a neighborhood for a point to be considered as a core point.
# These parameters often require tuning for a specific dataset.
model = DBSCAN(eps=0.3, min_samples=5)
# Train the model and get the cluster labels for each data point.
# For DBSCAN, `.fit_predict()` performs the clustering and returns the labels.
# Note: Noise points are given the label -1.
cluster_labels = model.fit_predict(X_scaled)
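# `model.labels_` holds the same array of labels, and
# `model.core_sample_indices_` (used for the plot below) gives the indices
# of the points DBSCAN identified as core points.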
print("--- Model Training Complete ---")
# Identify the number of clusters found (excluding noise)
n_clusters_ = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise_ = list(cluster_labels).count(-1)
print(f"Estimated number of clusters: {n_clusters_}")
print(f"Estimated number of noise points: {n_noise_}")
# --- 5. Visualize the Clustering Results ---
# This helps us see the arbitrarily shaped clusters and the identified noise points.
plt.figure(figsize=(10, 6))
# Create a mask for core samples, which are identified by the model.
core_samples_mask = np.zeros_like(model.labels_, dtype=bool)
core_samples_mask[model.core_sample_indices_] = True
# Get unique labels and set up colors.
unique_labels = set(cluster_labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
# Plot each cluster and the noise points.
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (cluster_labels == k)
    # Plot the non-core points (border points and noise) with smaller markers.
    xy = X_scaled[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
    # Plot the core points with a larger size.
    xy = X_scaled[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=14)
plt.title(f'DBSCAN Clustering Results (Estimated clusters: {n_clusters_})')
plt.xlabel("Feature 1 (Scaled)")
plt.ylabel("Feature 2 (Scaled)")
plt.grid(True)
plt.show()
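Because make_moons also returns ground-truth labels, one optional sanity check on this synthetic data (not possible on real, unlabeled data) is to compare the DBSCAN labels against them, for example with the adjusted Rand index:
# Optional evaluation against the known labels from make_moons.
# Noise points (label -1) simply count as one extra group in the comparison.
from sklearn.metrics import adjusted_rand_score
print(f"Adjusted Rand index vs. ground truth: {adjusted_rand_score(y_true, cluster_labels):.3f}")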