K-Means is one of the most popular and widely used clustering algorithms in unsupervised learning. It is a centroid-based algorithm: its goal is to partition a dataset into a pre-specified number of clusters ('K'), where each cluster is represented by its center point, or centroid.
The algorithm is intuitive and relatively simple to understand. It aims to create clusters where the data points within a cluster are as similar as possible, and the data points in different clusters are as dissimilar as possible.
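In other words, K-Means minimizes the within-cluster sum of squared distances to the centroids (the quantity scikit-learn exposes as the "inertia"):

$$ J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 $$

where $C_k$ is the set of points assigned to cluster $k$ and $\mu_k$ is that cluster's centroid.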
How K-Means Works: An Iterative Process
The K-Means algorithm works through a simple, iterative process of assigning data points to clusters and then updating the cluster centers; a minimal code sketch of this loop follows the numbered steps below.
1. Choose 'K': The first and most critical step is to decide on the number of clusters, 'K', that you want to find in your data. This is a hyperparameter that you must provide to the algorithm.
2. Initialize Centroids: In the classic version, the algorithm randomly selects 'K' data points from the dataset to serve as the initial centroids. (Smarter seeding schemes such as k-means++ are common in practice.)
3. Assignment Step: The algorithm goes through each data point in the dataset and assigns it to the cluster of its nearest centroid. "Nearness" is typically measured using the Euclidean distance.
4. Update Step: After all data points have been assigned to a cluster, the algorithm recalculates the position of each of the 'K' centroids. The new position of a centroid is the mean (average) of all the data points that were assigned to its cluster.
5. Repeat until Convergence: Steps 3 and 4 are repeated until the centroid positions no longer change significantly from one iteration to the next, meaning the clusters have stabilized and the algorithm has converged.
The final result is a set of 'K' clusters, with each data point in the dataset belonging to one of them.
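Here is a minimal NumPy sketch of that loop. It is illustrative only, not the scikit-learn implementation used later; the names `points`, `k`, and `kmeans_sketch` are ours, and edge cases such as empty clusters are ignored for brevity.

import numpy as np

def kmeans_sketch(points, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick 'k' random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3 (assignment): Euclidean distance from every point to every
        # centroid, then assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4 (update): move each centroid to the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 5 (convergence): stop once the centroids have essentially stopped moving.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids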
Advantages and Disadvantages
- Advantages:
  - Simple to understand and implement.
  - Computationally efficient; scales well to large datasets.
  - Works well on datasets with spherical, well-separated clusters.
- Disadvantages:
  - You must specify the number of clusters, 'K', in advance (a common heuristic for choosing it, the elbow method, is sketched after this list).
  - The random initialization of centroids can lead to different final clusters. Restarting the algorithm several times, or using a smarter seeding scheme such as k-means++ (Scikit-learn's default), mitigates this.
  - It struggles with clusters of non-spherical shapes, varying sizes, and differing densities.
  - It is sensitive to outliers, which can pull centroids away from the true center of a cluster.
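A common heuristic for the first disadvantage is the elbow method: run K-Means for a range of K values and plot the inertia; the "elbow" where the curve flattens suggests a reasonable K. A minimal sketch (the data generation mirrors the walkthrough below, and the K range of 1 to 8 is an arbitrary choice):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
# Fit one model per candidate K and record its final inertia.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in range(1, 9)]
plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia (within-cluster SSE)")
plt.title("Elbow Method")
plt.show()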
# --- 1. Import Necessary Libraries ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# --- 2. Prepare Sample Data ---
# We'll use scikit-learn's `make_blobs` to create a dataset with clear,
# spherical clusters, which is the ideal use case for K-Means.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
# --- 3. Feature Scaling ---
# Since K-Means is a distance-based algorithm, its performance is affected by the
# scale of the features. It's good practice to scale the data to have a mean of 0
# and a standard deviation of 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# --- 4. Create and Train the K-Means Model ---
# Initialize the K-Means algorithm.
# `n_clusters=4`: We specify that we want to find 4 clusters.
# `n_init='auto'`: lets scikit-learn decide how many times to restart the algorithm with
# different centroid seeds, keeping the run with the lowest inertia. (With the default
# k-means++ seeding, a single well-chosen run is used.)
# `random_state=42`: Ensures the results are reproducible.
model = KMeans(n_clusters=4, n_init='auto', random_state=42)
# Train the model on the scaled data.
# For K-Means, `.fit()` finds the cluster centers and assigns each point a label.
model.fit(X_scaled)
print("--- Model Training Complete ---")
# Get the cluster assignments for each data point and the final centroid locations.
cluster_labels = model.labels_
centroids = model.cluster_centers_
print(f"\nFirst 10 cluster assignments: {cluster_labels[:10]}")
print(f"\nFinal Centroid Locations:\n{centroids}")
# --- 5. Visualize the Clustering Results ---
# This helps us see the groups the algorithm has found.
plt.figure(figsize=(10, 6))
# Plot the data points, colored by their assigned cluster label.
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=cluster_labels, s=50, cmap='viridis', alpha=0.7)
# Plot the final cluster centroids as red 'X's.
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, marker='X', label='Centroids')
plt.title("K-Means Clustering Results")
plt.xlabel("Feature 1 (Scaled)")
plt.ylabel("Feature 2 (Scaled)")
plt.legend()
plt.grid(True)
plt.show()
# --- 6. Predict the Cluster for New Data Points ---
# Create new, unseen data points.
new_data = np.array([[-2, -2], [3, 3], [0, 0]])
# Scale the new data using the same scaler that was fitted on the training data.
new_data_scaled = scaler.transform(new_data)
# Predict the cluster for the new data points.
new_predictions = model.predict(new_data_scaled)
print("\n--- Predictions for New Data Points ---")
for i, point in enumerate(new_data):
    print(f"Point {point} ==> Predicted Cluster: {new_predictions[i]}")