K-Nearest Neighbours (KNN) is one of the simplest and most intuitive supervised machine learning algorithms. It's a non-parametric, "lazy learning" method: it builds no internal model during training. Instead, it stores the entire training dataset and defers all computation to prediction time, classifying a new, unseen data point by its similarity to the stored examples.
The core idea is that a data point is likely to be similar to the data points that are closest to it in the feature space. The "K" in KNN refers to the number of nearest neighbours the algorithm will consider when making a prediction.
Analogy: Guessing a Person's Profession
Imagine you want to guess the profession of a new person in a neighbourhood. You don't know anything about them, but you do know the professions of all their neighbours. If you look at their 3 nearest neighbours (K=3) and find that two are doctors and one is a lawyer, you might predict that the new person is also a doctor. This is the essence of how KNN works.
How It Works for Classification
1. Choose a value for K: You decide how many neighbours to consider (e.g., K=3, K=5). This is a crucial hyperparameter; for binary classification, an odd K is often preferred because it avoids tied votes.
2. Calculate Distances: When a new, unlabeled data point is introduced, the algorithm calculates the distance between this new point and every point in the training dataset. The most common distance metric is the Euclidean distance, d(p, q) = sqrt(Σᵢ (pᵢ − qᵢ)²).
3. Identify the K-Nearest Neighbours: The algorithm identifies the 'K' training data points that are closest to the new point.
4. Majority Vote: The new data point is assigned to the class that is most common among its K-nearest neighbours.
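To make these four steps concrete, here is a minimal from-scratch sketch in NumPy. It is illustrative only: the function name knn_predict and the toy arrays are invented for this example, and in practice you would use a library implementation like the scikit-learn one shown later.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the K closest training points.
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two points of class 0, two of class 1.
X_demo = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([1.2, 1.5]), k=3))  # -> 0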
Detailed Code Example in Python
This example walks through a complete workflow for KNN classification, including the critical step of feature scaling. It assumes NumPy, Matplotlib, and scikit-learn are installed (e.g., pip install numpy matplotlib scikit-learn).
# --- 1. Import Necessary Libraries ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from matplotlib.colors import ListedColormap
# --- 2. Generate and Prepare Sample Data ---
# Let's create some sample data with two distinct classes.
np.random.seed(42)
# Class 0 data points
X0 = np.random.randn(50, 2) + np.array([2, 2])
y0 = np.zeros(50)
# Class 1 data points
X1 = np.random.randn(50, 2) + np.array([-2, -2])
y1 = np.ones(50)
# Combine the data
X = np.vstack((X0, X1))
y = np.hstack((y0, y1))
# Split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# --- 3. Data Preprocessing: Feature Scaling ---
# KNN is a distance-based algorithm. If one feature has a much larger scale than another
# (e.g., age vs. salary), it will dominate the distance calculation.
# Therefore, it's crucial to scale the features to a comparable range.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
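# --- Aside: why scaling matters (hypothetical numbers, not from this dataset) ---
# Suppose two people are described by (age, salary): (25, 50_000) and (26, 51_000).
# Their unscaled Euclidean distance is sqrt(1**2 + 1000**2) ≈ 1000.0005, so the
# salary difference of 1000 completely swamps the age difference of 1 and age is
# effectively ignored. StandardScaler gives every feature zero mean and unit
# variance, so each feature contributes comparably to the distance.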
# --- 4. Create and Train the KNN Model ---
# We choose K=5, meaning the model will look at the 5 nearest neighbors.
k = 5
model = KNeighborsClassifier(n_neighbors=k)
# "Training" in KNN is simple: the model just stores the training data.
model.fit(X_train_scaled, y_train)
print("--- Model Training Complete ---")
# --- 5. Make Predictions and Evaluate the Model ---
# Make predictions on the scaled test data.
y_pred = model.predict(X_test_scaled)
# Evaluate the model's accuracy.
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with K={k}: {accuracy * 100:.2f}%")
# --- 6. Visualize the Decision Boundary ---
# This helps us see how the model would classify any point in the feature space.
h = 0.02  # step size in the mesh
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Make predictions on the entire grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Create a color map
cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#0000FF'])
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, cmap=cmap_light)
# Plot the training points
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, cmap=cmap_bold, edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title(f'KNN Classification (K={k}) Decision Boundary', fontsize=16)
plt.xlabel('Feature 1 (Scaled)', fontsize=12)
plt.ylabel('Feature 2 (Scaled)', fontsize=12)
plt.show()
# --- Predict a new value ---
new_point = [[-1.5, -1.5]]  # A new point to classify, in the original (unscaled) feature space
# First, scale the new point using the same scaler
new_point_scaled = scaler.transform(new_point)
# Then, predict its class
predicted_class = model.predict(new_point_scaled)
print(f"\nPredicted class for new point {new_point[0]}: Class {int(predicted_class[0])}")