K-Nearest Neighbours (KNN) is one of the simplest and most intuitive supervised machine learning algorithms. It's a non-parametric, "lazy learning" method: it builds no internal model during training. Instead, it stores the entire training dataset and defers all computation to prediction time, classifying a new, unseen data point by its similarity to the stored examples.
The core idea is that a data point is likely to be similar to the data points that are closest to it in the feature space. The "K" in KNN refers to the number of nearest neighbours the algorithm will consider when making a prediction.
Analogy: Guessing a Person's Profession
Imagine you want to guess the profession of a new person in a neighbourhood. You don't know anything about them, but you do know the professions of all their neighbours. If you look at their 3 nearest neighbours (K=3) and find that two are doctors and one is a lawyer, you might predict that the new person is also a doctor. This is the essence of how KNN works.
How It Works for Classification
1. Choose a value for K: You decide how many neighbours to consider (e.g., K=3, K=5). This is a crucial hyperparameter; for binary classification, an odd K is often preferred because it avoids tied votes.
2. Calculate Distances: When a new, unlabeled data point is introduced, the algorithm calculates the distance between this new point and every point in the training dataset. The most common distance metric is the Euclidean distance, d(p, q) = sqrt(Σᵢ (pᵢ − qᵢ)²).
3. Identify the K-Nearest Neighbours: The algorithm identifies the 'K' training data points that are closest to the new point.
4. Majority Vote: The new data point is assigned to the class that is most common among its K-nearest neighbours.
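To make these four steps concrete, here is a minimal from-scratch sketch in NumPy. It is illustrative only: the function name knn_predict and the toy arrays are invented for this example, and in practice you would use a library implementation like the scikit-learn one shown later.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the K closest training points.
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two points of class 0, two of class 1.
X_demo = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([1.2, 1.5]), k=3))  # -> 0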
Detailed Code Example in Python
This example walks through a complete workflow for KNN classification, including the critical step of feature scaling. It assumes NumPy, Matplotlib, and scikit-learn are installed (e.g., pip install numpy matplotlib scikit-learn).
# --- 1. Import Necessary Libraries ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from matplotlib.colors import ListedColormap
# --- 2. Generate and Prepare Sample Data ---
# Let's create some sample data with two distinct classes.
np.random.seed(42)
# Class 0 data points
X0 = np.random.randn(50, 2) + np.array([2, 2])
y0 = np.zeros(50)
# Class 1 data points
X1 = np.random.randn(50, 2) + np.array([-2, -2])
y1 = np.ones(50)
# Combine the data
X = np.vstack((X0, X1))
y = np.hstack((y0, y1))
# Split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# --- 3. Data Preprocessing: Feature Scaling ---
# KNN is a distance-based algorithm. If one feature has a much larger scale than another
# (e.g., age vs. salary), it will dominate the distance calculation.
# Therefore, it's crucial to scale the features to a comparable range.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
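# --- Aside: why scaling matters (hypothetical numbers, not from this dataset) ---
# Suppose two people are described by (age, salary): (25, 50_000) and (26, 51_000).
# Their unscaled Euclidean distance is sqrt(1**2 + 1000**2) ≈ 1000.0005, so the
# salary difference of 1000 completely swamps the age difference of 1 and age is
# effectively ignored. StandardScaler gives every feature zero mean and unit
# variance, so each feature contributes comparably to the distance.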
# --- 4. Create and Train the KNN Model ---
# We choose K=5, meaning the model will look at the 5 nearest neighbors.
k = 5
model = KNeighborsClassifier(n_neighbors=k)
# "Training" in KNN is simple: the model just stores the training data.
model.fit(X_train_scaled, y_train)
print("--- Model Training Complete ---")
# --- 5. Make Predictions and Evaluate the Model ---
# Make predictions on the scaled test data.
y_pred = model.predict(X_test_scaled)
# Evaluate the model's accuracy.
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with K={k}: {accuracy * 100:.2f}%")
# --- 6. Visualize the Decision Boundary ---
# This helps us see how the model would classify any point in the feature space.
h = 0.02  # step size in the mesh
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Make predictions on the entire grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Create a color map
cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#0000FF'])
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, cmap=cmap_light)
# Plot the training points
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, cmap=cmap_bold, edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title(f'KNN Classification (K={k}) Decision Boundary', fontsize=16)
plt.xlabel('Feature 1 (Scaled)', fontsize=12)
plt.ylabel('Feature 2 (Scaled)', fontsize=12)
plt.show()
# --- Predict a new value ---
new_point = [[-1.5, -1.5]]  # A new point to classify, in the original (unscaled) feature space
# First, scale the new point using the same scaler
new_point_scaled = scaler.transform(new_point)
# Then, predict its class
predicted_class = model.predict(new_point_scaled)
print(f"\nPredicted class for new point {new_point[0]}: Class {int(predicted_class[0])}")