Scikit-learn

Unit:1 Foundations of Python and Its Applications in Machine Learning

Scikit-learn: The Go-To Library for Machine Learning

Scikit-learn (imported as sklearn) is one of the most widely used open-source libraries for general-purpose machine learning in Python. Often described as a "Swiss Army knife" for data scientists, it provides efficient tools for preprocessing, classification, regression, and clustering, and is built on top of NumPy, SciPy, and Matplotlib.

Its primary strength is a clean, consistent API: different machine learning algorithms can be implemented, tested, and compared without learning a new interface for each one.

Key Concepts

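The Estimator API: This is the core design principle of Scikit-learn. Every algorithm is exposed as an "estimator" object, and estimators share a small, consistent set of methods (not every estimator implements all of them):

.fit(data, labels): Trains the model. It takes the training data (plus labels for supervised learning) and learns the underlying patterns. Unsupervised estimators are fitted on the data alone.
.predict(new_data): Available on predictors such as classifiers, regressors, and some clustering models. Once the model is trained, it makes predictions on new, unseen data.
.transform(data): Available on transformers such as scalers and encoders. It applies a preprocessing step learned during fit to clean or restructure data.
.fit_transform(data): A shortcut on transformers that fits on the data and then transforms that same data in one call.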
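
The shared interface described above can be sketched as follows. This is a minimal illustration rather than one of the original examples: the choice of LogisticRegression and DecisionTreeClassifier, and the small synthetic dataset from make_classification, are assumptions made only to show that different estimators are driven by the same calls.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset (illustrative only): 100 samples, 4 features
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# A transformer learns its parameters with fit() and applies them with transform()
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Two different predictors are trained and queried through the same two calls
predictions = {}
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    model.fit(X_scaled, y)
    predictions[type(model).__name__] = model.predict(X_scaled[:5])

for name, preds in predictions.items():
    print(name, preds)
```

Swapping in another classifier should only require changing the objects listed in the loop; the fit and predict calls stay the same.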

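Data Representation: Scikit-learn expects feature data as a 2D array-like structure, typically a NumPy array or Pandas DataFrame of shape (n_samples, n_features), where rows represent samples and columns represent features. Labels are usually a 1D array with one entry per sample.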
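
A short sketch of the expected layout, using made-up numbers chosen only for illustration. A single feature stored as a flat array has to be reshaped into one column before it can be passed to an estimator.

```python
import numpy as np

# Four samples of a single feature, stored as a 1D array
sizes = np.array([1400, 1600, 1700, 1875])

# Reshape to one column so that each row is a sample
X = sizes.reshape(-1, 1)

# Two features per sample: the array is already 2D
X_two_features = np.array([[25, 50000], [35, 80000], [45, 62000], [20, 30000]])

print(sizes.shape, X.shape, X_two_features.shape)
# prints: (4,) (4, 1) (4, 2)
```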

Code Examples

To run these examples, you first need to install Scikit-learn:

pip install scikit-learn

1. Data Preprocessing: Scaling Features

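Many machine learning algorithms, particularly distance-based ones such as K-Nearest Neighbors and models trained by gradient descent, perform better when numerical input features are on comparable scales. Tree-based models are largely insensitive to scaling. StandardScaler is a common tool for this.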

Example: Standardizing Data

This example scales the data so that each feature has a mean of 0 and a standard deviation of 1.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Sample data with features of different scales (e.g., age and income)
data = np.array([[25, 50000],
                 [35, 80000],
                 [45, 62000],
                 [20, 30000]])

# Create a scaler object
scaler = StandardScaler()

# Fit the scaler to the data and transform it
scaled_data = scaler.fit_transform(data)

print("--- Original Data ---")
print(data)

print("\n--- Scaled Data (Mean=0, StdDev=1) ---")
print(scaled_data)
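
To make the "mean of 0, standard deviation of 1" claim concrete, the following sketch compares the scaler's output with a manual calculation on the same sample data. It assumes the population standard deviation (NumPy's default), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[25, 50000], [35, 80000], [45, 62000], [20, 30000]], dtype=float)

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Manual standardization: subtract each column's mean, divide by its standard deviation
manual = (data - data.mean(axis=0)) / data.std(axis=0)

# The two results should agree up to floating-point error
print("Matches manual calculation:", np.allclose(scaled_data, manual))
print("Column means:", scaled_data.mean(axis=0).round(6))
print("Column standard deviations:", scaled_data.std(axis=0).round(6))
```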

2. Classification: Predicting a Category

Classification is a supervised learning task where the goal is to predict a discrete label or category.

Example: Classifying Iris Flowers

This classic example uses the Iris dataset to predict the species of an iris flower based on its sepal and petal measurements.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the classifier (K-Nearest Neighbors)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Predict a new, unseen flower
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # Measurements for a new flower
prediction = knn.predict(new_flower)
print(f"Prediction for new flower: {iris.target_names[prediction[0]]}")
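
Because K-Nearest Neighbors is distance-based, the scaling step from section 1 can be combined with the classifier. The sketch below uses make_pipeline to chain the two. It is one possible arrangement and not part of the original example, and whether scaling changes the accuracy on this particular dataset is not something the example claims.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training data only, then trains the classifier
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipeline.fit(X_train, y_train)

# score() applies the same scaling to the test data before predicting
test_accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {test_accuracy * 100:.2f}%")
```

Fitting the scaler inside the pipeline keeps test-set statistics out of training, which a separate fit_transform on the full dataset would not.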

3. Regression: Predicting a Continuous Value

Regression is a supervised learning task where the goal is to predict a continuous numerical value.

Example: Predicting House Prices

This example uses a simple Linear Regression model to predict the price of a house based on its size.

import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data: [Size in sq. ft.], [Price in thousands of INR]
X_train = np.array([[1400], [1600], [1700], [1875], [2100]])
y_train = np.array([2450, 3120, 2790, 3080, 4000])

# Create and train the regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the price of a new 2000 sq. ft. house
new_house_size = [[2000]]
predicted_price = model.predict(new_house_size)

print(f"Predicted price for a {new_house_size[0][0]} sq. ft. house: ₹{predicted_price[0]:,.2f}k")
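
A fitted LinearRegression stores the line it learned in its coef_ and intercept_ attributes. The sketch below, reusing the same sample data, shows that the prediction for a 2000 sq. ft. house is simply slope × size + intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1400], [1600], [1700], [1875], [2100]])
y_train = np.array([2450, 3120, 2790, 3080, 4000])

model = LinearRegression()
model.fit(X_train, y_train)

# With a single feature, coef_ holds one slope: price change per extra sq. ft.
slope = model.coef_[0]
intercept = model.intercept_

# Reproduce the model's prediction by hand
manual_prediction = slope * 2000 + intercept
library_prediction = model.predict([[2000]])[0]

print(f"Slope: {slope:.4f}, Intercept: {intercept:.2f}")
print(f"Manual: {manual_prediction:.2f}, model.predict: {library_prediction:.2f}")
```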

4. Clustering: Finding Unlabeled Groups

Clustering is an unsupervised learning task where the goal is to discover natural groupings in data without any pre-existing labels.

Example: Grouping Customers

This example uses the K-Means algorithm to group customers into two clusters based on their age and income.

import numpy as np
from sklearn.cluster import KMeans

# Sample unlabeled data: [Age, Annual Income in thousands of INR]
customer_data = np.array([[22, 150],
                          [25, 180],
                          [55, 600],
                          [62, 750],
                          [28, 220],
                          [58, 550]])

# Create and train the clustering model, asking it to find 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(customer_data)

# See which cluster each customer was assigned to
print(f"Cluster assignments: {kmeans.labels_}")

# Predict the cluster for a new customer
new_customer = [[30, 250]]
prediction = kmeans.predict(new_customer)

print(f"New customer belongs to Cluster: {prediction[0]}")
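
The cluster numbers themselves are arbitrary labels, so it helps to look at what each cluster contains. The sketch below reuses the same customer data and prints each cluster's center (the mean age and income of its members) from the cluster_centers_ attribute. Note that income spans a much larger numeric range than age here, so it dominates the distance calculation; scaling the features first, as in section 1, would be a reasonable variation.

```python
import numpy as np
from sklearn.cluster import KMeans

customer_data = np.array([[22, 150], [25, 180], [55, 600],
                          [62, 750], [28, 220], [58, 550]])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(customer_data)

# Each row of cluster_centers_ is the mean [age, income] of one cluster
for cluster_id, center in enumerate(kmeans.cluster_centers_):
    members = customer_data[kmeans.labels_ == cluster_id]
    print(f"Cluster {cluster_id}: center age={center[0]:.1f}, "
          f"income={center[1]:.1f}k, members={len(members)}")
```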
