Scikit-learn: The Go-To Library for Machine Learning
Scikit-learn (imported as sklearn) is one of the most widely used open-source libraries for general-purpose machine learning in Python. Often described as a "Swiss Army knife" for data scientists, it provides a broad collection of efficient tools for data mining and data analysis, built on top of NumPy, SciPy, and Matplotlib.
Its primary strength is a clean, consistent API. Because every algorithm follows the same interface, you can implement, test, and compare different machine learning algorithms without learning a new interface for each one.
Key Concepts
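- The Estimator API: This is the core design principle of Scikit-learn. Every algorithm is exposed as an "estimator" object that shares a small, consistent set of methods. All estimators implement .fit(); which of the other methods are available depends on the estimator's role:
  - .fit(X, y): Trains the model. It takes the training data X (and the labels y for supervised learning) and learns the model's parameters from them.
  - .predict(X_new): Available on predictors such as classifiers, regressors, and some clustering models. Once the model is fitted, this method returns predictions for new, unseen data.
  - .transform(X): Available on transformers such as scalers and encoders. It applies a preprocessing step learned during .fit() to rescale or restructure data; .fit_transform() does both in one call.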
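- Data Representation: Scikit-learn expects feature data as a 2D array-like of shape (n_samples, n_features), typically a NumPy array or a Pandas DataFrame, where rows represent samples and columns represent features. Labels for supervised learning are usually a 1D array of length n_samples.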
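The consistency described above can be illustrated with a minimal sketch. The three classifiers here are arbitrary picks for illustration, and the settings (n_neighbors=3, random_state=0, max_iter=1000) are assumptions rather than recommendations. The point is only that the same fit/predict calls work for each of them.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Features are a 2D array (samples x features); labels are a 1D array
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)

# Three different algorithms, chosen only to show the shared interface
models = [
    KNeighborsClassifier(n_neighbors=3),
    DecisionTreeClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]

results = {}
for model in models:
    model.fit(X, y)               # same training call for every estimator
    preds = model.predict(X[:5])  # same prediction call for every estimator
    results[type(model).__name__] = preds
    print(type(model).__name__, preds)
```

Swapping one algorithm for another only changes the line that constructs the estimator; the surrounding code stays the same.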
Code Examples
To run these examples, you first need to install Scikit-learn:
pip install scikit-learn
1. Data Preprocessing: Scaling Features
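Many machine learning algorithms, especially distance-based ones such as K-Nearest Neighbors and models trained with gradient descent, perform better when numerical input features are on a comparable scale. Tree-based models, by contrast, are largely insensitive to feature scaling. StandardScaler is a common tool for this.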
- Example: Standardizing Data
This example scales the data so that each feature has a mean of 0 and a standard deviation of 1.
import numpy as np
from sklearn.preprocessing import StandardScaler
# Sample data with features of different scales (e.g., age and income)
data = np.array([[25, 50000],
                 [35, 80000],
                 [45, 62000],
                 [20, 30000]])
# Create a scaler object
scaler = StandardScaler()
# Fit the scaler to the data and transform it
scaled_data = scaler.fit_transform(data)
print("--- Original Data ---")
print(data)
print("\n--- Scaled Data (Mean=0, StdDev=1) ---")
print(scaled_data)
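To make the "mean of 0, standard deviation of 1" claim concrete, the sketch below checks the scaler's output against the formula z = (x - mean) / std, computed by hand with NumPy. It also shows a common pattern: fit the scaler once, then reuse the learned statistics on later data. The new_rows sample is a hypothetical value made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[25, 50000],
                  [35, 80000],
                  [45, 62000],
                  [20, 30000]], dtype=float)
new_rows = np.array([[30, 55000]], dtype=float)  # hypothetical later sample

# fit() only learns the per-feature mean and standard deviation
scaler = StandardScaler().fit(train)
print("Learned means:", scaler.mean_)
print("Learned std devs:", scaler.scale_)

# transform() applies z = (x - mean) / std using those learned values
scaled = scaler.transform(train)
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print("Matches manual formula:", np.allclose(scaled, manual))

# Later data is scaled with the statistics learned from the training data
new_scaled = scaler.transform(new_rows)
restored = scaler.inverse_transform(new_scaled)  # back to original units
print(new_scaled, restored)
```

Reusing the fitted scaler, rather than refitting it on new data, keeps later samples on the same scale the model was trained on.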
2. Classification: Predicting a Category
Classification is a supervised learning task where the goal is to predict a discrete label or category.
- Example: Classifying Iris Flowers
This classic example uses the Iris dataset to predict the species of an iris flower based on its sepal and petal measurements.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the classifier (K-Nearest Neighbors)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Make predictions on the test data
y_pred = knn.predict(X_test)
# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
# Predict the species of a single flower from its four measurements
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # sepal length, sepal width, petal length, petal width (cm)
prediction = knn.predict(new_flower)
print(f"Prediction for new flower: {iris.target_names[prediction[0]]}")
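A single train/test split gives one accuracy number that depends on how the split happened to fall. As a hedged variation on the example above, the sketch below chains a StandardScaler and the same K-Nearest Neighbors classifier into a pipeline and scores it with 5-fold cross-validation. The choice of 5 folds and the addition of scaling are assumptions made for illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The pipeline behaves like one estimator: scaling is refit inside each fold,
# so the held-out fold never influences the scaler's statistics
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# One accuracy score per fold
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

Because the pipeline exposes the same fit and predict methods as any other estimator, it can replace knn in the example above without other changes.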
3. Regression: Predicting a Continuous Value
Regression is a supervised learning task where the goal is to predict a continuous numerical value.
- Example: Predicting House Prices
This example uses a simple Linear Regression model to predict the price of a house based on its size.
import numpy as np
from sklearn.linear_model import LinearRegression
# Sample data: X is house size in sq. ft., y is price in thousands of INR
X_train = np.array([[1400], [1600], [1700], [1875], [2100]])
y_train = np.array([2450, 3120, 2790, 3080, 4000])
# Create and train the regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict the price of a new 2000 sq. ft. house
new_house_size = [[2000]]
predicted_price = model.predict(new_house_size)
print(f"Predicted price for a {new_house_size[0][0]} sq. ft. house: ₹{predicted_price[0]:,.2f}k")
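A fitted linear model is just a line, price = slope * size + intercept, and the learned values are available as attributes. The sketch below, using the same sample data, reads them from coef_ and intercept_, reproduces the prediction by hand, and reports R² on the training data. With only five points and no held-out data, that R² is an optimistic measure of fit, not an estimate of real-world accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_train = np.array([[1400], [1600], [1700], [1875], [2100]])
y_train = np.array([2450, 3120, 2790, 3080, 4000])

model = LinearRegression().fit(X_train, y_train)

# Learned parameters of the line: price = slope * size + intercept
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope: {slope:.3f} (thousand INR per sq. ft.), intercept: {intercept:.2f}")

# The manual calculation matches what predict() returns
manual = slope * 2000 + intercept
predicted = model.predict(np.array([[2000]]))[0]
print(f"manual: {manual:.2f}, predict(): {predicted:.2f}")

# R^2 on the training data (1.0 would be a perfect fit)
r2 = r2_score(y_train, model.predict(X_train))
print(f"Training R^2: {r2:.3f}")
```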
4. Clustering: Finding Unlabeled Groups
Clustering is an unsupervised learning task where the goal is to discover natural groupings in data without any pre-existing labels.
- Example: Grouping Customers
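This example uses the K-Means algorithm to group customers into two clusters based on their age and income. Because the features are not scaled here, income, which spans a much larger range than age, dominates the distance calculation.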
import numpy as np
from sklearn.cluster import KMeans
# Sample unlabeled data: [Age, Annual Income in thousands of INR]
customer_data = np.array([[22, 150],
                          [25, 180],
                          [55, 600],
                          [62, 750],
                          [28, 220],
                          [58, 550]])
# Create and train the clustering model, asking it to find 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(customer_data)
# See which cluster each customer was assigned to
print(f"Cluster assignments: {kmeans.labels_}")
# Predict the cluster for a new customer
new_customer = [[30, 250]]
prediction = kmeans.predict(new_customer)
print(f"New customer belongs to Cluster: {prediction[0]}")
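K-Means assigns points by Euclidean distance, so a feature with a larger numeric range carries more weight. As a sketch of one way to address this, the variation below standardizes both features before clustering and then converts the cluster centers back to the original units. Adding the scaling step is an assumption for illustration; on this small dataset the resulting groups are the same. The cluster numbers themselves (0 or 1) are arbitrary labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

customer_data = np.array([[22, 150],
                          [25, 180],
                          [55, 600],
                          [62, 750],
                          [28, 220],
                          [58, 550]])

# Scale first, then cluster; the pipeline applies both steps in order
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, random_state=42, n_init=10),
)
labels = pipeline.fit_predict(customer_data)
print("Cluster assignments:", labels)

# Cluster centers live in scaled space; convert them back to age and income
scaler = pipeline.named_steps["standardscaler"]
kmeans = pipeline.named_steps["kmeans"]
centers = scaler.inverse_transform(kmeans.cluster_centers_)
print("Cluster centers (age, income):")
print(centers)

# A new customer is scaled and assigned by the same pipeline
new_label = pipeline.predict(np.array([[30, 250]]))[0]
print("New customer belongs to Cluster:", new_label)
```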