Regression algorithms are a family of supervised machine learning methods used to predict a continuous numerical value, such as the price of a house, tomorrow's temperature, or a company's stock price. The goal is to find a mathematical function that best maps the input features to the continuous output variable.
Python's Scikit-learn (sklearn) library provides easy-to-use implementations of all of these models.
1. Linear Regression
This is the simplest and most common regression algorithm. It assumes a linear relationship between the input features (X) and the output variable (y). The model's goal is to find the best-fitting straight line (or hyperplane in higher dimensions) that describes the data.
- Use Case: Predicting a value when you believe the relationship between variables is straightforward and linear (e.g., predicting a student's exam score based on the number of hours they studied).
- Code Example:
import numpy as np
from sklearn.linear_model import LinearRegression
# Features (e.g., size of house in sq. ft.)
X = np.array([[1400], [1600], [1700], [1875], [2100]])
# Labels (e.g., price in thousands of INR)
y = np.array([2450, 3120, 2790, 3080, 4000])
# Create and train the model
model = LinearRegression()
model.fit(X, y)
# Predict the price of a new 2000 sq. ft. house
new_house_size = [[2000]]
predicted_price = model.predict(new_house_size)
print(f"Predicted price for a {new_house_size[0][0]} sq. ft. house: ₹{predicted_price[0]:,.2f}k")
2. Ridge Regression
Ridge Regression is a variation of Linear Regression that adds L2 regularization. Regularization is a technique for preventing overfitting (when a model learns the training data too well and performs poorly on new data). Ridge does this by adding a penalty term to the cost function, proportional to the sum of the squared coefficients, which discourages the model from having overly large coefficients.
- Use Case: Useful when you have a large number of features, especially when some of them are correlated (multicollinearity).
- Code Example:
import numpy as np
from sklearn.linear_model import Ridge
# (Using the same data as Linear Regression)
X = np.array([[1400], [1600], [1700], [1875], [2100]])
y = np.array([2450, 3120, 2790, 3080, 4000])
# Create and train the model
# The 'alpha' parameter controls the strength of the regularization
model = Ridge(alpha=1.0)
model.fit(X, y)
# Predict the price
predicted_price = model.predict([[2000]])
print(f"Predicted price (Ridge): ₹{predicted_price[0]:,.2f}k")
3. Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) is another variation of Linear Regression that uses L1 regularization, penalizing the sum of the absolute values of the coefficients. A key feature of Lasso is that it can shrink the coefficients of less important features to exactly zero, effectively performing automatic feature selection.
- Use Case: Excellent when you suspect that many of your input features are irrelevant or redundant.
- Code Example:
import numpy as np
from sklearn.linear_model import Lasso
# (Using the same data as Linear Regression)
X = np.array([[1400], [1600], [1700], [1875], [2100]])
y = np.array([2450, 3120, 2790, 3080, 4000])
# Create and train the model
model = Lasso(alpha=1.0)
model.fit(X, y)
# Predict the price
predicted_price = model.predict([[2000]])
print(f"Predicted price (Lasso): ₹{predicted_price[0]:,.2f}k")
4. Decision Tree Regressor
A Decision Tree builds a model in the form of a tree structure. It splits the dataset into smaller and smaller subsets through a series of if-then-else decisions on the input features. For regression, the prediction is the average of the training target values in the terminal "leaf" node a sample falls into.
- Use Case: Good for capturing non-linear relationships in the data. It's easy to interpret and visualize.
- Code Example:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# (Using the same data as Linear Regression)
X = np.array([[1400], [1600], [1700], [1875], [2100]])
y = np.array([2450, 3120, 2790, 3080, 4000])
# Create and train the model
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)
# Predict the price
predicted_price = model.predict([[2000]])
print(f"Predicted price (Decision Tree): ₹{predicted_price[0]:,.2f}k")
5. Random Forest Regressor
A Random Forest is an ensemble method that builds many Decision Trees on random subsets of the data and averages their predictions to get a more accurate and stable result. It is one of the most popular and powerful machine learning algorithms because it corrects for a single Decision Tree's habit of overfitting.
- Use Case: Excellent for complex regression problems where high accuracy is needed and you want to avoid overfitting.
- Code Example:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# (Using the same data as Linear Regression)
X = np.array([[1400], [1600], [1700], [1875], [2100]])
y = np.array([2450, 3120, 2790, 3080, 4000])
# Create and train the model
# n_estimators is the number of trees in the forest
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
# Predict the price
predicted_price = model.predict([[2000]])
print(f"Predicted price (Random Forest): ₹{predicted_price[0]:,.2f}k")
6. Support Vector Regressor (SVR)
Support Vector Machines can also be used for regression. The goal of SVR is to find a function that deviates from the target values by a value no greater than a specified margin (epsilon), while being as "flat" as possible. It's effective in high-dimensional spaces and when the number of features is greater than the number of samples.
- Use Case: Good for high-dimensional data and problems where you are not expecting a simple linear fit.
- Code Example:
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
# (Using the same data as Linear Regression)
# SVR is sensitive to feature scaling, so we scale the data first
X = np.array([[1400], [1600], [1700], [1875], [2100]])
y = np.array([2450, 3120, 2790, 3080, 4000])
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_scaled = scaler_X.fit_transform(X)
y_scaled = scaler_y.fit_transform(y.reshape(-1, 1)).ravel()
# Create and train the model
model = SVR(kernel='linear')
model.fit(X_scaled, y_scaled)
# Predict the price (must scale the input and inverse_transform the output)
new_house_scaled = scaler_X.transform([[2000]])
predicted_price_scaled = model.predict(new_house_scaled)
predicted_price = scaler_y.inverse_transform(predicted_price_scaled.reshape(-1, 1))
print(f"Predicted price (SVR): ₹{predicted_price[0][0]:,.2f}k")