Polynomial Regression
Unit 2: Guide to Machine Learning Algorithms

Polynomial Regression is a regression algorithm that models the relationship between an independent variable (X) and a dependent variable (y) as an nth-degree polynomial. Although it is fitted with the machinery of linear regression (the model is linear in its coefficients), it is used to model non-linear relationships.

The basic idea is to take a standard linear model and make it more flexible by adding polynomial terms (like x², x³, etc.) as new features. This allows the model to fit a curved line to the data instead of a straight one.

How it Works in the Code Below:

1.   PolynomialFeatures(degree=3): This is the key step. It takes the original feature X and transforms it. For each value x in X, it generates a new set of features: [1, x, x², x³] (a short sketch of this transformation follows this list).

2.   LinearRegression(): After the features have been transformed, a standard LinearRegression model is fitted to this new, expanded set of features. The model is still "linear" because it's a linear combination of these new features (β₀ + β₁x + β₂x² + ...), but the resulting curve that relates the original X to y is a non-linear polynomial.
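To make the transformation concrete, here is a minimal sketch of what PolynomialFeatures produces; the toy values in X_toy are made up purely for illustration.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A toy feature column with three sample values (illustrative only)
X_toy = np.array([[1.0], [2.0], [3.0]])

# degree=3 expands each value x into [1, x, x^2, x^3]
poly = PolynomialFeatures(degree=3)
X_toy_poly = poly.fit_transform(X_toy)

print(X_toy_poly)
# [[ 1.  1.  1.  1.]
#  [ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]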

Example:

# --- 1. Import Necessary Libraries ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# --- 2. Generate and Visualize Non-Linear Sample Data ---
# We'll create data that follows a clear curve, making it unsuitable for simple linear regression.
np.random.seed(0)
X = 2 - 3 * np.random.normal(0, 1, 100)
y = X - 2 * (X ** 2) + 0.5 * (X ** 3) + np.random.normal(-3, 3, 100)

# Reshape X to be a 2D array, which is required by scikit-learn
X = X[:, np.newaxis]
y = y[:, np.newaxis]

# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 3. The Core of Polynomial Regression: Feature Transformation ---
# This is the key step that separates polynomial from simple linear regression.
# We will create new features by raising the original feature to a power (e.g., X^2, X^3).

# We choose the degree of the polynomial. This is a hyperparameter you can tune.
# A degree of 2 creates a quadratic model (y = ax^2 + bx + c).
# Let's try degree 3, since our data was generated with a cubic function.
degree = 3
polynomial_features = PolynomialFeatures(degree=degree)

# Use `fit_transform` on the training data to create the polynomial features.
# It transforms our single feature [X] into [1, X, X^2, X^3].
X_poly_train = polynomial_features.fit_transform(X_train)

# We only use `transform` on the test data because we want to apply the
# same transformation that was learned from the training data.
X_poly_test = polynomial_features.transform(X_test)

print(f"Original shape of X_train: {X_train.shape}")
print(f"Shape of X_train after polynomial transformation (degree={degree}): {X_poly_train.shape}")

# --- 4. Train a Linear Model on the Transformed Features ---
# Even though the final curve is non-linear, the model itself is a LinearRegression model.
# It's "linear" because it's a linear combination of our new features (1, X, X^2, X^3).
model = LinearRegression()
model.fit(X_poly_train, y_train)
print("\n--- Model Training Complete ---")

# --- 5. Make Predictions and Evaluate the Model ---
y_poly_pred = model.predict(X_poly_test)

# R-squared is a good metric to evaluate regression models. It measures how much of the
# variance in the target variable is explained by the model. Closer to 1 is better.
r2 = r2_score(y_test, y_poly_pred)
print(f"Model R-squared (R²) score: {r2:.4f}")

# --- 6. Visualize the Results ---
# To plot a smooth curve, we'll make predictions on a sorted range of X values.
X_plot = np.sort(X, axis=0)
X_plot_poly = polynomial_features.transform(X_plot)
y_plot_pred = model.predict(X_plot_poly)

plt.figure(figsize=(10, 6))
# Plot the original data points
plt.scatter(X, y, color='blue', s=20, label="Actual Data")
# Plot the polynomial regression curve
plt.plot(X_plot, y_plot_pred, color='red', linewidth=3, label=f"Polynomial Fit (degree={degree})")

# Add titles and labels
plt.title('Polynomial Regression Fit', fontsize=16)
plt.xlabel('Feature (X)', fontsize=12)
plt.ylabel('Target (y)', fontsize=12)
plt.legend()
plt.grid(True)
plt.show()

# --- Predict a new value ---
new_value = [[1.5]] # Predict the output for an input of 1.5
# First, transform the new value into polynomial features
new_value_poly = polynomial_features.transform(new_value)
# Then, predict using the trained model
predicted_value = model.predict(new_value_poly)
print(f"\nPredicted value for X={new_value[0][0]}: {predicted_value[0][0]:.4f}")

 

The degree Parameter

The degree is the most important hyperparameter in Polynomial Regression. It controls the complexity of the curve the model can fit.

  • Low Degree (e.g., 1): This is just a simple linear regression (a straight line). It might underfit the data if the relationship is truly curved, meaning it's too simple to capture the underlying pattern.

  • Good Degree (e.g., 2-5): A moderate degree can capture a wide range of curved relationships effectively. In the code above, degree=3 was chosen, which matches the cubic function used to generate the data (a short comparison across degrees follows this list).

  • High Degree (e.g., 15): A very high degree will create an extremely flexible curve that tries to pass through every single data point. This will almost certainly lead to overfitting, where the model learns the noise in the training data perfectly but fails to generalize to new, unseen data.
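To make the underfitting/overfitting trade-off concrete, a small sketch like the following (reusing the train/test split from the example above) compares training and test R² for several degrees. The exact scores depend on the data, but degree 1 typically underfits, while very high degrees start to fit noise and the test score drops.

# Compare R² on training and test data for several polynomial degrees (illustrative sketch).
for d in [1, 2, 3, 5, 10, 15]:
    poly = PolynomialFeatures(degree=d)
    Xp_train = poly.fit_transform(X_train)
    Xp_test = poly.transform(X_test)
    lr = LinearRegression().fit(Xp_train, y_train)
    print(f"degree={d:2d}  train R²={lr.score(Xp_train, y_train):.3f}  test R²={lr.score(Xp_test, y_test):.3f}")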

Pros and Cons

✅ Pros

  • Flexibility: It can model a wide variety of non-linear relationships.

  • Simple to Implement: It's a straightforward extension of linear regression.

  • Provides a Good Approximation: It can offer a good approximation of the relationship between the independent and dependent variables.

❌ Cons

  • Prone to Overfitting: The biggest drawback. It's very easy to choose a degree that is too high, leading to a model that doesn't generalize well.

  • Sensitive to Outliers: Like linear regression, outliers can significantly skew the fit of the polynomial curve.

  • Computationally Expensive: As the degree increases, the number of features grows, which can make the model training process more computationally intensive, especially when there are several input features (see the sketch below).
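As an illustration of that growth, the sketch below assumes a hypothetical dataset with 5 input features and prints how many columns PolynomialFeatures generates at various degrees (via its n_output_features_ attribute).

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: 10 samples, 5 features (values are irrelevant here).
X_demo = np.random.rand(10, 5)

# The number of generated polynomial and interaction terms grows quickly
# with the degree when there is more than one input feature.
for d in [2, 3, 5, 10]:
    n_features = PolynomialFeatures(degree=d).fit(X_demo).n_output_features_
    print(f"degree={d:2d} -> {n_features} features")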

 
