Linear Regression is a fundamental supervised learning algorithm used for predicting a continuous numerical value (like price, temperature, or age). It works by assuming a linear relationship between the input features (independent variables) and the output (dependent variable).
The goal of the algorithm is to find the "line of best fit" for the data points. In a simple case with one input feature, this is a straight line. In cases with multiple features, it's a plane or a hyperplane. The line rarely passes through every point; it is chosen because it minimizes the sum of the squared vertical distances between itself and the actual data points.
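To make the "plane" case concrete, here is a minimal sketch with two input features. The data, the variable names, and the true relationship (y = 3*x1 - 2*x2 + 5) are all made up for illustration, and NumPy's general least-squares solver stands in for whatever a particular library does internally.

```python
import numpy as np

# Hypothetical data: two features, with the target generated exactly as
# y = 3*x1 - 2*x2 + 5 so the best-fitting plane is known in advance.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

# Append a column of ones so the intercept is estimated with the coefficients.
X_design = np.column_stack([X, np.ones(len(X))])

# Least-squares solution of X_design @ params ≈ y.
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
w1, w2, intercept = params

print(f"w1 = {w1:.2f}, w2 = {w2:.2f}, intercept = {intercept:.2f}")
```

Because this toy data has no noise, the fit should recover the coefficients it was generated from. With real data the recovered values would only approximate the underlying relationship.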
How it works: The model calculates the optimal values for the coefficient (the slope of the line) and the intercept (where the line crosses the y-axis). The equation for a simple linear regression line is:
y = mx + c
- y is the predicted value.
- m is the coefficient (slope).
- x is the input feature.
- c is the intercept.
The algorithm finds the values of m and c that give the smallest possible sum of squared errors. Each error (also called a residual) is the vertical distance between a data point and the regression line.
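For a single feature, the values of m and c that minimize the sum of squared errors have a well-known closed form: m is the covariance of x and y divided by the variance of x, and c follows from the fact that the fitted line passes through the point of means. The sketch below applies those formulas to a tiny made-up dataset (the numbers are hypothetical and lie exactly on y = 2x + 1).

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x + 1 (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: covariance of x and y divided by the variance of x.
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: the fitted line always passes through (x_mean, y_mean).
c = y_mean - m * x_mean

# Sum of squared errors (residuals) for the fitted line.
residuals = y - (m * x + c)
sse = np.sum(residuals ** 2)

print(f"m = {m:.2f}, c = {c:.2f}, SSE = {sse:.2f}")
```

Here the fit recovers m = 2 and c = 1 with a sum of squared errors of zero, because the points sit exactly on a line. With noisy data the same formulas apply, but the sum of squared errors is greater than zero.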
Detailed Code Example in Python
This example will walk you through a complete machine learning workflow for Linear Regression using Python's Scikit-learn library. We will:
1. Generate sample data.
2. Split the data for training and testing.
3. Train a Linear Regression model.
4. Make predictions.
5. Evaluate the model's performance.
6. Visualize the results.
# --- 1. Import Necessary Libraries ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# --- 2. Generate and Prepare Sample Data ---
# Let's create some sample data where the relationship is roughly linear.
# We'll predict a person's salary based on their years of experience.
np.random.seed(42) # for reproducible results
years_experience = np.random.rand(50, 1) * 10 # 50 samples, 0-10 years
# Salary = base + (experience * factor) + some random noise
salary = 30000 + (years_experience * 5000) + np.random.randn(50, 1) * 4000
# Split the data into training and testing sets.
# The model will learn from the training set and be evaluated on the unseen testing set.
# test_size=0.2 means 20% of the data will be used for testing.
X_train, X_test, y_train, y_test = train_test_split(years_experience, salary, test_size=0.2, random_state=42)
# --- 3. Create and Train the Linear Regression Model ---
# Create an instance of the LinearRegression model
model = LinearRegression()
# Train the model using the training data.
# The .fit() method finds the optimal coefficient and intercept.
model.fit(X_train, y_train)
print("--- Model Training Complete ---")
print(f"Intercept (c): {model.intercept_[0]:,.2f}")
print(f"Coefficient (m): {model.coef_[0][0]:,.2f}")
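# The learned equation is: Salary = m * Experience + c, using the two values printed above.
# Because the data was generated with a slope of 5000 and a base of 30000,
# the fitted m and c should land close to those values (not exactly, due to the noise).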
# --- 4. Make Predictions ---
# Use the trained model to make predictions on the test data
y_pred = model.predict(X_test)
# --- 5. Evaluate the Model's Performance ---
# We'll use two common regression metrics:
# Mean Squared Error (MSE): The average of the squared differences between actual and predicted values. Lower is better.
# R-squared (R²): The proportion of the variance in the dependent variable that is predictable from the independent variable(s). Closer to 1 is better.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("\n--- Model Evaluation ---")
print(f"Mean Squared Error (MSE): {mse:,.2f}")
print(f"R-squared (R²): {r2:.4f}")
# --- 6. Visualize the Results ---
# Let's plot the original data points and the regression line our model learned.
plt.figure(figsize=(10, 6))
# Scatter plot of the actual test data
plt.scatter(X_test, y_test, color='blue', label='Actual Data (test set)')
# Plot the regression line using the predictions
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression Line')
# Add titles and labels for clarity
plt.title('Salary vs. Years of Experience', fontsize=16)
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary (in INR)', fontsize=12)
plt.legend()
plt.grid(True)
plt.show()
# --- Predict a new value ---
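# Predict salary for someone with 12 years of experience.
# Note: the training data only covers 0-10 years, so this is an extrapolation
# and relies on the linear trend continuing beyond the observed range.
new_experience = [[12]]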
predicted_salary = model.predict(new_experience)
print(f"\nPredicted salary for {new_experience[0][0]} years of experience: ₹{predicted_salary[0][0]:,.2f}")
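The two metrics used in the evaluation step can also be computed by hand, which helps show what they measure. The sketch below uses a small set of made-up actual and predicted values (hypothetical numbers, not output from the model above) and NumPy only.

```python
import numpy as np

# Hypothetical actual and predicted values (illustration only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: the mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse_manual:.3f}")
print(f"R-squared: {r2_manual:.3f}")
```

For these numbers the squared residuals sum to 1.5, giving an MSE of 0.375 and an R² of 0.925. Passing the same arrays to `mean_squared_error` and `r2_score` from Scikit-learn should give matching values.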