Random Forest Regression
Unit 2: Guide to Machine Learning Algorithms

Random Forest Regression is a powerful and widely used ensemble learning algorithm. The core idea behind an ensemble method is that by combining the predictions of several individual models, you can get a final prediction that is more accurate and robust than any of the individual models on their own.
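As a toy numerical illustration of this idea (a minimal sketch, not part of the tutorial script further below), the following snippet treats 50 noisy numbers as the outputs of 50 imperfect models and shows that their average typically lands much closer to the true value than the individual estimates do:

# Toy illustration: averaging many noisy estimates beats relying on any one of them.
import numpy as np

rng = np.random.default_rng(0)
true_value = 3.0
estimates = true_value + rng.normal(0, 1.0, size=50)  # 50 noisy "model" outputs

print(f"Average error of the individual estimates:  {np.mean(np.abs(estimates - true_value)):.3f}")
print(f"Error of their combined (averaged) estimate: {abs(estimates.mean() - true_value):.3f}")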

A Random Forest is essentially a collection of many Decision Trees. It addresses the main weakness of a single Decision Tree—its tendency to overfit the training data—by introducing randomness into the model-building process.

How it Works:

1.   Bootstrap Aggregating (Bagging): The algorithm creates multiple random subsets of the original training data (with replacement). Each of these smaller datasets is then used to train a separate Decision Tree. This means each tree in the "forest" learns from a slightly different set of data.

2.   Feature Randomness: When building each tree, at every split point, the algorithm doesn't consider all the features. Instead, it selects a random subset of features and only considers those for the split. This forces the trees to be different from one another and prevents them from all relying on the same few important features.

3.   Averaging Predictions: To make a prediction for a new data point, that point is passed down every single tree in the forest. Each tree makes its own prediction. The Random Forest then averages the predictions from all the individual trees to produce a single, final prediction.

This process of averaging the results from many diverse trees results in a model that is much less prone to overfitting and generally has a better predictive performance than a single, complex Decision Tree.
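To make steps 1 and 3 concrete, here is a minimal hand-rolled sketch that bags plain DecisionTreeRegressor models on bootstrap samples and averages their predictions. It is illustrative only: step 2 (feature randomness) has no effect with a single input feature, and in practice RandomForestRegressor handles all three steps for you (feature randomness via its max_features parameter), as the full example below shows.

# Illustrative sketch of bagging (step 1) and averaging (step 3) by hand.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = np.sort(10 * rng.random((100, 1)), axis=0)        # one feature
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=100)  # noisy sine target

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample (with replacement)
    trees.append(DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx]))

# Each tree makes its own prediction; the "forest" prediction is their average.
X_new = np.array([[5.0]])
print(f"Averaged prediction at X=5.0: {np.mean([t.predict(X_new) for t in trees]):.3f}")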

Code Example in Python

The following code breaks down the entire process step by step.

# --- 1. Import Necessary Libraries ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# --- 2. Generate and Visualize Non-Linear Sample Data ---
# We'll use the same sine wave data to compare with the single Decision Tree.
np.random.seed(42)
X = np.sort(10 * np.random.rand(100, 1), axis=0)
y = np.sin(X).ravel() + np.random.randn(100) * 0.2

# Split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 3. Create and Train the Random Forest Regressor Model ---
# Key Hyperparameters:
# - n_estimators: The number of trees in the forest. More trees generally improve accuracy but also increase training time.
# - max_depth: The maximum depth of each individual tree.
# - random_state: Ensures that the results are reproducible.
model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)

# Train the model using the training data.
# The .fit() method builds the entire forest of decision trees.
model.fit(X_train, y_train)
print("--- Model Training Complete ---")

# --- 4. Make Predictions and Evaluate the Model ---
# Use the trained model to make predictions on the test data.
y_pred = model.predict(X_test)

# Evaluate the model using R-squared.
r2 = r2_score(y_test, y_pred)
print(f"Model R-squared (R²) score: {r2:.4f}")
# Note: This score is typically higher and more reliable than a single Decision Tree's score.

# --- 5. Visualize the Results ---
# To plot the smoother curve of the Random Forest,
# we'll make predictions on a sorted range of X values.
X_plot = np.sort(X, axis=0)
y_plot_pred = model.predict(X_plot)

plt.figure(figsize=(10, 6))
# Plot the original data points
plt.scatter(X, y, color='darkorange', s=20, label="Actual Data")
# Plot the Random Forest regression curve
plt.plot(X_plot, y_plot_pred, color="teal", linewidth=3, label="Random Forest Fit (100 Trees)")

# Add titles and labels
plt.title('Random Forest Regression Fit', fontsize=16)
plt.xlabel('Feature (X)', fontsize=12)
plt.ylabel('Target (y)', fontsize=12)
plt.legend()
plt.grid(True)
plt.show()

# --- Predict a new value ---
new_value = [[5.0]] # Predict the output for an input of 5.0
predicted_value = model.predict(new_value)
print(f"\nPredicted value for X={new_value[0][0]}: {predicted_value[0]:.4f}")

 

 
