A Decision Tree is a supervised learning algorithm that is highly intuitive because it mimics human decision-making. It creates a model that looks like a flowchart or a tree structure, where each internal node represents a "test" on a feature, each branch represents the outcome of the test, and each leaf node represents a final decision or class label.
How a Decision Tree Learns
The core process of building a decision tree is called recursive partitioning. The algorithm starts with the entire dataset at the root and repeatedly splits it into smaller, more homogeneous subgroups.
1. Finding the Best Split: At each node, the algorithm evaluates every possible split on every feature. A "split" is a question like, "Is the credit_score <= 680?". The goal is to find the split that does the best job of separating the data into "pure" subgroups, where each subgroup ideally contains samples from only one class.
2. Measuring Purity (Gini Impurity & Entropy): To decide which split is "best," the algorithm uses a metric to measure the impurity or disorder of a group of samples. The two most common metrics are:
o Gini Impurity: Measures the probability of misclassifying a randomly chosen element from a subset if it were labeled randomly according to the distribution of class labels in that subset. A Gini score of 0 represents perfect purity (all elements belong to one class).
o Entropy and Information Gain: Entropy is a measure of disorder or uncertainty; a split that produces subgroups with low entropy is a good split. The algorithm calculates the Information Gain of each potential split, which is the reduction in entropy the split achieves, and chooses the split with the highest information gain (a small numeric sketch follows this list).
3. Recursive Splitting: Once the best split is found, the data is divided into child nodes. The algorithm then repeats the process for each child node, recursively finding the best split for each new subgroup.
4. Stopping the Growth: This process doesn't continue forever, as that would lead to a tree that perfectly memorizes the training data but fails on new data (overfitting). The splitting stops when one of the following stopping criteria is met:
o The tree reaches a predefined maximum depth (max_depth).
o A node becomes perfectly pure (all its samples belong to a single class).
o The number of samples in a node is too small to make a meaningful split (min_samples_split).
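To make these purity calculations concrete, here is a small, self-contained sketch (separate from the loan example later in this section) that computes Gini impurity, entropy, and the information gain of one candidate split on a toy set of labels. The labels and the split are made up purely for illustration.

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_k * log2(p_k)) over the class proportions p_k.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Reduction in entropy from splitting `parent` into `left` and `right`,
    # with each child weighted by its share of the samples.
    n = len(parent)
    weighted_child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child_entropy

# Toy class labels: 1 = approved, 0 = denied.
parent = np.array([1, 1, 1, 0, 1, 0, 0, 1])
# A candidate split (e.g. "credit_score <= 680") sends each sample left or right.
left, right = np.array([0, 0, 0, 1]), np.array([1, 1, 1, 1])

print(f"Gini of parent:   {gini(parent):.3f}")   # 0.469 (mixed classes)
print(f"Gini of right:    {gini(right):.3f}")    # 0.000 (perfectly pure)
print(f"Information gain: {information_gain(parent, left, right):.3f}")  # roughly 0.55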
Making a Prediction
To classify a new, unseen data point, you start at the root of the tree. At each node, you answer the question based on the data point's feature values and follow the corresponding branch. This process continues until you reach a leaf node. The prediction is the majority class of the training samples that ended up in that leaf node.
Because you can follow this path of decisions, Decision Trees are known as "white box" models, as their logic is transparent and easy to interpret.
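To make the traversal concrete, the sketch below hand-codes the kind of flowchart a small loan-approval tree might represent. The feature thresholds are made up for illustration; they are not the rules the scikit-learn model in the full example below actually learns.

def predict_loan(credit_score, income):
    # Root node: test the most informative feature first (illustrative threshold).
    if credit_score <= 680:
        # Internal node: a second test refines the left branch.
        if income <= 60:
            return "Denied"    # leaf: majority class of the training samples that landed here
        return "Approved"      # leaf
    return "Approved"          # leaf: the right branch is already pure in this toy tree

print(predict_loan(credit_score=720, income=90))   # Approved
print(predict_loan(credit_score=610, income=45))   # Denied

The complete example below builds such a tree from a small loan dataset with scikit-learn, evaluates it on held-out data, and visualizes the learned rules.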
# --- 1. Import Necessary Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
# --- 2. Prepare Sample Data ---
# Create a dataset to classify whether a loan application should be approved.
# Features: Annual Income (in thousands), Credit Score (300-850), and Loan Amount (in thousands).
data = {
    'income': [50, 200, 120, 30, 80, 150, 90, 40, 180, 110, 75, 220],
    'credit_score': [700, 810, 680, 500, 750, 790, 650, 600, 820, 550, 720, 760],
    'loan_amount': [100, 500, 250, 80, 150, 400, 200, 120, 600, 220, 180, 550],
    'loan_approved': [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 1 = Yes, 0 = No
}
df = pd.DataFrame(data)
X = df[['income', 'credit_score', 'loan_amount']]
y = df['loan_approved']
# Split the data into a training set (for the model to learn from)
# and a testing set (to evaluate its performance on unseen data).
# `stratify=y` keeps the approved/denied ratio similar in both splits,
# which matters with a dataset this small.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# --- 3. Create and Train the Decision Tree Classifier Model ---
# Initialize the classifier.
# `criterion='gini'`: Use the Gini Impurity metric to measure the quality of a split.
# `max_depth=3`: Limit the tree's depth to 3 levels to prevent overfitting.
# `random_state=42`: Ensures the results are reproducible.
model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
# Train the model on the training data.
model.fit(X_train, y_train)
print("--- Model Training Complete ---")
# --- 4. Make Predictions and Evaluate the Model ---
# Use the trained model to make predictions on the unseen test data.
y_pred = model.predict(X_test)
# Calculate performance metrics.
accuracy = accuracy_score(y_test, y_pred)
# `labels=[0, 1]` pins the class order so that 0 maps to 'Denied' and 1 to 'Approved'.
conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])
class_report = classification_report(y_test, y_pred, labels=[0, 1], target_names=['Denied', 'Approved'])
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
print("\n--- Classification Report ---")
print(class_report)
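# Visualize the confusion matrix computed above as a heatmap
# (this is what the seaborn import is for).
# Rows are the actual classes, columns the predicted classes.
plt.figure(figsize=(5, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Denied', 'Approved'], yticklabels=['Denied', 'Approved'])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()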
# --- 5. Visualize the Decision Tree ---
# We can plot the actual tree to see the decision rules it learned.
plt.figure(figsize=(15, 10))
plot_tree(model,
          feature_names=['Income (k)', 'Credit Score', 'Loan Amount (k)'],
          class_names=['Denied', 'Approved'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title("Decision Tree for Loan Approval", fontsize=16)
plt.show()
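# (Optional) The same rules can also be printed as plain text with `export_text`,
# which underlines the "white box" nature of the model.
from sklearn.tree import export_text
print(export_text(model, feature_names=list(X.columns)))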
# --- 6. Predict on New, Unseen Applicants ---
# Create a new DataFrame with new data points to predict.
new_applicants = pd.DataFrame({
    'income': [65, 190],
    'credit_score': [580, 750],
    'loan_amount': [180, 450]
})
# Make predictions on the new data.
new_predictions = model.predict(new_applicants)
# Convert the numerical predictions (0 or 1) to meaningful labels.
predicted_labels = ['Approved' if pred == 1 else 'Denied' for pred in new_predictions]
print("\n--- Predictions for New Applicants ---")
for i, (_, applicant) in enumerate(new_applicants.iterrows()):
    print(f"Applicant Details: {applicant.to_dict()} ==> Predicted Status: '{predicted_labels[i].upper()}'")