Random Forest Algorithm

#algorithms #machinelearning #datascience

What is a Random Forest?

A Random Forest is an ensemble model: it combines many decision trees to make better predictions.
Each tree makes its own prediction, and the forest averages (for regression) or votes (for classification) to decide the final answer.
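
To make the two aggregation rules concrete, here is a minimal sketch using NumPy only. The per-tree predictions are made-up numbers chosen for illustration, and `majority_vote` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical predictions from three trees for four samples (made-up numbers).
regression_preds = np.array([
    [200.0, 310.0, 150.0, 420.0],  # tree 1
    [220.0, 290.0, 160.0, 400.0],  # tree 2
    [210.0, 300.0, 140.0, 410.0],  # tree 3
])

# Regression: the forest's answer is the mean across trees, per sample.
forest_regression = regression_preds.mean(axis=0)

# Hypothetical class labels predicted by the same three trees.
class_preds = np.array([
    [0, 1, 1, 2],  # tree 1
    [0, 1, 2, 2],  # tree 2
    [1, 1, 2, 2],  # tree 3
])


def majority_vote(votes):
    """Return the most common label per column (ties go to the smallest label)."""
    n_classes = int(votes.max()) + 1
    # counts[c, j] = number of trees that predicted class c for sample j
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)


forest_classification = majority_vote(class_preds)

print(forest_regression)
print(forest_classification)
```

Each column is one sample and each row is one tree, so aggregating along axis 0 combines the trees' opinions for that sample.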

What is an Ensemble Model?

An ensemble model combines multiple individual machine learning models to produce predictions that are usually more accurate and robust than those of any single member. It aggregates the outputs of a collection of "base" models through voting or averaging, or through more structured schemes such as bagging, boosting, and stacking. Random forests belong to the bagging family. The idea resembles "the wisdom of the crowd": many imperfect opinions, combined, tend to beat any one of them.

Why Use Random Forests?

Reduces Overfitting: A single decision tree can memorize its training data. A random forest trains many trees, each on a different random sample of the data, and combines them, so it generalizes better.
Handles Both Task Types: Works for regression (numeric targets) and classification (categorical targets).
Feature Importance: Reports how much each feature contributes to the model's predictions.
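
The overfitting point can be checked with a small experiment. The sketch below is not from the original post: it uses synthetic data from scikit-learn's `make_regression` (an assumption, chosen so the snippet runs without any file) and compares an unconstrained single tree with a forest on held-out data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for a real dataset).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# A fully grown single tree versus a forest of 100 trees.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# .score() returns R^2; compare training fit with held-out fit.
for name, model in [("single tree", tree), ("random forest", forest)]:
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    print(f"{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A large gap between the training and test scores of the single tree is the signature of memorization; the forest's gap is typically smaller.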

How Does It Work?

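Build many decision trees: Each tree is trained on a random sample of the rows, drawn with replacement (a bootstrap sample), and considers only a random subset of the features at each split.
Make predictions: For regression, average the predictions of all trees. For classification, take a majority vote (scikit-learn's implementation averages the trees' predicted class probabilities instead).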
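
The two steps above can be sketched by hand with plain decision trees. This is an illustrative approximation under stated assumptions, not how scikit-learn builds its forests internally: the data is synthetic, and the tree count (25) and `max_features=2` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (assumption: stands in for a real dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=10.0, random_state=0
)

n_trees = 25
trees = []
for i in range(n_trees):
    # Step 1a: bootstrap sample, i.e. draw n row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Step 1b: max_features=2 makes every split look at 2 random features.
    tree = DecisionTreeRegressor(max_features=2, random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Step 2: the forest prediction is the average of the per-tree predictions.
per_tree = np.stack([t.predict(X_demo[:5]) for t in trees])  # shape (25, 5)
forest_pred = per_tree.mean(axis=0)                          # shape (5,)

print(forest_pred)
```

`RandomForestRegressor` wraps this same loop (plus parallelism and other refinements) behind `fit` and `predict`.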

The example below trains a RandomForestRegressor on a housing dataset with scikit-learn:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load your data
df = pd.read_csv('housing.csv')
X = df[['RM', 'LSTAT', 'PTRATIO']]
y = df['MEDV']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
rf_model = RandomForestRegressor(n_estimators=100, max_depth=6, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf_model.predict(X_test)
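# Take the square root of MSE directly; the squared=False argument is no longer accepted by recent scikit-learn releases
rmse = mean_squared_error(y_test, y_pred) ** 0.5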
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")

# Feature importance
importances = rf_model.feature_importances_
for feature, importance in zip(X.columns, importances):
    print(f"{feature}: {importance:.2f}")

How to Tune Random Forests
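n_estimators: Number of trees. More trees usually give more stable predictions, with diminishing returns, at the cost of longer training and prediction time.
max_depth: Maximum depth of each tree. Shallower trees are less likely to overfit; the default (None) grows each tree until its leaves are pure or too small to split.
max_features: Number of features considered at each split. Smaller values make the trees less correlated with one another.
min_samples_split / min_samples_leaf: Minimum number of samples required to split a node, and minimum number required in a leaf. Larger values produce simpler trees.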
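
One common way to choose these values is a cross-validated grid search. The sketch below is a starting point rather than a recommended configuration: the data is synthetic and the candidate values in `param_grid` are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: replace with your own X_train, y_train).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=15.0, random_state=0
)

# Candidate values for the hyperparameters discussed above (arbitrary picks).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8, None],
    "max_features": [0.5, 1.0],
    "min_samples_leaf": [1, 5],
}

# 3-fold cross-validation over every combination, scored by (negated) RMSE.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:.2f}")
```

scikit-learn negates error metrics so that higher is always better, which is why the best score is flipped back before printing.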

Visualizing Feature Importance
Random Forests can show which features matter most:

import matplotlib.pyplot as plt

plt.bar(X.columns, rf_model.feature_importances_)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importance in Random Forest')
plt.show()