Understanding Random Forests for Quantile Prediction, with a scikit-learn Implementation

Random forests can quantify uncertainty by predicting quantiles (e.g., the 5th and 95th percentiles) instead of a single point estimate, giving a more nuanced picture of how outcomes may vary. This is valuable for risk assessment: decision-makers see both the lower and upper bounds of a prediction, not just its center. Because a random forest aggregates the predictions of many trees, the spread across those trees reflects variability in the data, and practitioners can choose quantiles that match their specific risk tolerance.

First, let's quickly review what a random forest is: an ensemble of decision trees, each trained on a bootstrap sample of the data (and typically considering only a random subset of features at each split). For regression, the forest's point prediction is the average of the individual trees' predictions.
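To make that concrete, here is a minimal sketch of the bagging idea at the heart of a random forest, hand-built from scikit-learn's DecisionTreeRegressor. The helper names fit_bagged_trees and predict_bagged are our own, and this is only an illustration: real random forests, including scikit-learn's, also randomize the features considered at each split.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=100, seed=42):
    """Train n_trees decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    """Average the per-tree predictions -- the ensemble's point estimate."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)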

scikit-learn Implementation

Importing Libraries

We start by importing the necessary libraries.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np

Load and Split the Dataset

We load the California Housing dataset and split it into training and testing sets.

# Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Standardize the Features

We standardize the features. Random forests themselves are insensitive to feature scaling, but standardizing keeps the pipeline consistent with models that do benefit from it.

# Standardize the features for better performance in some models
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Define Quantiles for Lower and Upper Bounds

We define the quantiles that will be used to estimate the lower and upper bounds of predictions.

# Define quantiles for lower and upper bounds
low_quantile = 0.05
high_quantile = 0.95
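As a quick aside (our addition, not part of the original walkthrough): the 5th and 95th percentiles of a sample bracket its central 90%. A toy check with NumPy, reusing the np import from above:

# Toy check: 5th/95th percentiles of a standard normal sample
sample = np.random.default_rng(0).normal(size=10_000)
print(np.percentile(sample, [5, 95]))  # approximately [-1.645, 1.645]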

Initialize and Train the Random Forest Regressor

We initialize and train a Random Forest Regressor on the training data.

# Initialize and train the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

Predict Quantiles and Mean for Random Forest

We use the trained Random Forest Regressor to estimate bounds on the test data: taking the 5th and 95th percentiles of the individual trees' predictions gives a heuristic prediction interval around the forest's mean (central) estimate. Note that this treats the spread across trees as a proxy for the conditional distribution; dedicated quantile regression forests model conditional quantiles more directly.

# Collect the per-tree predictions once (shape: n_trees x n_test_samples)
tree_preds_rf = np.stack([tree.predict(X_test) for tree in rf_model.estimators_])

# Take the lower and upper percentiles across trees for each test point
low_quantile_pred_rf = np.percentile(tree_preds_rf, low_quantile * 100, axis=0)
high_quantile_pred_rf = np.percentile(tree_preds_rf, high_quantile * 100, axis=0)

# Predict the mean (central) estimate for Random Forest
mean_pred_rf = rf_model.predict(X_test)

Display Results for Random Forest

We print the mean predictions and the lower and upper quantile predictions for the first 10 instances in the test set.

# Display results for Random Forest
print("Random Forest Predictions:")
print("Mean Predictions:", mean_pred_rf[:10])
print(f"{int(low_quantile*100)}th Percentile Predictions (Lower Bound):", low_quantile_pred_rf[:10])
print(f"{int(high_quantile*100)}th Percentile Predictions (Upper Bound):", high_quantile_pred_rf[:10])

Summary

  • Import Libraries: Import required libraries.
  • Load and Split the Dataset: Load the California Housing dataset and split it into training and testing sets.
  • Standardize the Features: Standardize the features for better performance.
  • Define Quantiles: Define the quantiles for lower and upper bounds.
  • Train the Model: Train a Random Forest Regressor.
  • Predict Quantiles and Mean: Predict the lower and upper quantiles and the mean for the test set.
  • Display Results: Print the predictions.

Full Code

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features for better performance in some models
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define quantiles for lower and upper bounds
low_quantile = 0.05
high_quantile = 0.95

# --- Random Forest with Quantile Estimation ---

# Initialize and train the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Collect the per-tree predictions once (shape: n_trees x n_test_samples)
tree_preds_rf = np.stack([tree.predict(X_test) for tree in rf_model.estimators_])

# Take the lower and upper percentiles across trees for each test point
low_quantile_pred_rf = np.percentile(tree_preds_rf, low_quantile * 100, axis=0)
high_quantile_pred_rf = np.percentile(tree_preds_rf, high_quantile * 100, axis=0)

# Predict the mean (central) estimate for Random Forest
mean_pred_rf = rf_model.predict(X_test)

# Display results for Random Forest
print("Random Forest Predictions:")
print("Mean Predictions:", mean_pred_rf[:10])
print(f"{int(low_quantile*100)}th Percentile Predictions (Lower Bound):", low_quantile_pred_rf[:10])
print(f"{int(high_quantile*100)}th Percentile Predictions (Upper Bound):", high_quantile_pred_rf[:10])
