An example:
Code (R + Python)
Python code
To manually perform a train-validation-test split, one can follow these steps:
- Split the dataset into training, validation, and test sets.
- Train the Lasso model on the training set using different alpha values (the lambda in Lasso is denoted as alpha in sklearn).
- Evaluate the model on the validation set to find the best alpha.
- Retrain the model on the combined training and validation sets using the best alpha.
- Finally, evaluate the model on the test set.
Here is the Python code to implement this:
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
# Generate some sample data
np.random.seed(42)
X = np.random.randn(100, 5)
y = X[:, 0] * 3 + np.random.randn(100)
# Step 1: Split data into training (60%), validation (20%), and test (20%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Step 2: Define a range of alpha (lambda) values for Lasso
alpha_values = np.logspace(-4, 1, 50) # A range of 50 alpha values between 10^-4 and 10^1
validation_errors = [] # To store validation errors for each alpha
# Step 3: Train Lasso on training set and evaluate on validation set for each alpha
for alpha in alpha_values:
lasso = Lasso(alpha=alpha)
lasso.fit(X_train, y_train) # Train on the training set
y_val_pred = lasso.predict(X_val) # Predict on validation set
val_mse = mean_squared_error(y_val, y_val_pred) # Calculate MSE on validation set
validation_errors.append(val_mse)
# Find the best alpha that minimizes the validation MSE
best_alpha = alpha_values[np.argmin(validation_errors)]
print(f"Best alpha (lambda) found: {best_alpha}")
# Step 4: Retrain the model using the best alpha on the combined train and validation sets
X_train_val = np.vstack((X_train, X_val))
y_train_val = np.hstack((y_train, y_val))
best_lasso_model = Lasso(alpha=best_alpha)
best_lasso_model.fit(X_train_val, y_train_val)
# Step 5: Evaluate the model on the test set
y_test_pred = best_lasso_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
print(f"Test Mean Squared Error: {test_mse}")
Explanation:
- Data Splitting: We first split the dataset into 60% training, 20% validation, and 20% test sets using train_test_split.
- Alpha Search: A loop trains the Lasso model with different values of alpha (the regularization parameter) on the training set. For each alpha, we evaluate the model's performance on the validation set and store the validation MSE.
- Finding the Best Alpha: The best alpha is the one that minimizes the validation MSE.
- Model Retraining: Once the best alpha is identified, we retrain the Lasso model on the combined training and validation sets using this best alpha.
- Evaluation on the Test Set: Finally, the retrained model is evaluated on the test set, and the test MSE is reported.
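As an aside (a sketch of ours, not part of the original post), the same fixed-validation-set search can be written with scikit-learn's built-in tools: PredefinedSplit tells GridSearchCV to evaluate on exactly the held-out validation rows, and the default refit=True then retrains on the combined train and validation data with the best alpha. The variable names X_search, y_search, and split_index below are ours.
# A minimal sketch using PredefinedSplit, reusing X_train, X_val, y_train,
# y_val, X_test, y_test, and alpha_values from the code above
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X_search = np.vstack((X_train, X_val))
y_search = np.hstack((y_train, y_val))
# -1 marks rows used only for training; 0 marks the single validation fold
split_index = np.concatenate([np.full(len(y_train), -1, dtype=int),
                              np.zeros(len(y_val), dtype=int)])
search = GridSearchCV(Lasso(), {"alpha": alpha_values},
                      cv=PredefinedSplit(test_fold=split_index),
                      scoring="neg_mean_squared_error")
search.fit(X_search, y_search)  # refit=True retrains on train + validation
print(f"Best alpha: {search.best_params_['alpha']}")
print(f"Test MSE: {mean_squared_error(y_test, search.predict(X_test))}")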
R code
To perform Lasso regression with a train-validation-test split and find the optimal value of the regularization parameter lambda, we follow these steps:
- Split the data into train, validation, and test sets.
- Train Lasso models with different lambda values on the training set.
- Use the validation set to find the best lambda.
- Retrain the model with the best lambda on the combined training and validation sets.
- Evaluate the final model on the test set.
We'll use the glmnet package, which is popular for regularized regression.
R Code:
# Load necessary libraries
library(glmnet)
library(caret)
# Generate sample data
set.seed(42)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- 3 * X[, 1] + rnorm(100)
# Step 1: Split the data into train (60%), validation (20%), and test (20%) sets
set.seed(42)
train_index <- createDataPartition(y, p = 0.6, list = FALSE)
X_train <- X[train_index, ]
y_train <- y[train_index]
X_temp <- X[-train_index, ]
y_temp <- y[-train_index]
# Split the remaining 40% into validation (20%) and test (20%)
set.seed(42)
val_index <- createDataPartition(y_temp, p = 0.5, list = FALSE)
X_val <- X_temp[val_index, ]
y_val <- y_temp[val_index]
X_test <- X_temp[-val_index, ]
y_test <- y_temp[-val_index]
# Step 2: Train Lasso models with different lambda values on training set
lambda_values <- 10^seq(-4, 1, length = 50) # A range of lambda values
lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = lambda_values)
# Step 3: Find the best lambda by evaluating on the validation set
# Note: glmnet sorts the supplied lambda sequence into decreasing order,
# so match prediction columns against lasso_model$lambda, not lambda_values
val_predictions <- predict(lasso_model, X_val)
val_errors <- apply(val_predictions, 2, function(pred) mean((y_val - pred)^2))
# Get the best lambda with the minimum validation error
best_lambda <- lasso_model$lambda[which.min(val_errors)]
cat("Best lambda found:", best_lambda, "\n")
# Step 4: Retrain the Lasso model using the best lambda on combined train and validation sets
X_train_val <- rbind(X_train, X_val)
y_train_val <- c(y_train, y_val)
final_model <- glmnet(X_train_val, y_train_val, alpha = 1, lambda = best_lambda)
# Step 5: Evaluate the model on the test set
y_test_pred <- predict(final_model, X_test)
test_mse <- mean((y_test - y_test_pred)^2)
cat("Test Mean Squared Error:", test_mse, "\n")
Explanation:
- Data Splitting: The createDataPartition function from caret is used to split the data into training, validation, and test sets.
- Lasso Training: The glmnet function trains the Lasso model with different lambda values on the training set.
- Validation: We use the predict function to make predictions on the validation set for each lambda value and calculate the validation MSE.
- Finding the Best Lambda: The best lambda is selected as the one with the smallest MSE on the validation set.
- Retraining and Testing: Finally, we retrain the model on the combined training and validation sets with the best lambda and evaluate its performance on the test set.