Lasso Regression and LassoCV: methods & code examples in Python and R

The Lasso (Least Absolute Shrinkage and Selection Operator) is a regression technique that enhances prediction accuracy and interpretability by applying L1 regularization to shrink coefficients. Unlike traditional regression methods, Lasso forces some coefficients to become exactly zero, effectively performing feature selection and reducing model complexity. This makes it particularly useful in high-dimensional datasets where many features may be irrelevant or redundant. However, Lasso can sometimes over-shrink coefficients, leading to biased estimates, so careful tuning of the regularization parameter (alpha) via cross-validation is essential. Since its introduction in Tibshirani's 1996 paper "Regression shrinkage and selection via the lasso" (Journal of the Royal Statistical Society, Series B, 58(1), 267-288), it has been widely used in machine learning, economics, and bioinformatics for its ability to balance complexity and predictive power.

Details:

  • L_1 Penalty: Lasso uses the L_1 penalty, which is the sum of the absolute values of the coefficients.
  • Mathematical Form: The Lasso regression objective function is:
    \text{Minimize} \left( \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |w_j| \right)
    where y_i are the actual values, \hat{y}_i are the predicted values, w_j are the coefficients, n is the number of observations, p is the number of features, and \lambda is the regularization parameter.
  • Effect: Lasso can shrink some coefficients to exactly zero, effectively performing feature selection by excluding some features from the model.

On the L_1 Penalty of Lasso:

  • The L_1 penalty adds the absolute value of the magnitude of the coefficients as a penalty term to the loss function.
  • It encourages sparsity, meaning it can reduce some coefficients to zero, leading to simpler models with fewer features.
  • Useful for feature selection.
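
As a quick numerical check of the objective above, the penalized loss can be computed directly with NumPy. This is only a sketch with made-up data and hypothetical variable names (the intercept is omitted for simplicity); λ here plays the role of the alpha parameter in scikit-learn's Lasso:

import numpy as np

def lasso_objective(X, y, w, alpha):
    # (1 / (2n)) * sum of squared residuals + alpha * sum of |w_j|
    n = X.shape[0]
    residuals = y - X @ w
    return np.sum(residuals ** 2) / (2 * n) + alpha * np.sum(np.abs(w))

# Toy example with made-up numbers
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = np.array([0.5, 0.0, -1.2])
print(lasso_objective(X, y, w, alpha=0.1))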

Implementation:

In the following, we will:

  1. Import Libraries: Essential libraries are imported for machine learning functionalities.
  2. Load Dataset: A synthetic dataset is created using make_regression with 100 samples and 20 features.
  3. Split Dataset: The dataset is split into training and test sets using train_test_split.
  4. Train Lasso Model: A Lasso regression model is instantiated with a specified alpha value (regularization strength) and trained using the training data.
  5. Evaluate Model: The model’s performance is evaluated on both the training and test sets using mean squared error (MSE) and R² score.
  6. Visualize Results: The coefficients of the Lasso model are plotted to visualize the effect of L1 regularization on the feature coefficients.

You can adjust the alpha parameter to see how different levels of regularization affect the model’s performance and the coefficients; a short sketch after the code block below shows this. We discuss how to choose alpha with cross-validation in the LassoCV section later in this post.

Python codes:

# Step 1: Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Load the dataset (using a synthetic dataset for this example)
X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=42)

# Step 3: Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the Lasso regression model
alpha = 0.1  # Regularization strength
lasso = Lasso(alpha=alpha)
lasso.fit(X_train, y_train)

# Step 5: Evaluate the model
y_train_pred = lasso.predict(X_train)
y_test_pred = lasso.predict(X_test)

print("Training set evaluation:")
print("Mean Squared Error:", mean_squared_error(y_train, y_train_pred))
print("R^2 Score:", r2_score(y_train, y_train_pred))

print("\nTest set evaluation:")
print("Mean Squared Error:", mean_squared_error(y_test, y_test_pred))
print("R^2 Score:", r2_score(y_test, y_test_pred))

# Step 6: Visualize the results (if applicable, here we'll plot the coefficients)
plt.figure(figsize=(10, 6))
plt.plot(lasso.coef_, marker='o', linestyle='none')
plt.title('Lasso Regression Coefficients')
plt.xlabel('Feature Index')
plt.ylabel('Coefficient Value')
plt.grid(True)
plt.show()
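
As mentioned above, you can refit the model with different alpha values and watch how many coefficients survive. A minimal sketch that reuses X_train and y_train from the script above:

# Optional: refit over a range of alpha values and count surviving coefficients
for a in [0.01, 0.1, 1, 10, 100]:
    model = Lasso(alpha=a).fit(X_train, y_train)
    print(f"alpha={a:<6} nonzero coefficients: {np.sum(model.coef_ != 0)}")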


Analyzing the results: the output of the script looks like this:

Training set evaluation:
Mean Squared Error: 0.11797011924966153
R^2 Score: 0.9999962881305299

Test set evaluation:
Mean Squared Error: 0.1186764119872078
R^2 Score: 0.9999927871282901

As the coefficient plot shows, most of the coefficients are shrunk exactly to 0. From the remaining nonzero coefficients, we can identify the features that are important for predicting the output.
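
To read those important features off programmatically rather than from the plot, you can inspect the nonzero entries of lasso.coef_ (a short sketch continuing from the Python code above):

# Indices of the features Lasso kept (nonzero coefficients)
selected = np.where(lasso.coef_ != 0)[0]
print("Selected feature indices:", selected)
print("Number of selected features:", len(selected))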

R code

The corresponding code in R is:

# Step 1: Install and load the necessary packages
install.packages("glmnet")
install.packages("caret")
install.packages("e1071")
library(glmnet)
library(caret)
library(e1071)

# Step 2: Load the dataset (using a synthetic dataset for this example)
set.seed(42)
n <- 100
p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)
y <- X %*% beta + rnorm(n) * 0.1

# Step 3: Split the dataset into training and test sets
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]

# Step 4: Train the Lasso regression model
# In glmnet, alpha = 1 selects the lasso (L1) penalty; the regularization
# strength (lambda) is chosen by cross-validation inside cv.glmnet
alpha <- 1
lasso_model <- cv.glmnet(X_train, y_train, alpha = alpha)

# Step 5: Evaluate the model
y_train_pred <- predict(lasso_model, X_train, s = "lambda.min")
y_test_pred <- predict(lasso_model, X_test, s = "lambda.min")

train_mse <- mean((y_train - y_train_pred)^2)
test_mse <- mean((y_test - y_test_pred)^2)
train_r2 <- 1 - sum((y_train - y_train_pred)^2) / sum((y_train - mean(y_train))^2)
test_r2 <- 1 - sum((y_test - y_test_pred)^2) / sum((y_test - mean(y_test))^2)

cat("Training set evaluation:\n")
cat("Mean Squared Error:", train_mse, "\n")
cat("R^2 Score:", train_r2, "\n")

cat("\nTest set evaluation:\n")
cat("Mean Squared Error:", test_mse, "\n")
cat("R^2 Score:", test_r2, "\n")

# Step 6: Visualize the results (if applicable, here we'll plot the coefficients)
# coef() returns a sparse matrix (intercept first); convert it to a plain numeric vector for plotting
lasso_coefficients <- as.numeric(coef(lasso_model, s = "lambda.min"))

plot(lasso_coefficients, main = "Lasso Regression Coefficients", xlab = "Feature Index", ylab = "Coefficient Value", pch = 16, col = "blue")

Here, we:

  1. Install and Load Packages: Install and load the necessary packages (glmnet for Lasso regression, caret for data partitioning, and e1071 for additional functions).
  2. Load Dataset: A synthetic dataset is created with 100 samples and 20 features.
  3. Split Dataset: The dataset is split into training and test sets using createDataPartition from the caret package.
  4. Train Lasso Model: A Lasso regression model is trained using cv.glmnet with cross-validation to find the optimal lambda (regularization parameter).
  5. Evaluate Model: The model’s performance is evaluated on both the training and test sets using mean squared error (MSE) and R² score.
  6. Visualize Results: The coefficients of the Lasso model are plotted to visualize the effect of L1 regularization on the feature coefficients.

Drawbacks of Lasso regression

  • Handling of Correlated Variables:
    • When dealing with highly correlated variables, lasso tends to select only one of them and set the coefficients of the others to zero, which can be problematic when all of the correlated variables are actually relevant (see the sketch after this list).
    • This somewhat arbitrary selection can make the model’s interpretation challenging, as it may not accurately reflect the true relationships between the variables.
  • Limitations When p > n:
    • In situations where the number of predictors (p) is greater than the number of observations (n), lasso is limited in the number of variables it can select.
    • This limitation can be a significant issue in fields like genomics, where datasets often have many more features than samples.
  • Bias:
    • The L1 penalty used in lasso can introduce bias into the model by shrinking coefficients towards zero. While this helps with feature selection and reduces variance, it can also lead to underfitting.
  • Sensitivity to Data Scaling:
    • Lasso is sensitive to the scaling of the input features, so it is important to scale your data before fitting a lasso model.
  • Not always the best for prediction accuracy:
    • While lasso is great for feature selection, sometimes ridge regression will produce a model with higher prediction accuracy, due to the bias that lasso introduces.
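
To see the correlated-variables behaviour from the first point above, here is a small synthetic sketch (made-up data; which of the two coefficients ends up near zero can vary with the data and solver):

import numpy as np
from sklearn.linear_model import Lasso

# Two almost identical (highly correlated) features that both drive y
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 3 * x2 + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
print("Coefficients:", lasso.coef_)  # typically most of the weight lands on one feature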

In summary, while lasso is valuable for feature selection and handling high-dimensional data, it’s essential to be aware of its limitations, particularly when dealing with correlated variables or when the number of predictors exceeds the number of observations.

Extra note:

The importance of data normalization:

Lasso’s L1 penalty penalizes the absolute magnitude of the coefficients equally, regardless of the original scale of the features. Without normalization, a feature measured on a smaller scale (which needs a larger coefficient to exert the same influence) is penalized more heavily than a feature measured on a larger scale (whose coefficient is naturally smaller). The regularization effect then depends on the arbitrary units of measurement, and Lasso may incorrectly shrink or eliminate the coefficients of important variables simply because their scale requires larger coefficient values. This undermines both the interpretability of the model and its predictive performance, as valuable information carried by important variables can be lost. Consequently, it is crucial to apply feature scaling, such as standardization or normalization, before fitting Lasso. Doing so ensures that each feature contributes on an equal footing and is not penalized disproportionately because of its units.
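
In scikit-learn, a convenient way to guarantee this is to chain the scaler and the Lasso estimator in a Pipeline, so the scaling is learned only on the training folds. A minimal sketch (the alpha value here is arbitrary):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=42)

# The scaler is fit inside each training fold, so no information leaks from the validation folds
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print("Cross-validated R^2:", scores.mean())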

K-Fold Cross-Validation for Optimal Alpha Selection

K-fold cross-validation is commonly used to select the optimal alpha for Lasso; it makes the parameter tuning process both reliable and robust. To perform cross-validation for Lasso, a set of candidate α values must be defined over which model performance will be evaluated. Selecting an appropriate range and density for this grid is important for identifying the optimal regularization strength effectively.

  • Range: The grid should typically cover several orders of magnitude. It should include very small values (e.g., 10^-4, 10^-3), which approximate OLS behavior and allow for complex models, up to large values (e.g., 1, 10, 100 or more) that induce significant shrinkage and high sparsity, leading to simpler models. The maximum relevant α (α_max) can often be determined analytically: it is the smallest value for which all coefficients are exactly zero (see the sketch after this list).
  • Spacing: Since Lasso’s effects (coefficient shrinkage and variable selection) are often more pronounced for small changes in α when α itself is small, it is common practice to space the candidate α values logarithmically rather than linearly. Functions like numpy.logspace (or numpy.arange applied to the logarithm of α) are suitable for generating such sequences.
  • Automated Paths: Tools like scikit-learn’s LassoCV can automatically generate a path of α values if none are explicitly provided. This path generation is controlled by parameters like n_alphas (the number of α values on the path, default 100) and eps (the ratio α_min/α_max, default 0.001). This relieves the user from manually specifying the grid but offers less direct control.
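
For example, a log-spaced grid can be built with numpy.logspace, and α_max can be computed analytically. The sketch below uses made-up stand-in data and assumes scikit-learn's (1/(2n))·MSE + α·||w||_1 objective with centered data, so it is an illustration rather than part of the examples below:

import numpy as np

# Hypothetical training data standing in for X_train, y_train
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
y_train = rng.normal(size=100)

# A log-spaced candidate grid spanning several orders of magnitude
alphas = np.logspace(-4, 2, 50)

# Analytic alpha_max: the smallest alpha at which every coefficient is exactly zero
Xc = X_train - X_train.mean(axis=0)  # centering mimics fitting an intercept
yc = y_train - y_train.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / len(yc)
print("alpha_max ~", alpha_max)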

Code

In the following code, let’s implement Lasso regression with automatic alpha selection using cross-validation (LassoCV) on the California housing dataset. The script starts by loading the dataset, splitting it into training and testing sets, and standardizing the features with StandardScaler to improve model performance. The LassoCV model is trained with 10-fold cross-validation to find the optimal regularization parameter (alpha), which helps prevent overfitting by shrinking irrelevant feature coefficients. After selecting the best alpha, predictions are made on the test set, and the mean squared error (MSE) is calculated to evaluate performance. Finally, the script visualizes the learned feature coefficients using a bar chart, showing the relative importance of each feature in predicting housing prices:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Perform LassoCV with 10-fold cross-validation
lasso_cv = LassoCV(cv=10, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)

# Get best alpha
best_alpha = lasso_cv.alpha_
print(f"Best alpha selected by LassoCV: {best_alpha:.4f}")

# Predict and evaluate
y_pred = lasso_cv.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
print(f"Lasso Model MSE: {mse:.4f}")

# Plot feature coefficients
plt.figure(figsize=(10, 6))
plt.bar(data.feature_names, lasso_cv.coef_)
plt.xticks(rotation=45)
plt.xlabel("Feature")
plt.ylabel("Coefficient Value")
plt.title("Lasso Regression Feature Importance")
plt.show()


Output:

Best alpha selected by LassoCV: 0.0009
Lasso Model MSE: 0.5547

The figure above visualizes the learned feature coefficients as a bar chart, showing the relative importance of each feature in predicting housing prices.
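
If you prefer the same information as text, you can pair each feature name with its coefficient and sort by magnitude (a short sketch reusing data and lasso_cv from the code above):

# Pair each feature name with its coefficient, largest magnitude first
coef_pairs = sorted(zip(data.feature_names, lasso_cv.coef_),
                    key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in coef_pairs:
    print(f"{name:>12}: {coef: .4f}")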

Next, let’s visualize the relationship between the regularization strength (alpha) and mean squared error (MSE) in a Lasso regression model using cross-validation. It extracts the range of alpha values tested by LassoCV and computes the average MSE across folds for each alpha, helping assess the impact of regularization on model performance. The results are plotted on a logarithmic scale for better visualization, showing how increasing alpha affects prediction error, typically revealing an optimal balance between underfitting and overfitting:

import matplotlib.pyplot as plt

# Extract alpha values and corresponding MSE scores
alphas = lasso_cv.alphas_
mse_values = lasso_cv.mse_path_.mean(axis=1)  # Average over cross-validation folds

# Plot MSE vs. alpha
plt.figure(figsize=(8, 6))
plt.plot(alphas, mse_values, marker='o', linestyle='-')
plt.xscale("log")  # Log scale for better visualization
plt.xlabel("Alpha")
plt.ylabel("Mean Squared Error")
plt.title("LassoCV: MSE vs. Alpha")
plt.grid(True)
plt.show()


Output: (plot of mean squared error versus alpha, with alpha on a logarithmic scale)

Adjust search range for alpha

You can adjust the range of alpha values in LassoCV by specifying a custom set of values. To search within the range 10^-4 to 10^-2, modify the alphas parameter like this:

# Define a smaller range for alpha values
alpha_values = np.logspace(-4, -2, 100)  # 100 values between 10^-4 and 10^-2

# Perform LassoCV with the adjusted alpha range
lasso_cv = LassoCV(alphas=alpha_values, cv=10, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)

# Get best alpha
best_alpha = lasso_cv.alpha_
print(f"Best alpha selected by LassoCV: {best_alpha:.4f}")

# Re-evaluate on the test set with the refitted model
y_pred = lasso_cv.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
print(f"Lasso Model MSE: {mse:.4f}")


Output:

Best alpha selected by LassoCV: 0.0008
Lasso Model MSE: 0.5547
