Backward feature selection + example

Backward feature selection starts with the full model containing all features and iteratively removes the feature whose removal leaves the Adjusted R-squared highest, stopping once any further removal would cause a significant drop in the score.
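Before the worked example, here is a minimal sketch of this loop in Python, using statsmodels to fit each regression. The function name backward_eliminate, the tol threshold for what counts as a "significant" drop, and the exact stopping rule are my own illustrative choices, not a standard API:

```python
import statsmodels.api as sm

def backward_eliminate(X, y, tol=0.02):
    """Greedy backward feature selection scored by adjusted R-squared.

    X: pandas DataFrame of candidate features; y: target values.
    tol: how big a drop in adjusted R-squared counts as "significant"
         (an illustrative choice, not a standard value).
    Returns the list of surviving feature names.
    """
    features = list(X.columns)
    current = sm.OLS(y, sm.add_constant(X[features])).fit().rsquared_adj
    while len(features) > 1:
        # Adjusted R-squared of every model with exactly one feature dropped.
        scores = {}
        for f in features:
            kept = [g for g in features if g != f]
            scores[f] = sm.OLS(y, sm.add_constant(X[kept])).fit().rsquared_adj
        candidate = max(scores, key=scores.get)  # least harmful removal
        if scores[candidate] < current - tol:
            break  # every removal now causes a significant drop: stop
        features.remove(candidate)
        current = scores[candidate]
    return features
```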

Example: amount of nuts collected by squirrels

Let’s take a practical example of backward feature selection in a forest environment.

Goal: Predict the number of nuts collected by squirrels (Y) based on several environmental features such as:

  • X_1: Number of trees in the area
  • X_2: Temperature in the forest
  • X_3: Amount of rainfall
  • X_4: Distance to the nearest water source
  • X_5: Number of competing squirrels

The initial linear regression model would be:
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon
Where:

  • Y is the number of nuts collected by squirrels,
  • X_1, X_2, X_3, X_4, X_5 are the features,
  • \beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5 are the regression coefficients,
  • \epsilon is the error term.
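To make the setup concrete in code, the snippet below fabricates a synthetic data set for these five features (every distribution, coefficient, and the noise level is invented purely for illustration) and fits the full model with statsmodels. The same X and y are reused in the later snippets:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200  # hypothetical number of observed foraging plots

X = pd.DataFrame({
    "X1_trees":       rng.poisson(30, n),       # number of trees in the area
    "X2_temperature": rng.normal(15, 5, n),     # forest temperature (deg C)
    "X3_rainfall":    rng.gamma(2.0, 10.0, n),  # rainfall (mm)
    "X4_water_dist":  rng.uniform(0, 500, n),   # distance to water (m)
    "X5_competitors": rng.poisson(10, n),       # competing squirrels
})

# Invented "ground truth": only temperature and rainfall actually matter.
y = 5 + 0.8 * X["X2_temperature"] + 0.5 * X["X3_rainfall"] + rng.normal(0, 4, n)

full_fit = sm.OLS(y, sm.add_constant(X)).fit()
print(f"full model adjusted R^2: {full_fit.rsquared_adj:.2f}")
```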

Step-by-Step Process of Backward Feature Selection

Step 1: Train the Model with All Features
The initial model is trained using all the features:

Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5

Recall that adjusted R-squared adjusts the R-squared value to account for the number of predictors in the model, making it more suitable for comparing models with different numbers of features. The Adjusted R-squared is given by:
\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
Where:

  • R^2 is the R-squared value of the model,
  • n is the number of observations,
  • p is the number of predictors (features).
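As a quick numeric sanity check of this formula (the R-squared value, n, and p below are invented for illustration):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Penalise R-squared for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Invented numbers: R^2 = 0.78 from n = 100 observations and p = 5 predictors.
print(adjusted_r2(0.78, 100, 5))  # ~0.768, slightly below the raw R^2
```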

Calculate the Adjusted R-squared for this model. Assume:
\bar{R}^2_{\text{full}} = 0.75

Step 2: Remove One Feature and Evaluate

Remove one feature at a time and recalculate the Adjusted R-squared for each reduced model; a code sketch of this loop follows the list below.

  1. Model without X_1:
    Y = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5
    Assume the Adjusted R-squared \bar{R}^2 = 0.72.
  2. Model without X_2:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5
    Assume the Adjusted R-squared \bar{R}^2 = 0.73.
  3. Model without X_3:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5
    Assume the Adjusted R-squared \bar{R}^2 = 0.60.
  4. Model without X_4:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_5 X_5
    Assume the Adjusted R-squared \bar{R}^2 = 0.71.
  5. Model without X_5:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4
    Assume the Adjusted R-squared \bar{R}^2 = 0.74.
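In code, this leave-one-out-at-a-time step is a single loop (continuing with the hypothetical X and y from the earlier snippet):

```python
import statsmodels.api as sm

# X and y are the synthetic squirrel data from the earlier snippet.
for dropped in X.columns:
    reduced = X.drop(columns=dropped)
    fit = sm.OLS(y, sm.add_constant(reduced)).fit()
    print(f"without {dropped}: adjusted R^2 = {fit.rsquared_adj:.2f}")
```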

Step 3: Choose the Feature to Remove

Compare the Adjusted R-squared values:

  • Removing X_1: \bar{R}^2 = 0.72
  • Removing X_2: \bar{R}^2 = 0.73
  • Removing X_3: \bar{R}^2 = 0.60 (significant decrease)
  • Removing X_4: \bar{R}^2 = 0.71
  • Removing X_5: \bar{R}^2 = 0.74 (highest Adjusted R-squared)

Since removing X_5 leaves the highest Adjusted R-squared, and 0.74 is only marginally below the full model’s 0.75 (not a significant drop), we remove X_5 from the model.

Step 4: Repeat the Process

With the remaining features X_1, X_2, X_3, X_4, repeat the process:

  1. Model without X_1:
    Y = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4
    Assume Adjusted R-squared \bar{R}^2 = 0.71.
  2. Model without X_2:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4
    Assume Adjusted R-squared \bar{R}^2 = 0.72.
  3. Model without X_3:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_4 X_4
    Assume Adjusted R-squared \bar{R}^2 = 0.70.
  4. Model without X_4:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3
    Assume Adjusted R-squared \bar{R}^2 = 0.73.

Removing X_1 leaves the highest Adjusted R-squared (0.73, again only a marginal drop from 0.74), so we remove X_1 next.

Step 5: Final Model

Continue until removing more features causes a significant drop in Adjusted R-squared.

Assume that after one more iteration (removing X_4), the model with the best Adjusted R-squared value is:
Y = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3

Summary of Steps

  1. Train the model with all features and calculate Adjusted R-squared.
  2. Remove one feature at a time, recalculate Adjusted R-squared, and drop the feature whose removal leaves the highest Adjusted R-squared.
  3. Repeat the process with the remaining features until any further removal would cause a significant drop in Adjusted R-squared.
  4. Finalize the model with the features that yield the highest Adjusted R-squared.

Final Model: Y = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3

In this process, the features Temperature (X_2) and Amount of Rainfall (X_3) are selected as the most significant predictors of the amount of nuts collected by squirrels.
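Putting it together: applying the backward_eliminate sketch from earlier to the synthetic squirrel data runs this whole procedure in one call. On my invented data it should typically strip out the three irrelevant features and keep temperature and rainfall, mirroring the example’s conclusion, though the exact outcome depends on the random draw:

```python
selected = backward_eliminate(X, y)
print(selected)  # typically ['X2_temperature', 'X3_rainfall']
```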

