Backward feature selection + example

Backward feature selection starts with the full model containing all features and iteratively removes the feature whose removal leaves the Adjusted R-squared highest, stopping once any further removal would cause a significant drop in the score.
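Before the worked example, here is a minimal sketch of this loop in Python, using statsmodels to fit each regression. The function name backward_eliminate, the tol threshold for what counts as a "significant" drop, and the exact stopping rule are my own illustrative choices, not a standard API:

```python
import statsmodels.api as sm

def backward_eliminate(X, y, tol=0.02):
    """Greedy backward feature selection scored by adjusted R-squared.

    X: pandas DataFrame of candidate features; y: target values.
    tol: how big a drop in adjusted R-squared counts as "significant"
         (an illustrative choice, not a standard value).
    Returns the list of surviving feature names.
    """
    features = list(X.columns)
    current = sm.OLS(y, sm.add_constant(X[features])).fit().rsquared_adj
    while len(features) > 1:
        # Adjusted R-squared of every model with exactly one feature dropped.
        scores = {}
        for f in features:
            kept = [g for g in features if g != f]
            scores[f] = sm.OLS(y, sm.add_constant(X[kept])).fit().rsquared_adj
        candidate = max(scores, key=scores.get)  # least harmful removal
        if scores[candidate] < current - tol:
            break  # every removal now causes a significant drop: stop
        features.remove(candidate)
        current = scores[candidate]
    return features
```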

Example: amount of nuts collected by squirrels

Let’s take a practical example of backward feature selection in a forest environment.

Goal: Predict the number of nuts collected by squirrels (Y) based on several environmental features such as:

  • X_1: Number of trees in the area
  • X_2: Temperature in the forest
  • X_3: Amount of rainfall
  • X_4: Distance to the nearest water source
  • X_5: Number of competing squirrels

The initial linear regression model would be:
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon
Where:

  • Y is the number of nuts collected by squirrels,
  • X_1, X_2, X_3, X_4, X_5 are the features,
  • \beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5 are the regression coefficients,
  • \epsilon is the error term.
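To make the setup concrete in code, the snippet below fabricates a synthetic data set for these five features (every distribution, coefficient, and the noise level is invented purely for illustration) and fits the full model with statsmodels. The same X and y are reused in the later snippets:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200  # hypothetical number of observed foraging plots

X = pd.DataFrame({
    "X1_trees":       rng.poisson(30, n),       # number of trees in the area
    "X2_temperature": rng.normal(15, 5, n),     # forest temperature (deg C)
    "X3_rainfall":    rng.gamma(2.0, 10.0, n),  # rainfall (mm)
    "X4_water_dist":  rng.uniform(0, 500, n),   # distance to water (m)
    "X5_competitors": rng.poisson(10, n),       # competing squirrels
})

# Invented "ground truth": only temperature and rainfall actually matter.
y = 5 + 0.8 * X["X2_temperature"] + 0.5 * X["X3_rainfall"] + rng.normal(0, 4, n)

full_fit = sm.OLS(y, sm.add_constant(X)).fit()
print(f"full model adjusted R^2: {full_fit.rsquared_adj:.2f}")
```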

Step-by-Step Process of Backward Feature Selection

Step 1: Train the Model with All Features
The initial model is trained using all the features:

Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5

Recall that adjusted R-squared adjusts the R-squared value to account for the number of predictors in the model, making it more suitable for comparing models with different numbers of features. The Adjusted R-squared is given by:
\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
Where:

  • R^2 is the R-squared value of the model,
  • n is the number of observations,
  • p is the number of predictors (features).
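As a quick numeric sanity check of this formula (the R-squared value, n, and p below are invented for illustration):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Penalise R-squared for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Invented numbers: R^2 = 0.78 from n = 100 observations and p = 5 predictors.
print(adjusted_r2(0.78, 100, 5))  # ~0.768, slightly below the raw R^2
```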

Calculate the Adjusted R-squared for this model. Assume:
\bar{R}^2_{\text{full}} = 0.75

Step 2: Remove One Feature and Evaluate

Remove one feature at a time and recalculate the Adjusted R-squared for each reduced model; a code sketch of this loop follows the list below.

  1. Model without X_1:
    Y = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5
    Assume the Adjusted R-squared \bar{R}^2 = 0.72.
  2. Model without X_2:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5
    Assume the Adjusted R-squared \bar{R}^2 = 0.73.
  3. Model without X_3:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5
    Assume the Adjusted R-squared \bar{R}^2 = 0.60.
  4. Model without X_4:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_5 X_5
    Assume the Adjusted R-squared \bar{R}^2 = 0.71.
  5. Model without X_5:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4
    Assume the Adjusted R-squared \bar{R}^2 = 0.74.
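In code, this leave-one-out-at-a-time step is a single loop (continuing with the hypothetical X and y from the earlier snippet):

```python
import statsmodels.api as sm

# X and y are the synthetic squirrel data from the earlier snippet.
for dropped in X.columns:
    reduced = X.drop(columns=dropped)
    fit = sm.OLS(y, sm.add_constant(reduced)).fit()
    print(f"without {dropped}: adjusted R^2 = {fit.rsquared_adj:.2f}")
```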

Step 3: Choose the Feature to Remove

Compare the Adjusted R-squared values:

  • Removing X_1: \bar{R}^2 = 0.72
  • Removing X_2: \bar{R}^2 = 0.73
  • Removing X_3: \bar{R}^2 = 0.60 (significant decrease)
  • Removing X_4: \bar{R}^2 = 0.71
  • Removing X_5: \bar{R}^2 = 0.74 (highest Adjusted R-squared)

Since removing X_5 leaves the highest Adjusted R-squared, and 0.74 is only marginally below the full model’s 0.75 (not a significant drop), we remove X_5 from the model.

Step 4: Repeat the Process

With the remaining features X_1, X_2, X_3, X_4, repeat the process:

  1. Model without X_1:
    Y = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4
    Assume Adjusted R-squared \bar{R}^2 = 0.71.
  2. Model without X_2:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4
    Assume Adjusted R-squared \bar{R}^2 = 0.72.
  3. Model without X_3:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_4 X_4
    Assume Adjusted R-squared \bar{R}^2 = 0.70.
  4. Model without X_4:
    Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3
    Assume Adjusted R-squared \bar{R}^2 = 0.73.

Removing X_1 leaves the highest Adjusted R-squared (0.73, again only a marginal drop from 0.74), so we remove X_1 next.

Step 5: Final Model

Continue until removing more features causes a significant drop in Adjusted R-squared.

Assume that after one more iteration (removing X_4), the model with the best Adjusted R-squared value is:
Y = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3

Summary of Steps

  1. Train the model with all features and calculate Adjusted R-squared.
  2. Remove one feature at a time, recalculate Adjusted R-squared, and drop the feature whose removal leaves the highest Adjusted R-squared.
  3. Repeat the process with the remaining features until any further removal would cause a significant drop in Adjusted R-squared.
  4. Finalize the model with the features that yield the highest Adjusted R-squared.

Final Model: Y = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3

In this process, the features Temperature (X_2) and Amount of Rainfall (X_3) are selected as the most significant predictors of the amount of nuts collected by squirrels.
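Putting it together: applying the backward_eliminate sketch from earlier to the synthetic squirrel data runs this whole procedure in one call. On my invented data it should typically strip out the three irrelevant features and keep temperature and rainfall, mirroring the example’s conclusion, though the exact outcome depends on the random draw:

```python
selected = backward_eliminate(X, y)
print(selected)  # typically ['X2_temperature', 'X3_rainfall']
```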

