Backward feature selection starts with the full model containing all features and iteratively removes the feature whose removal yields the highest adjusted R-squared, stopping when no removal improves the score.

Another example: amount of nuts collected by squirrels
Let’s take a practical example of backward feature selection in a forest environment, where the task is to predict the amount of nuts collected by squirrels based on several environmental features.
Goal: Predict the number of nuts collected by squirrels ($Y$) based on several environmental features such as:
$X_1$: Number of trees in the area
$X_2$: Temperature in the forest
$X_3$: Amount of rainfall
$X_4$: Distance to the nearest water source
$X_5$: Number of competing squirrels
The initial linear regression model would be:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon$$
Where:
$Y$ is the number of nuts collected by squirrels,
$X_1, \dots, X_5$ are the features,
$\beta_0, \dots, \beta_5$ are the regression coefficients,
$\epsilon$ is the error term.
Step-by-Step Process of Backward Feature Selection
Step 1: Train the Model with All Features
The initial model is trained using all 5 features:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon$$
Recall that adjusted R-squared adjusts the R-squared value to account for the number of predictors in the model, making it more suitable for comparing models with different numbers of features. The adjusted R-squared is given by:
$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
Where:
$R^2$ is the R-squared value of the model,
$n$ is the number of observations,
$p$ is the number of predictors (features).
Calculate the adjusted R-squared for this model. Assume it comes out to $\bar{R}^2 = 0.85$.
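Step 1 can be sketched in Python. The data, coefficients, and noise level below are illustrative assumptions (a synthetic stand-in for the squirrel dataset), not values from the example:

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS via least squares and return the adjusted R-squared."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Assumed synthetic forest data: columns are trees, temperature, rainfall,
# distance to water, and competing squirrels.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
# Assumed ground truth: only temperature (X2) and rainfall (X3) matter.
y = 3.0 + 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

full_score = adjusted_r2(X, y)
print(f"Adjusted R^2 of the full model: {full_score:.3f}")
```

The helper implements the adjusted R-squared formula above directly; any OLS routine that reports residuals would work in its place.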
Step 2: Remove One Feature and Evaluate
Remove one feature at a time and recalculate the adjusted R-squared for each candidate model.
- Model without $X_1$ (number of trees): assume the adjusted R-squared is 0.84.
- Model without $X_2$ (temperature): assume the adjusted R-squared is 0.80.
- Model without $X_3$ (rainfall): assume the adjusted R-squared is 0.72.
- Model without $X_4$ (distance to water): assume the adjusted R-squared is 0.84.
- Model without $X_5$ (competing squirrels): assume the adjusted R-squared is 0.86.
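The leave-one-out comparison of Step 2 can be sketched as a loop. Everything here is an assumed setup for illustration (synthetic data in which only temperature and rainfall truly drive the response):

```python
import numpy as np

def adj_r2(X, y):
    # OLS fit, then adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Assumed synthetic data: only temperature (col 1) and rainfall (col 2) matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 + 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

names = ["trees", "temperature", "rainfall", "water_dist", "competitors"]
# Score each candidate model obtained by dropping one feature.
scores = {name: adj_r2(np.delete(X, j, axis=1), y)
          for j, name in enumerate(names)}
for name, s in scores.items():
    print(f"without {name:12s}: adjusted R^2 = {s:.3f}")
```

Dropping a feature the response truly depends on (temperature or rainfall) produces a sharp drop in adjusted R-squared, while dropping an irrelevant one barely changes it.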
Step 3: Choose the Feature to Remove
Compare the adjusted R-squared values:
- Removing $X_1$: 0.84
- Removing $X_2$: 0.80
- Removing $X_3$: 0.72 (significant decrease)
- Removing $X_4$: 0.84
- Removing $X_5$: 0.86 (highest adjusted R-squared)
Since removing $X_5$ results in the highest adjusted R-squared, we remove $X_5$ from the model.
Step 4: Repeat the Process
With the remaining features $X_1, X_2, X_3, X_4$, repeat the process:
- Model without $X_1$: assume the adjusted R-squared is 0.85.
- Model without $X_2$: assume the adjusted R-squared is 0.81.
- Model without $X_3$: assume the adjusted R-squared is 0.74.
- Model without $X_4$: assume the adjusted R-squared is 0.87.
Removing $X_4$ results in the highest adjusted R-squared, so we remove $X_4$ next.
Step 5: Final Model
Continue until removing more features causes a significant drop in adjusted R-squared.
Assuming that, after several iterations, the final model with the best adjusted R-squared value is:
$$Y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$$
Summary of Steps
- Train the model with all features and calculate Adjusted R-squared.
- Remove one feature at a time, recalculate the adjusted R-squared for each candidate model, and drop the feature whose removal yields the highest adjusted R-squared.
- Repeat the process with remaining features until no further improvement is possible.
- Finalize the model with the features that yield the highest Adjusted R-squared.
Final Model:
$$Y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$$
In this process, the features Temperature ($X_2$) and Amount of Rainfall ($X_3$) are selected as the most significant predictors of the amount of nuts collected by squirrels.
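The whole procedure can be sketched end to end. This is a minimal sketch on assumed synthetic data; the function names, feature labels, and coefficients are illustrative, not part of the original example:

```python
import numpy as np

def adjusted_r2(X, y):
    """OLS fit via least squares, then adjusted R-squared."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def backward_select(X, y, names):
    """Drop one feature per round while adjusted R-squared keeps improving."""
    kept = list(range(X.shape[1]))
    best = adjusted_r2(X[:, kept], y)
    while len(kept) > 1:
        # Score every candidate model with one remaining feature removed.
        trials = {j: adjusted_r2(X[:, [k for k in kept if k != j]], y)
                  for j in kept}
        j, score = max(trials.items(), key=lambda kv: kv[1])
        if score <= best:          # no removal improves the model: stop
            break
        kept.remove(j)
        best = score
    return [names[k] for k in kept], best

# Assumed synthetic forest data, mirroring the worked example.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3.0 + 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=300)
names = ["trees", "temperature", "rainfall", "water_dist", "competitors"]

selected, score = backward_select(X, y, names)
print(f"selected: {selected}, adjusted R^2 = {score:.3f}")
```

Because adjusted R-squared penalizes extra predictors, the loop stops on its own once every remaining feature pulls its weight; no separate significance threshold is needed for this stopping rule.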