
Akaike Information Criterion (AIC)
- Purpose: AIC is used to compare different statistical models and determine which one best explains the data while penalizing for model complexity.
- Calculation: AIC is calculated as AIC = 2k − 2 ln(L), where k is the number of parameters in the model and L is the maximized value of the model's likelihood given the data (a quick numeric check of this formula follows after this list).
- Interpretation: Lower AIC values indicate a better model. It balances model fit and complexity, discouraging overfitting by penalizing additional parameters.
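To make the formula concrete, here is a minimal sketch that computes AIC by hand from a fitted OLS model's log-likelihood and checks it against the value statsmodels reports. The toy data, seed, and variable names here are placeholders for illustration only, not part of the housing example used later.

import numpy as np
import statsmodels.api as sm

# Toy data (illustrative only): a simple linear relationship with noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y_toy = 1.5 * x + rng.normal(size=100)

X_toy = sm.add_constant(x)               # two parameters: intercept and slope
toy_model = sm.OLS(y_toy, X_toy).fit()

k = X_toy.shape[1]                       # statsmodels counts the regression coefficients as the parameters
aic_manual = 2 * k - 2 * toy_model.llf   # AIC = 2k - 2*ln(L)
print(aic_manual, toy_model.aic)         # the two values agree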
Bayesian Information Criterion (BIC)

- Purpose: BIC also aims to select the best model among a set of candidates, with a stronger penalty for complexity compared to AIC.
- Calculation: BIC is calculated as BIC = k ln(n) − 2 ln(L), where n is the number of observations, k is the number of parameters, and L is the maximized likelihood of the model (a numeric check follows after this list).
- Interpretation: Lower BIC values indicate a better model. BIC penalizes models with more parameters more heavily than AIC, making it more likely to favor simpler models.
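The BIC formula can be checked the same way. This sketch reuses the toy setup from the AIC example above; again, the data and names are placeholders rather than part of the housing example.

import numpy as np
import statsmodels.api as sm

# Same toy data as in the AIC sketch (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y_toy = 1.5 * x + rng.normal(size=100)

X_toy = sm.add_constant(x)
toy_model = sm.OLS(y_toy, X_toy).fit()

n, k = X_toy.shape                                # n observations, k estimated parameters
bic_manual = k * np.log(n) - 2 * toy_model.llf    # BIC = k*ln(n) - 2*ln(L)
print(bic_manual, toy_model.bic)                  # the two values agree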
Comparison and Use in Feature Selection


- AIC vs. BIC: Both criteria are used to select among competing models by evaluating their goodness of fit while penalizing complexity. BIC tends to select simpler models than AIC because the penalty for additional parameters grows with the number of observations.
- Feature Selection: In feature selection, AIC and BIC help identify the most relevant features by comparing models with different subsets of features. The goal is to find a balance between model accuracy and simplicity, avoiding overfitting and underfitting.
By applying AIC and BIC in feature selection, we can make informed decisions about which features to include in our models, enhancing both performance and interpretability.
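To see in numbers why BIC leans toward simpler models, compare the penalty each criterion charges for one extra parameter: AIC always adds 2, while BIC adds ln(n), which exceeds 2 once n is larger than about 7. A quick sketch (the sample sizes are arbitrary, except 20,640, which is the size of the California Housing dataset used below):

import numpy as np

# Penalty paid for one additional parameter under each criterion
for n in [10, 100, 1000, 20640]:
    print(f"n = {n:>6}: AIC penalty = 2.00, BIC penalty = ln(n) = {np.log(n):.2f}")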
Code:
We’ll use the California Housing dataset for the Python examples and the built-in mtcars dataset for R.
Python Implementation of BIC
We start with BIC; the AIC version is nearly identical, and its code is provided afterward.
First, we import the necessary libraries for statistical modeling, generating combinations, handling data, and loading the California Housing dataset.
import statsmodels.api as sm
import itertools
import pandas as pd
from sklearn.datasets import fetch_california_housing
We load the California Housing dataset into X (the features) and y (the target variable, the median house value).
# Load the California Housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name='MedHouseVal')
Next, we add a constant term (intercept) to the feature DataFrame X.
# Add a constant to the model (intercept)
X = sm.add_constant(X)
We define a function named fit_model that fits an ordinary least squares (OLS) regression model using the specified features and returns its Bayesian Information Criterion (BIC) value.
# Function to fit model and get BIC
def fit_model(features):
    model = sm.OLS(y, X[list(features)]).fit()  # Use list(features) to handle tuples
    return model.bic
Then, we generate all possible combinations of features (excluding the constant term), fit the model for each combination, and record the BIC value for each model.
# Generate all possible combinations of features
features = X.columns.tolist()
features.remove('const') # Remove the constant term from the list of features
results = []
for k in range(1, len(features) + 1):
    for combo in itertools.combinations(features, k):
        combo = ('const',) + combo  # Add constant term to the feature set
        bic = fit_model(combo)
        results.append((combo, bic))
We convert the results (combinations of features and their corresponding BIC values) into a DataFrame for easy handling and sorting.
# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=['Features', 'BIC'])
Finally, we sort the DataFrame by BIC values and display the best model based on BIC (the one with the lowest BIC value).
# Display the best model based on BIC
print("Best model based on BIC:")
print(results_df.sort_values(by='BIC').head(1))
Sorting and selecting features: next, we sort the results DataFrame by BIC and extract the feature set of the best model (the one with the lowest BIC):
sorted_df = results_df.sort_values(by='BIC')
best_features = sorted_df.iloc[0, 0]
If you print out the best features, you’ll see:
('const', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'AveOccup', 'Latitude', 'Longitude')
Building and summarizing the model: finally, we refit the OLS model using the selected best features and print a summary that includes coefficients, R-squared values, p-values, and other statistics:
model = sm.OLS(y, X[list(best_features)]).fit()
print(model.summary())
That gives:

Best model based on BIC:
                            OLS Regression Results
==============================================================================
Dep. Variable:            MedHouseVal   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     4538.
Date:                Sun, 24 Nov 2024   Prob (F-statistic):               0.00
Time:                        15:25:22   Log-Likelihood:                -22624.
No. Observations:               20640   AIC:                         4.526e+04
Df Residuals:                   20632   BIC:                         4.533e+04
Df Model:                           7
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -36.9175      0.658    -56.085      0.000     -38.208     -35.627
MedInc         0.4368      0.004    104.089      0.000       0.429       0.445
HouseAge       0.0096      0.000     22.602      0.000       0.009       0.010
AveRooms      -0.1071      0.006    -18.217      0.000      -0.119      -0.096
AveBedrms      0.6449      0.028     22.922      0.000       0.590       0.700
AveOccup      -0.0038      0.000     -7.861      0.000      -0.005      -0.003
Latitude      -0.4207      0.007    -58.763      0.000      -0.435      -0.407
Longitude     -0.4340      0.008    -57.782      0.000      -0.449      -0.419
==============================================================================
Omnibus:                     4406.193   Durbin-Watson:                   0.885
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            14155.786
Skew:                           1.084   Prob(JB):                         0.00
Kurtosis:                       6.429   Cond. No.                     1.68e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.68e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Combined code:
import statsmodels.api as sm
import itertools
import pandas as pd
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name='MedHouseVal')
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Function to fit model and get BIC
def fit_model(features):
    model = sm.OLS(y, X[list(features)]).fit()  # Use list(features) to handle tuples
    return model.bic
# Generate all possible combinations of features
features = X.columns.tolist()
features.remove('const') # Remove the constant term from the list of features
results = []
for k in range(1, len(features) + 1):
    for combo in itertools.combinations(features, k):
        combo = ('const',) + combo  # Add constant term to the feature set
        bic = fit_model(combo)
        results.append((combo, bic))
# Convert results to DataFrame and sort the models based on BIC
results_df = pd.DataFrame(results, columns=['Features', 'BIC'])
sorted_df = results_df.sort_values(by='BIC')
# Display the best model based on BIC
print("Best model based on BIC:")
print(sorted_df.head(1))
# The set of features that make up the best model according to BIC
best_features = sorted_df.iloc[0, 0]
# Fit the model using the best features
model = sm.OLS(y, X[list(best_features)]).fit()
# Display the summary of the model
print(model.summary())
The code for AIC is similar:
import statsmodels.api as sm
import itertools
import pandas as pd
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name='MedHouseVal')
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Function to fit model and get AIC
def fit_model(features):
    model = sm.OLS(y, X[list(features)]).fit()  # Use list(features) to handle tuples
    return model.aic
# Generate all possible combinations of features
features = X.columns.tolist()
features.remove('const') # Remove the constant term from the list of features
results = []
for k in range(1, len(features) + 1):
    for combo in itertools.combinations(features, k):
        combo = ('const',) + combo  # Add constant term to the feature set
        aic = fit_model(combo)
        results.append((combo, aic))
# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=['Features', 'AIC'])
# Sort the DataFrame by AIC and get the best features
sorted_df = results_df.sort_values(by='AIC')
best_features = sorted_df.iloc[0, 0]
# Fit the model using the best features
model = sm.OLS(y, X[list(best_features)]).fit()
# Display the summary of the model
print(model.summary())
R Implementation
First, we check whether the MASS package is installed; if not, we install it, then load it into the R session.
# Load necessary package
if (!require("MASS")) install.packages("MASS", dependencies = TRUE)
library(MASS)
We load the mtcars dataset, which is included in the datasets package in R.
# Load the mtcars dataset
data(mtcars)
Next, we set up the full model (including all predictors) and the null model (including only the intercept).
# Automated stepwise model selection based on BIC
full_model <- lm(mpg ~ ., data = mtcars) # Full model with all predictors
null_model <- lm(mpg ~ 1, data = mtcars) # Null model with intercept only
We use stepwise model selection based on BIC to choose the best model. The step function adds or removes predictors to minimize the criterion; the scope argument defines the range of models considered, and k = log(nrow(mtcars)) sets the per-parameter penalty to the log of the number of observations (the BIC penalty).
selected_model <- step(null_model,
                       scope = list(lower = null_model, upper = full_model),
                       direction = "both",
                       k = log(nrow(mtcars)))  # k = log(n) for BIC
Finally, we print the summary of the selected model, which shows the details of the model chosen by the stepwise selection process.
# Summary of the selected model
print("Summary of the Selected Model:")
summary(selected_model)
Results:
Call:
lm(formula = mpg ~ wt + cyl, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-4.2893 -1.5512 -0.4684  1.5743  6.1004

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
wt           -3.1910     0.7569  -4.216 0.000222 ***
cyl          -1.5078     0.4147  -3.636 0.001064 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,    Adjusted R-squared:  0.8185
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12
Combined code:
# Load necessary package
if (!require("MASS")) install.packages("MASS", dependencies = TRUE)
library(MASS)
# Load the mtcars dataset
data(mtcars)
# Automated stepwise model selection based on BIC
full_model <- lm(mpg ~ ., data = mtcars) # Full model with all predictors
null_model <- lm(mpg ~ 1, data = mtcars) # Null model with intercept only
selected_model <- step(null_model,
                       scope = list(lower = null_model, upper = full_model),
                       direction = "both",
                       k = log(nrow(mtcars)))  # k = log(n) for BIC
# Summary of the selected model
print("Summary of the Selected Model:")
summary(selected_model)
AIC
For AIC, we can proceed similarly; step() uses AIC by default (k = 2), so we simply omit the k argument:
# Load necessary package
if (!require("MASS")) install.packages("MASS", dependencies = TRUE)
library(MASS)
# Load the mtcars dataset
data(mtcars)
# Automated stepwise model selection based on AIC
full_model <- lm(mpg ~ ., data = mtcars) # Full model with all predictors
null_model <- lm(mpg ~ 1, data = mtcars) # Null model with intercept only
selected_model <- step(null_model,
                       scope = list(lower = null_model, upper = full_model),
                       direction = "both")  # Default uses AIC for selection
# Summary of the selected model
print("Summary of the Selected Model:")
summary(selected_model)