
Akaike Information Criterion (AIC)
- Purpose: AIC is used to compare different statistical models and determine which one best explains the data while penalizing for model complexity.
- Calculation: AIC is calculated as AIC = 2k − 2 ln(L), where k is the number of parameters in the model and L is the maximized value of the model's likelihood given the data (a quick numeric check of this formula follows after this list).
- Interpretation: Lower AIC values indicate a better model. It balances model fit and complexity, discouraging overfitting by penalizing additional parameters.
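To make the formula concrete, here is a minimal sketch that computes AIC by hand from a fitted OLS model's log-likelihood and checks it against the value statsmodels reports. The toy data, seed, and variable names here are placeholders for illustration only, not part of the housing example used later.

import numpy as np
import statsmodels.api as sm

# Toy data (illustrative only): a simple linear relationship with noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y_toy = 1.5 * x + rng.normal(size=100)

X_toy = sm.add_constant(x)               # two parameters: intercept and slope
toy_model = sm.OLS(y_toy, X_toy).fit()

k = X_toy.shape[1]                       # statsmodels counts the regression coefficients as the parameters
aic_manual = 2 * k - 2 * toy_model.llf   # AIC = 2k - 2*ln(L)
print(aic_manual, toy_model.aic)         # the two values agree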
Bayesian Information Criterion (BIC)

- Purpose: BIC also aims to select the best model among a set of candidates, with a stronger penalty for complexity compared to AIC.
- Calculation: BIC is calculated as BIC = k ln(n) − 2 ln(L), where n is the number of observations, k is the number of parameters, and L is the maximized likelihood of the model (a numeric check follows after this list).
- Interpretation: Lower BIC values indicate a better model. BIC penalizes models with more parameters more heavily than AIC, making it more likely to favor simpler models.
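The BIC formula can be checked the same way. This sketch reuses the toy setup from the AIC example above; again, the data and names are placeholders rather than part of the housing example.

import numpy as np
import statsmodels.api as sm

# Same toy data as in the AIC sketch (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y_toy = 1.5 * x + rng.normal(size=100)

X_toy = sm.add_constant(x)
toy_model = sm.OLS(y_toy, X_toy).fit()

n, k = X_toy.shape                                # n observations, k estimated parameters
bic_manual = k * np.log(n) - 2 * toy_model.llf    # BIC = k*ln(n) - 2*ln(L)
print(bic_manual, toy_model.bic)                  # the two values agree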
Comparison and Use in Feature Selection


- AIC vs. BIC: Both criteria are used to select among competing models by evaluating their goodness of fit while penalizing complexity. BIC tends to select simpler models than AIC because the penalty for additional parameters grows with the number of observations.
- Feature Selection: In feature selection, AIC and BIC help identify the most relevant features by comparing models with different subsets of features. The goal is to find a balance between model accuracy and simplicity, avoiding overfitting and underfitting.
By applying AIC and BIC in feature selection, we can make informed decisions about which features to include in our models, enhancing both performance and interpretability.
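To see in numbers why BIC leans toward simpler models, compare the penalty each criterion charges for one extra parameter: AIC always adds 2, while BIC adds ln(n), which exceeds 2 once n is larger than about 7. A quick sketch (the sample sizes are arbitrary, except 20,640, which is the size of the California Housing dataset used below):

import numpy as np

# Penalty paid for one additional parameter under each criterion
for n in [10, 100, 1000, 20640]:
    print(f"n = {n:>6}: AIC penalty = 2.00, BIC penalty = ln(n) = {np.log(n):.2f}")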
Code:
We’ll use the California Housing dataset for the Python examples and the built-in mtcars dataset for R.
Python Implementation of BIC
We start with BIC; the AIC version is nearly identical, and its code is provided afterward.
First, we import the necessary libraries for statistical modeling, generating combinations, handling data, and loading the California Housing dataset.
import statsmodels.api as sm
import itertools
import pandas as pd
from sklearn.datasets import fetch_california_housing
We load the California Housing dataset into X (the features) and y (the target variable, the median house value).
# Load the California Housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name='MedHouseVal')
Next, we add a constant term (intercept) to the feature DataFrame X.
# Add a constant to the model (intercept)
X = sm.add_constant(X)
We define a function named fit_model that fits an ordinary least squares (OLS) regression model using the specified features and returns its Bayesian Information Criterion (BIC) value.
# Function to fit model and get BIC
def fit_model(features):
    model = sm.OLS(y, X[list(features)]).fit()  # Use list(features) to handle tuples
    return model.bic
Then, we generate all possible combinations of features (excluding the constant term), fit the model for each combination, and record the BIC value for each model.
# Generate all possible combinations of features
features = X.columns.tolist()
features.remove('const') # Remove the constant term from the list of features
results = []
for k in range(1, len(features) + 1):
    for combo in itertools.combinations(features, k):
        combo = ('const',) + combo  # Add constant term to the feature set
        bic = fit_model(combo)
        results.append((combo, bic))
We convert the results (combinations of features and their corresponding BIC values) into a DataFrame for easy handling and sorting.
# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=['Features', 'BIC'])
Finally, we sort the DataFrame by BIC values and display the best model based on BIC (the one with the lowest BIC value).
# Display the best model based on BIC
print("Best model based on BIC:")
print(results_df.sort_values(by='BIC').head(1))
Sorting and selecting features: next, we sort the results DataFrame by BIC and extract the feature set of the best model (the one with the lowest BIC):
sorted_df = results_df.sort_values(by='BIC')
best_features = sorted_df.iloc[0, 0]
If you print out the best features, you’ll see:
('const', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'AveOccup', 'Latitude', 'Longitude')
Building and summarizing the model: finally, we refit the OLS model using the selected best features and print a summary that includes coefficients, R-squared values, p-values, and other statistics:
model = sm.OLS(y, X[list(best_features)]).fit()
print(model.summary())
That gives:

Best model based on BIC:
                            OLS Regression Results
==============================================================================
Dep. Variable:            MedHouseVal   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     4538.
Date:                Sun, 24 Nov 2024   Prob (F-statistic):               0.00
Time:                        15:25:22   Log-Likelihood:                -22624.
No. Observations:               20640   AIC:                         4.526e+04
Df Residuals:                   20632   BIC:                         4.533e+04
Df Model:                           7
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -36.9175      0.658    -56.085      0.000     -38.208     -35.627
MedInc         0.4368      0.004    104.089      0.000       0.429       0.445
HouseAge       0.0096      0.000     22.602      0.000       0.009       0.010
AveRooms      -0.1071      0.006    -18.217      0.000      -0.119      -0.096
AveBedrms      0.6449      0.028     22.922      0.000       0.590       0.700
AveOccup      -0.0038      0.000     -7.861      0.000      -0.005      -0.003
Latitude      -0.4207      0.007    -58.763      0.000      -0.435      -0.407
Longitude     -0.4340      0.008    -57.782      0.000      -0.449      -0.419
==============================================================================
Omnibus:                     4406.193   Durbin-Watson:                   0.885
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            14155.786
Skew:                           1.084   Prob(JB):                         0.00
Kurtosis:                       6.429   Cond. No.                     1.68e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.68e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Combined code:
import statsmodels.api as sm
import itertools
import pandas as pd
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name='MedHouseVal')
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Function to fit model and get BIC
def fit_model(features):
    model = sm.OLS(y, X[list(features)]).fit()  # Use list(features) to handle tuples
    return model.bic
# Generate all possible combinations of features
features = X.columns.tolist()
features.remove('const') # Remove the constant term from the list of features
results = []
for k in range(1, len(features) + 1):
    for combo in itertools.combinations(features, k):
        combo = ('const',) + combo  # Add constant term to the feature set
        bic = fit_model(combo)
        results.append((combo, bic))
# Convert results to DataFrame and sort the models based on BIC
results_df = pd.DataFrame(results, columns=['Features', 'BIC'])
sorted_df = results_df.sort_values(by='BIC')
# Display the best model based on BIC
print("Best model based on BIC:")
print(sorted_df.head(1))
# The set of features that make up the best model according to BIC
best_features = sorted_df.iloc[0, 0]
# Fit the model using the best features
model = sm.OLS(y, X[list(best_features)]).fit()
# Display the summary of the model
print(model.summary())
The code for AIC is similar:
import statsmodels.api as sm
import itertools
import pandas as pd
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name='MedHouseVal')
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Function to fit model and get AIC
def fit_model(features):
    model = sm.OLS(y, X[list(features)]).fit()  # Use list(features) to handle tuples
    return model.aic
# Generate all possible combinations of features
features = X.columns.tolist()
features.remove('const') # Remove the constant term from the list of features
results = []
for k in range(1, len(features) + 1):
    for combo in itertools.combinations(features, k):
        combo = ('const',) + combo  # Add constant term to the feature set
        aic = fit_model(combo)
        results.append((combo, aic))
# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=['Features', 'AIC'])
# Sort the DataFrame by AIC and get the best features
sorted_df = results_df.sort_values(by='AIC')
best_features = sorted_df.iloc[0, 0]
# Fit the model using the best features
model = sm.OLS(y, X[list(best_features)]).fit()
# Display the summary of the model
print(model.summary())
R Implementation
First, we check whether the MASS package is installed; if not, we install it, then load it into the R session.
# Load necessary package
if (!require("MASS")) install.packages("MASS", dependencies = TRUE)
library(MASS)
We load the mtcars dataset, which is included in the datasets package in R.
# Load the mtcars dataset
data(mtcars)
Next, we set up the full model (including all predictors) and the null model (including only the intercept).
# Automated stepwise model selection based on BIC
full_model <- lm(mpg ~ ., data = mtcars) # Full model with all predictors
null_model <- lm(mpg ~ 1, data = mtcars) # Null model with intercept only
We use stepwise model selection based on BIC to choose the best model. The step function adds or removes predictors to minimize the criterion; the scope argument defines the range of models considered, and k = log(nrow(mtcars)) sets the per-parameter penalty to the log of the number of observations (the BIC penalty).
selected_model <- step(null_model,
                       scope = list(lower = null_model, upper = full_model),
                       direction = "both",
                       k = log(nrow(mtcars)))  # k = log(n) for BIC
Finally, we print the summary of the selected model, which shows the details of the model chosen by the stepwise selection process.
# Summary of the selected model
print("Summary of the Selected Model:")
summary(selected_model)
Results:
Call:
lm(formula = mpg ~ wt + cyl, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-4.2893 -1.5512 -0.4684  1.5743  6.1004

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
wt           -3.1910     0.7569  -4.216 0.000222 ***
cyl          -1.5078     0.4147  -3.636 0.001064 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,    Adjusted R-squared:  0.8185
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12
Combined code:
# Load necessary package
if (!require("MASS")) install.packages("MASS", dependencies = TRUE)
library(MASS)
# Load the mtcars dataset
data(mtcars)
# Automated stepwise model selection based on BIC
full_model <- lm(mpg ~ ., data = mtcars) # Full model with all predictors
null_model <- lm(mpg ~ 1, data = mtcars) # Null model with intercept only
selected_model <- step(null_model,
                       scope = list(lower = null_model, upper = full_model),
                       direction = "both",
                       k = log(nrow(mtcars)))  # k = log(n) for BIC
# Summary of the selected model
print("Summary of the Selected Model:")
summary(selected_model)
AIC
For AIC, we can proceed similarly; step() uses AIC by default (k = 2), so we simply omit the k argument:
# Load necessary package
if (!require("MASS")) install.packages("MASS", dependencies = TRUE)
library(MASS)
# Load the mtcars dataset
data(mtcars)
# Automated stepwise model selection based on AIC
full_model <- lm(mpg ~ ., data = mtcars) # Full model with all predictors
null_model <- lm(mpg ~ 1, data = mtcars) # Null model with intercept only
selected_model <- step(null_model,
                       scope = list(lower = null_model, upper = full_model),
                       direction = "both")  # Default uses AIC for selection
# Summary of the selected model
print("Summary of the Selected Model:")
summary(selected_model)