Non-constant variance in linear regression: a duck’s mood swing problem

Let’s simulate some data in Python in which the noise grows with X:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Step 1: Simulate Data with Heteroscedasticity
np.random.seed(42)
n = 100
X = np.linspace(1, 10, n)
# Create heteroscedasticity by increasing variance with X
Y = 3 * X + np.random.normal(scale=X, size=n)

# Convert to DataFrame
df = pd.DataFrame({'X': X, 'Y': Y})

# Step 2: Fit a Regression Model
X_sm = sm.add_constant(df['X'])  # Add constant term for intercept
model = sm.OLS(df['Y'], X_sm).fit()

# Step 3: Detect Heteroscedasticity
# Residual plot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(df['X'], model.resid)
plt.xlabel('X')
plt.ylabel('Residuals')
plt.title('Residual Plot')

So, in the residual plot we can see an increasing trend in the absolute values of the residuals. Now, let’s use a log transform to fix this.

# Step 4: Apply a Fix (Log Transformation)
df['Y_log'] = np.log(df['Y'] - df['Y'].min() + 1)  # Shift Y so its minimum is 1; log needs positive values

# Fit the model again with transformed Y
model_log = sm.OLS(df['Y_log'], X_sm).fit()

# Residual plot for transformed model
plt.subplot(1, 2, 2)
plt.scatter(df['X'], model_log.resid)
plt.xlabel('X')
plt.ylabel('Residuals (Log-Transformed)')
plt.title('Residual Plot (Log-Transformed)')
plt.tight_layout()
plt.show()

Now, we can see that the issue is ameliorated in the residual plot of the transformed model: the spread of the residuals changes less across X.

