
Non-constant variance in linear regression: a duck’s mood swing problem

Let’s simulate some data in Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.compat import lzip

# Step 1: Simulate Data with Heteroscedasticity
np.random.seed(42)
n = 100
X = np.linspace(1, 10, n)
# Create heteroscedasticity by increasing variance with X
Y = 3 * X + np.random.normal(scale=X, size=n)

# Convert to DataFrame
df = pd.DataFrame({'X': X, 'Y': Y})

# Step 2: Fit a Regression Model
X_sm = sm.add_constant(df['X'])  # Add constant term for intercept
model = sm.OLS(df['Y'], X_sm).fit()

# Step 3: Detect Heteroscedasticity
# Residual plot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(df['X'], model.resid)
plt.xlabel('X')
plt.ylabel('Residuals')
plt.title('Residual Plot')

The residual plot shows the spread of the residuals widening as X increases, which is the hallmark of heteroscedasticity. Now, let’s apply a log transform to Y to stabilize the variance.
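
A short derivation may help explain why a log transform is a natural candidate for this particular simulation. It uses the simulation's own definition of Y; the symbol $Z_i$ for the standard normal draw is notation introduced for this sketch, and the shift applied to Y in the code that follows is ignored for simplicity.

```latex
\begin{align*}
Y_i &= 3X_i + X_i Z_i = X_i\,(3 + Z_i), \qquad Z_i \sim \mathcal{N}(0, 1), \\
\operatorname{Var}(Y_i \mid X_i) &= X_i^2, \\
\log Y_i &= \log X_i + \log(3 + Z_i) \quad \text{(when } 3 + Z_i > 0\text{)}.
\end{align*}
```

On the log scale the noise term $\log(3 + Z_i)$ no longer depends on $X_i$, so its variance is constant. The relationship also becomes linear in $\log X_i$ rather than $X_i$, so a straight-line fit against X may leave some curvature in the residuals.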

# Step 4: Apply a Fix (Log Transformation)
df['Y_log'] = np.log(df['Y'] - df['Y'].min() + 1)  # Shift Y so its minimum is 1, keeping the log argument positive

# Fit the model again with transformed Y
model_log = sm.OLS(df['Y_log'], X_sm).fit()

# Residual plot for transformed model
plt.subplot(1, 2, 2)
plt.scatter(df['X'], model_log.resid)
plt.xlabel('X')
plt.ylabel('Residuals (Log-Transformed)')
plt.title('Residual Plot (Log-Transformed)')
plt.tight_layout()
plt.show()

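Now the fan shape is much less pronounced in the transformed residual plot, so the non-constant variance has been reduced. Keep in mind that the model now describes the log of the shifted Y, so its coefficients are interpreted on that scale.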
