Multiple regression analysis: waiting time to log in to Windows

Multiple regression analysis can be used to understand the relationship between the waiting time to log in to Windows (dependent variable) and several independent variables. Let’s assume we have the following independent variables:

  1. Number of startup applications: The number of applications that start automatically when Windows boots up.
  2. System RAM (in GB): The amount of RAM installed in the system.
  3. Processor speed (in GHz): The speed of the system’s processor.
  4. Disk speed (in MB/s): The speed of the system’s hard drive or SSD.

Suppose that we have a toy dataset like this:

Waiting Time (s) | Startup Applications | RAM (GB) | Processor Speed (GHz) | Disk Speed (MB/s)
45               | 10                   | 8        | 2.5                   | 150
30               | 5                    | 16       | 3.0                   | 500
60               | 15                   | 4        | 2.0                   | 100
25               | 3                    | 8        | 3.5                   | 250
40               | 8                    | 8        | 3.0                   | 200

Multiple Regression Analysis

The general form of the multiple regression equation is:

\text{Waiting Time} = \beta_0 + \beta_1 \times \text{Startup Applications} + \beta_2 \times \text{RAM} + \beta_3 \times \text{Processor Speed} + \beta_4 \times \text{Disk Speed} + \epsilon

Where:

  • \beta_0 is the intercept,
  • \beta_1, \beta_2, \beta_3, and \beta_4 are the coefficients of the independent variables,
  • \epsilon is the error term.
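
To make the equation concrete, the following minimal sketch evaluates its deterministic part for a single machine. The function name `predict_waiting_time` and the coefficient values are illustrative assumptions, not estimates from this article, and the error term \epsilon is left out.

```python
def predict_waiting_time(beta, startup_apps, ram_gb, cpu_ghz, disk_mbs):
    """Evaluate beta_0 + beta_1*apps + beta_2*RAM + beta_3*CPU + beta_4*disk."""
    b0, b1, b2, b3, b4 = beta
    return b0 + b1 * startup_apps + b2 * ram_gb + b3 * cpu_ghz + b4 * disk_mbs


# Hypothetical coefficients chosen only for illustration
example_beta = (20.0, 2.0, -0.5, -3.0, -0.01)

prediction = predict_waiting_time(
    example_beta, startup_apps=10, ram_gb=8, cpu_ghz=2.5, disk_mbs=150
)
print(prediction)
```

Each coefficient multiplies its own variable, so changing one input by one unit while holding the others fixed changes the prediction by exactly that coefficient.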

  1. Matrix Form Representation:

The multiple regression model can be represented in matrix form as:

\mathbf{Y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon}

Where:

  • \mathbf{Y} is the vector of the dependent variable (Waiting Time).
  • \mathbf{X} is the matrix of independent variables, with a leading column of ones for the intercept term.
  • \mathbf{\beta} is the vector of coefficients.
  • \mathbf{\epsilon} is the vector of errors.

Given the dataset:

\mathbf{Y} = \begin{pmatrix}45 \\30 \\60 \\25 \\40\end{pmatrix},
\mathbf{X} = \begin{pmatrix}1 & 10 & 8 & 2.5 & 150 \\1 & 5 & 16 & 3.0 &500\\1 & 15 & 4 & 2.0 & 100 \\1 & 3 & 8 & 3.5 & 250 \\1 & 8 & 8 & 3.0 & 200 \end{pmatrix}
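
As a sketch of how \mathbf{X} is assembled, the snippet below prepends a column of ones to the four predictor columns using NumPy. The variable names are assumptions made for this example.

```python
import numpy as np

# Predictor columns from the toy dataset
startup_apps = np.array([10, 5, 15, 3, 8])
ram_gb = np.array([8, 16, 4, 8, 8])
cpu_ghz = np.array([2.5, 3.0, 2.0, 3.5, 3.0])
disk_mbs = np.array([150, 500, 100, 250, 200])

# Column of ones for the intercept, followed by the predictors
ones = np.ones(len(startup_apps))
X = np.column_stack([ones, startup_apps, ram_gb, cpu_ghz, disk_mbs])

print(X.shape)  # (5, 5): five observations, five coefficients
```

Because \mathbf{X} is square here, there are as many unknown coefficients as observations, which matters when interpreting the fit.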

  2. Computing the Coefficients:

The ordinary least squares coefficients can be computed using the Normal Equation, provided \mathbf{X}^T \mathbf{X} is invertible:

\mathbf{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}
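
For readers who want the intermediate step, the Normal Equation follows from minimizing the sum of squared errors. A brief derivation, assuming \mathbf{X}^T \mathbf{X} is invertible and writing S for the sum of squared errors (a symbol introduced here for convenience):

```latex
\begin{aligned}
S(\mathbf{\beta}) &= (\mathbf{Y} - \mathbf{X}\mathbf{\beta})^T (\mathbf{Y} - \mathbf{X}\mathbf{\beta}) \\
\frac{\partial S}{\partial \mathbf{\beta}} &= -2\,\mathbf{X}^T (\mathbf{Y} - \mathbf{X}\mathbf{\beta}) = \mathbf{0} \\
\mathbf{X}^T \mathbf{X}\,\mathbf{\beta} &= \mathbf{X}^T \mathbf{Y} \\
\mathbf{\beta} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}
\end{aligned}
```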

Let’s compute this using Python:

import numpy as np

# Define the dependent variable vector Y
Y = np.array([45, 30, 60, 25, 40])
# Define the independent variable matrix X (the first column of ones is the intercept)
X = np.array([
    [1, 10, 8, 2.5, 150],
    [1, 5, 16, 3.0, 500],
    [1, 15, 4, 2.0, 100],
    [1, 3, 8, 3.5, 250],
    [1, 8, 8, 3.0, 200]
])
# Compute the coefficients using the Normal Equation
beta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(beta)

The computed coefficients \mathbf{\beta} are:

\mathbf{\beta} = \begin{pmatrix} 3.33333333 \\ 3.33333333 \\ 1.74527059 \times 10^{-13} \\ 3.33333333 \\ -5.86336535 \times 10^{-16} \end{pmatrix}

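Interpreting the coefficients:

  • Intercept (\beta_0): 3.33333333
  • Startup Applications (\beta_1): 3.33333333 (about 3.33 more seconds of waiting per additional startup application)
  • RAM (\beta_2): 1.74527059 \times 10^{-13} (zero up to floating-point rounding error)
  • Processor Speed (\beta_3): 3.33333333 (about 3.33 more seconds of waiting per additional GHz)
  • Disk Speed (\beta_4): -5.86336535 \times 10^{-16} (zero up to floating-point rounding error)

In this toy dataset, the fitted waiting time depends only on the number of startup applications and the processor speed, while the amount of RAM and the disk speed receive coefficients of essentially zero. This should not be read as statistical significance: the model has five coefficients and only five observations, so it reproduces the data exactly and leaves no residual degrees of freedom for standard errors or p-values. The positive processor speed coefficient (a faster processor going with a longer wait) is likewise an artifact of the toy data rather than a realistic effect.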
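
One way to check this interpretation is to refit with a least-squares solver and inspect the residuals. The sketch below uses `np.linalg.lstsq`, which avoids forming an explicit inverse. The variable names are assumptions for this example.

```python
import numpy as np

Y = np.array([45, 30, 60, 25, 40], dtype=float)
X = np.array([
    [1, 10, 8, 2.5, 150],
    [1, 5, 16, 3.0, 500],
    [1, 15, 4, 2.0, 100],
    [1, 3, 8, 3.5, 250],
    [1, 8, 8, 3.0, 200],
])

# Least-squares solution without inverting X^T X explicitly
beta, _, rank, _ = np.linalg.lstsq(X, Y, rcond=None)

residuals = Y - X @ beta
n_obs, n_params = X.shape
residual_df = n_obs - n_params  # degrees of freedom left for estimating the error variance

print(np.round(beta, 6))
print("largest absolute residual:", np.abs(residuals).max())
print("residual degrees of freedom:", residual_df)
```

With five observations and five coefficients the residual degrees of freedom are zero, so the residuals vanish and the data carry no information about the error variance.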

Python Example using Statsmodels

Here’s an example of how to perform this regression analysis in Python:

import pandas as pd
import statsmodels.api as sm
# Sample data
data = {
    'Waiting Time': [45, 30, 60, 25, 40],
    'Startup Applications': [10, 5, 15, 3, 8],
    'RAM': [8, 16, 4, 8, 8],
    'Processor Speed': [2.5, 3.0, 2.0, 3.5, 3.0],
    'Disk Speed': [150, 500, 100, 250, 200]
}
df = pd.DataFrame(data)
# Define the dependent and independent variables
X = df[['Startup Applications', 'RAM', 'Processor Speed', 'Disk Speed']]
y = df['Waiting Time']
# Add a constant to the independent variables
X = sm.add_constant(X)
# Fit the regression model
model = sm.OLS(y, X).fit()
# Print the model summary
print(model.summary())

Interpreting Results

The output will provide various statistics, including the coefficients (\beta values), p-values, R-squared value, and more. The coefficients indicate the expected change in the waiting time for a one-unit change in the respective independent variable, holding all other variables constant.

  • R-squared: The proportion of the variation in the dependent variable that the independent variables explain, between 0 and 1.
  • Coefficients: Represent the magnitude and direction of the relationship between each independent variable and the dependent variable.
  • P-values: Help determine the statistical significance of each coefficient. A common threshold for significance is 0.05. They can only be estimated when there are more observations than coefficients, which is not the case for the five-row toy dataset above.
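
To make the R-squared definition concrete, here is a minimal sketch that computes it as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the fitted values are hypothetical and serve only as an illustration.

```python
def r_squared(y, y_hat):
    """Return 1 - SS_res / SS_tot for observed y and fitted y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


observed = [45, 30, 60, 25, 40]
# Hypothetical fitted values from an imperfect model
fitted = [43, 32, 58, 27, 40]

print(r_squared(observed, fitted))
```

A model that reproduces every observation exactly has a residual sum of squares of zero and therefore an R-squared of 1.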

