Logistic regression is a statistical method used for analyzing datasets in which there are one or more independent variables that determine an outcome. The outcome is typically a binary variable, meaning it has two possible outcomes (e.g., success/failure, yes/no, purchased/not purchased). It is widely used in various fields such as medicine (disease diagnosis), finance (credit scoring), marketing (customer purchase prediction), and social sciences.
Formula: The logistic regression model predicts the probability of the positive outcome as:

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}$$

where $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_n$ are the coefficients, and $x_1, \ldots, x_n$ are the predictor variables.

Note that here, the logistic function (or sigmoid function) transforms the linear regression output $z = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$ to a value between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
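To make the transformation concrete, here is a minimal sketch of the sigmoid in Python (the function name sigmoid is our own for illustration, not part of any library):

import numpy as np

# Map any real-valued z to a probability strictly between 0 and 1
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Large negative z gives a probability near 0; large positive z gives one near 1
print(sigmoid(-4), sigmoid(0), sigmoid(4))  # ~0.018, 0.5, ~0.982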
Interpretation:
- The coefficients (the $\beta$ values) represent the change in the log odds of the outcome for a one-unit increase in the corresponding predictor variable (see the worked example just after this list).
- The output probability can be converted to a binary outcome using a threshold, usually 0.5.
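For example, if a predictor had a coefficient of $\beta_1 = 0.8$ (a hypothetical value, not one estimated from the dataset below), a one-unit increase in that predictor would multiply the odds of the outcome by $e^{0.8} \approx 2.23$, i.e. roughly double them.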
Implementation in Python & R
Here are examples of logistic regression using both Python and R with a sample dataset, where we want to predict whether a customer will purchase a product (Yes/No) based on their age and salary.
Python
We’ll use the popular pandas and statsmodels libraries for logistic regression.
- Install the required libraries: If you haven’t already, install the pandas, numpy, statsmodels, and scikit-learn libraries.
pip install pandas numpy statsmodels scikit-learn
- Python Code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
# Sample data
data = {
'age': [22, 25, 47, 52, 46, 56, 55, 60, 62, 61],
'salary': [15000, 18000, 20000, 22000, 19000, 24000, 25000, 26000, 27000, 28000],
'purchased': [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Defining the dependent and independent variables
X = df[['age', 'salary']]
y = df['purchased']
# Adding a constant for the intercept term
X = sm.add_constant(X)
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Logistic regression model
# (Note: on a tiny, cleanly separable sample like this one, Logit may warn
# about quasi-perfect separation; the data are purely illustrative.)
model = sm.Logit(y_train, X_train)
result = model.fit()
# Summary of the model
print(result.summary())
# Making predictions
y_pred = result.predict(X_test)
y_pred = np.where(y_pred > 0.5, 1, 0)
# Confusion Matrix and Accuracy
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print('Confusion Matrix:\n', cm)
print('Accuracy:', accuracy)
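As a quick usage sketch, the fitted model can score a new customer the same way (the age and salary values below are made up for illustration):

# Hypothetical new customer: age 30, salary 21000
# The 'const' column matches the intercept term added by sm.add_constant
new_customer = pd.DataFrame({'const': [1.0], 'age': [30], 'salary': [21000]})
prob = result.predict(new_customer)[0]
print('Purchase probability:', prob)
print('Predicted class:', int(prob > 0.5))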
R
- Install the required packages: If you haven’t already, install the tidyverse package.
install.packages("tidyverse")
- R Code:
# Load necessary libraries
library(tidyverse)
# Sample data
data <- tibble(
age = c(22, 25, 47, 52, 46, 56, 55, 60, 62, 61),
salary = c(15000, 18000, 20000, 22000, 19000, 24000, 25000, 26000, 27000, 28000),
purchased = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1)
)
# Splitting the data into training and testing sets
set.seed(42)
train_indices <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
# Logistic regression model (glm with family = binomial fits a logistic
# regression; on this tiny, cleanly separable sample it may warn that
# fitted probabilities numerically 0 or 1 occurred)
model <- glm(purchased ~ age + salary, data = train_data, family = binomial)
# Summary of the model
summary(model)
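# Exponentiating the coefficients gives odds ratios, which are often easier
# to interpret than log odds (exp() and coef() are base R functions; this
# step is our addition for interpretation, not part of the original recipe)
exp(coef(model))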
# Making predictions
predictions <- predict(model, newdata = test_data, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Confusion Matrix and Accuracy
confusion_matrix <- table(test_data$purchased, predicted_classes)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(confusion_matrix)
print(paste('Accuracy:', accuracy))