Logistic Regression: method + Python & R codes

Logistic regression is a statistical method for modeling the relationship between one or more independent variables and a categorical outcome. The outcome is typically binary, meaning it takes one of two possible values (e.g., success/failure, yes/no, purchased/not purchased). It is widely used in fields such as medicine (disease diagnosis), finance (credit scoring), marketing (customer purchase prediction), and the social sciences.

Formula: The logistic regression model predicts the probability P(Y=1) as:
P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n)}}
where \beta_0 is the intercept, \beta_1, \beta_2, \ldots, \beta_n are the coefficients, and X_1, X_2, \ldots, X_n are the predictor variables.

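Note that here, the logistic function (or sigmoid function) transforms the linear combination of predictors into a value between 0 and 1. Its inverse is the logit (log-odds) function, which is linear in the predictors, with p = P(Y=1):
\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n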
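
To make the relationship between the two functions concrete, here is a minimal sketch using only the Python standard library. The coefficient values and the single observation are hypothetical, chosen purely for illustration; they are not estimates from any dataset in this post.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and one observation (illustration only)
beta0, beta1, beta2 = -1.5, 0.8, 0.3
x1, x2 = 2.0, 1.0

z = beta0 + beta1 * x1 + beta2 * x2  # linear predictor
p = sigmoid(z)                       # P(Y=1)

print("linear predictor:", round(z, 4))
print("probability:", round(p, 4))
print("logit of probability:", round(logit(p), 4))  # recovers the linear predictor
```

Applying the logit to the predicted probability returns the linear predictor, which is why the coefficients are read on the log-odds scale.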

Interpretation:

  • Each coefficient (beta) represents the change in the log odds of the outcome for a one-unit increase in the corresponding predictor, holding the other predictors constant. Exponentiating a coefficient gives the odds ratio.
  • The output probability can be converted to a binary outcome using a threshold, usually 0.5.
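
The two points above can be sketched in a few lines of Python. The coefficient `beta_age`, the starting probability, and the list of probabilities are hypothetical values used only to illustrate the arithmetic; `classify` is an illustrative helper, not part of any library.

```python
import math

# Hypothetical coefficient for a predictor such as age in years (illustration only)
beta_age = 0.05

# A one-unit increase adds beta_age to the log odds,
# which multiplies the odds by exp(beta_age)
odds_ratio = math.exp(beta_age)

# Effect on a hypothetical starting probability of 0.20
p_before = 0.20
odds_before = p_before / (1 - p_before)
odds_after = odds_before * odds_ratio
p_after = odds_after / (1 + odds_after)

def classify(prob, threshold=0.5):
    """Convert a probability to a 0/1 label using a strict threshold."""
    return 1 if prob > threshold else 0

probs = [0.12, 0.50, 0.73]
labels = [classify(p) for p in probs]

print("odds ratio:", round(odds_ratio, 4))
print("probability after one-unit increase:", round(p_after, 4))
print(labels)  # [0, 0, 1]
```

Note that a probability of exactly 0.5 is labeled 0 here because the comparison is strict; the threshold itself is a modeling choice and can be moved when false positives and false negatives have different costs.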

Implementation in R & Python

Here are examples of logistic regression using both Python and R with a sample dataset, where we want to predict whether a customer will purchase a product (Yes/No) based on their age and salary.

Python

We’ll use the popular pandas and statsmodels libraries to fit the logistic regression, and scikit-learn to split the data and evaluate the predictions.

  1. Install the required libraries:
    If you haven’t already, install pandas, numpy, statsmodels, and scikit-learn libraries.
   pip install pandas numpy statsmodels scikit-learn
  2. Python Code:
   import pandas as pd
   import numpy as np
   import statsmodels.api as sm
   from sklearn.model_selection import train_test_split
   from sklearn.metrics import confusion_matrix, accuracy_score

   # Sample data
   data = {
       'age': [22, 25, 47, 52, 46, 56, 55, 60, 62, 61],
       'salary': [15000, 18000, 20000, 22000, 19000, 24000, 25000, 26000, 27000, 28000],
       'purchased': [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]
   }
   df = pd.DataFrame(data)

   # Defining the dependent and independent variables
   X = df[['age', 'salary']]
   y = df['purchased']

   # Adding a constant for the intercept term
   X = sm.add_constant(X)

   # Splitting the data into training and testing sets
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

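   # Logistic regression model
   # Note: in this toy dataset every customer aged 47 or older purchased, so the
   # classes are perfectly separable. Depending on the statsmodels version, fit()
   # may raise PerfectSeparationError or only emit a warning.
   model = sm.Logit(y_train, X_train)
   result = model.fit()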

   # Summary of the model
   print(result.summary())

   # Making predictions
   y_pred = result.predict(X_test)
   y_pred = np.where(y_pred > 0.5, 1, 0)

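   # Confusion matrix and accuracy (labels=[0, 1] keeps the matrix 2x2
   # even if the small test set happens to contain only one class)
   cm = confusion_matrix(y_test, y_pred, labels=[0, 1])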
   accuracy = accuracy_score(y_test, y_pred)
   print('Confusion Matrix:\n', cm)
   print('Accuracy:', accuracy)
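
To show what the confusion matrix and accuracy measure, here is a minimal sketch that computes them by hand with plain Python. The `y_true` and `y_pred` lists are hypothetical labels made up for illustration; they are not the output of the model above.

```python
# Hypothetical true and predicted labels (illustration only)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Count each cell of the 2x2 confusion matrix
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

# Same layout as scikit-learn: rows are true classes, columns are predicted classes
cm_manual = [[tn, fp],
             [fn, tp]]

# Accuracy is the share of predictions on the diagonal
accuracy_manual = (tp + tn) / len(y_true)

print(cm_manual)  # [[1, 1], [1, 3]]
print(round(accuracy_manual, 4))
```

With a test set as small as the one in this example, a single misclassified row moves the accuracy a great deal, so the figure should be read with caution.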

R

  1. Install the required packages:
    If you haven’t already, install the tidyverse package.
   install.packages("tidyverse")
  2. R Code:
   # Load necessary libraries
   library(tidyverse)

   # Sample data
   data <- tibble(
     age = c(22, 25, 47, 52, 46, 56, 55, 60, 62, 61),
     salary = c(15000, 18000, 20000, 22000, 19000, 24000, 25000, 26000, 27000, 28000),
     purchased = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1)
   )

   # Splitting the data into training and testing sets
   set.seed(42)
   train_indices <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))
   train_data <- data[train_indices, ]
   test_data <- data[-train_indices, ]

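   # Logistic regression model
   # Note: this toy dataset is perfectly separable by age, so glm() may warn
   # that fitted probabilities numerically 0 or 1 occurred
   model <- glm(purchased ~ age + salary, data = train_data, family = binomial)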

   # Summary of the model
   summary(model)

   # Making predictions
   predictions <- predict(model, newdata = test_data, type = "response")
   predicted_classes <- ifelse(predictions > 0.5, 1, 0)

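   # Confusion matrix and accuracy
   # Fixing the factor levels keeps the table 2x2 even if only one class is predicted
   confusion_matrix <- table(
     actual = factor(test_data$purchased, levels = c(0, 1)),
     predicted = factor(predicted_classes, levels = c(0, 1))
   )
   accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
   print(confusion_matrix)
   print(paste('Accuracy:', accuracy))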
