Logistic regression is a statistical method used for analyzing datasets in which there are one or more independent variables that determine an outcome. The outcome is typically a binary variable, meaning it has two possible outcomes (e.g., success/failure, yes/no, purchased/not purchased). It is widely used in various fields such as medicine (disease diagnosis), finance (credit scoring), marketing (customer purchase prediction), and social sciences.
Formula: The logistic regression model predicts the probability of the positive outcome as:

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}$$

where $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_n$ are the coefficients, and $x_1, \ldots, x_n$ are the predictor variables.

Note that here, the logistic function (or sigmoid function) transforms the linear regression output $z = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$ to a value between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
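To make the transformation concrete, here is a minimal sketch of the sigmoid in Python (the function name sigmoid is our own for illustration, not part of any library):

import numpy as np

# Map any real-valued z to a probability strictly between 0 and 1
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Large negative z gives a probability near 0; large positive z gives one near 1
print(sigmoid(-4), sigmoid(0), sigmoid(4))  # ~0.018, 0.5, ~0.982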
Interpretation:
- The coefficients (the $\beta$ values) represent the change in the log odds of the outcome for a one-unit increase in the corresponding predictor variable (see the worked example just after this list).
- The output probability can be converted to a binary outcome using a threshold, usually 0.5.
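For example, if a predictor had a coefficient of $\beta_1 = 0.8$ (a hypothetical value, not one estimated from the dataset below), a one-unit increase in that predictor would multiply the odds of the outcome by $e^{0.8} \approx 2.23$, i.e. roughly double them.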
Implementation in Python & R
Here are examples of logistic regression using both Python and R with a sample dataset, where we want to predict whether a customer will purchase a product (Yes/No) based on their age and salary.
Python
We’ll use the popular pandas and statsmodels libraries for logistic regression.
- Install the required libraries: If you haven’t already, install the pandas, numpy, statsmodels, and scikit-learn libraries.
pip install pandas numpy statsmodels scikit-learn
- Python Code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
# Sample data
data = {
'age': [22, 25, 47, 52, 46, 56, 55, 60, 62, 61],
'salary': [15000, 18000, 20000, 22000, 19000, 24000, 25000, 26000, 27000, 28000],
'purchased': [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Defining the dependent and independent variables
X = df[['age', 'salary']]
y = df['purchased']
# Adding a constant for the intercept term
X = sm.add_constant(X)
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Logistic regression model
# (Note: on a tiny, cleanly separable sample like this one, Logit may warn
# about quasi-perfect separation; the data are purely illustrative.)
model = sm.Logit(y_train, X_train)
result = model.fit()
# Summary of the model
print(result.summary())
# Making predictions
y_pred = result.predict(X_test)
y_pred = np.where(y_pred > 0.5, 1, 0)
# Confusion Matrix and Accuracy
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print('Confusion Matrix:\n', cm)
print('Accuracy:', accuracy)
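As a quick usage sketch, the fitted model can score a new customer the same way (the age and salary values below are made up for illustration):

# Hypothetical new customer: age 30, salary 21000
# The 'const' column matches the intercept term added by sm.add_constant
new_customer = pd.DataFrame({'const': [1.0], 'age': [30], 'salary': [21000]})
prob = result.predict(new_customer)[0]
print('Purchase probability:', prob)
print('Predicted class:', int(prob > 0.5))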
R
- Install the required packages: If you haven’t already, install the tidyverse package.
install.packages("tidyverse")
- R Code:
# Load necessary libraries
library(tidyverse)
# Sample data
data <- tibble(
age = c(22, 25, 47, 52, 46, 56, 55, 60, 62, 61),
salary = c(15000, 18000, 20000, 22000, 19000, 24000, 25000, 26000, 27000, 28000),
purchased = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1)
)
# Splitting the data into training and testing sets
set.seed(42)
train_indices <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
# Logistic regression model (glm with family = binomial fits a logistic
# regression; on this tiny, cleanly separable sample it may warn that
# fitted probabilities numerically 0 or 1 occurred)
model <- glm(purchased ~ age + salary, data = train_data, family = binomial)
# Summary of the model
summary(model)
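# Exponentiating the coefficients gives odds ratios, which are often easier
# to interpret than log odds (exp() and coef() are base R functions; this
# step is our addition for interpretation, not part of the original recipe)
exp(coef(model))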
# Making predictions
predictions <- predict(model, newdata = test_data, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Confusion Matrix and Accuracy
confusion_matrix <- table(test_data$purchased, predicted_classes)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(confusion_matrix)
print(paste('Accuracy:', accuracy))