Using pipelines in Python/R to improve coding efficiency & readability

Pipelines in Python

In Python, pipelines can be used to structure your code more efficiently, making it more readable and maintainable. You can create pipelines with the pipe() method in pandas, or with the Pipeline class in scikit-learn.

Example: Using Pandas pipe()

The pipe() method in pandas allows you to chain multiple functions together in a sequence, making the data processing steps more readable.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Define custom functions (each modifies df in place and returns it, so the calls can be chained)
def add_column(df, col_name, value):
    df[col_name] = value
    return df

def multiply_column(df, col_name, factor):
    df[col_name] = df[col_name] * factor
    return df

# Using pipe to chain functions
result = (df.pipe(add_column, 'C', 5)
           .pipe(multiply_column, 'B', 2)
           .pipe(multiply_column, 'C', 3))

print(result)

Output:

   A   B   C
0  1  20  15
1  2  40  15
2  3  60  15
3  4  80  15

Advantages:

  • Readability: Each transformation is clearly separated and described.
  • Reusability: Functions can be reused in different pipelines.
  • Debugging: Easier to isolate and test each function (see the sketch below).
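
The debugging point is worth spelling out: because each step is an ordinary function, it can be unit-tested on a tiny input before being chained. Below is a minimal sketch, reusing the add_column function defined above (the test values are arbitrary):

import pandas as pd

# The add_column step from above, checked on its own
def add_column(df, col_name, value):
    df[col_name] = value
    return df

# A tiny input is enough to test the step independently of any pipeline
test_df = pd.DataFrame({'A': [1, 2]})
out = add_column(test_df, 'C', 5)

assert list(out['C']) == [5, 5]
print(out)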

Example: Using Scikit-Learn Pipelines

In machine learning, the Pipeline class from scikit-learn is commonly used to chain preprocessing steps and estimators.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Fit the model
pipeline.fit(X_train, y_train)

# Predict
predictions = pipeline.predict(X_test)

Advantages:

  • Modularity: Different preprocessing steps and models are encapsulated.
  • Consistency: Training and prediction workflows remain consistent.
  • Cross-validation: Pipelines can be easily integrated with cross-validation methods (see the sketch below).
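
The cross-validation point in particular is what makes pipelines so convenient: when the whole pipeline is passed to cross_val_score, it is treated as a single estimator, so the scaler is refit on each training fold only and no information leaks from the held-out fold into the preprocessing. A minimal sketch, reusing the scaler-plus-classifier pipeline from above (the iris dataset and 5 folds are just illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The same scaler + classifier pipeline as above
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# cross_val_score refits the entire pipeline (scaler included) within each fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())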

Example 2: Using a Pipeline with Cross-Validation and Grid Search

In the following code, we will use the breast cancer dataset to:

  1. Split the data into training and validation sets using train_test_split.
  2. Standardize the data using StandardScaler.
  3. Impute missing values using KNN imputation (KNNImputer).
  4. Classify the data using KNN classification (KNeighborsClassifier).
  5. Perform hyperparameter tuning using GridSearchCV to find the optimal number of neighbors for both imputation and classification.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

# Load a real dataset from scikit-learn (e.g., Breast Cancer dataset)
data = load_breast_cancer()
X, y = data.data, data.target

# Introduce missing data (simulate 20% missing rate)
np.random.seed(42)  # For reproducibility
missing_rate = 0.2
n_samples, n_features = X.shape
n_missing_samples = int(np.floor(n_samples * n_features * missing_rate))

# Randomly choose indices to set as NaN (sampled with replacement, so duplicate
# (row, column) pairs make the realized number of missing cells slightly below the target)
missing_indices = (
    np.random.choice(n_samples, n_missing_samples),
    np.random.choice(n_features, n_missing_samples)
)
X[missing_indices] = np.nan

# Debug: Check the amount of missing data
print(f"Total missing values: {np.isnan(X).sum()} out of {n_samples * n_features}")

# Split data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a pipeline for imputation, scaling, and classification
pipeline = Pipeline([
    ('imputer', KNNImputer()),  # Step 1: Impute missing data using KNN
    ('scaler', StandardScaler()),  # Step 2: Standardize data
    ('classifier', KNeighborsClassifier())  # Step 3: Classify using KNN
])

# Hyperparameters to tune; the 'step__parameter' syntax targets a parameter of the named pipeline step
param_grid = {
    'imputer__n_neighbors': [2, 3, 4, 5, 6],  # Neighbors for KNN Imputation
    'classifier__n_neighbors': [3, 5, 7, 9, 11]  # Neighbors for KNN Classification
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')

# Fit the pipeline on the training data
grid_search.fit(X_train, y_train)

# Best hyperparameters found
print("Best hyperparameters:", grid_search.best_params_)

# Predict on validation set
y_pred = grid_search.predict(X_val)

# Evaluate the accuracy on the validation set
accuracy = accuracy_score(y_val, y_pred)
print("Validation accuracy:", accuracy)

Output:

Total missing values: 3102 out of 17070
Best hyperparameters: {'classifier__n_neighbors': 9, 'imputer__n_neighbors': 6}
Validation accuracy: 0.9649122807017544
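
Since GridSearchCV refits the winning pipeline on the full training set by default (refit=True), the tuned pipeline can also be pulled out and reused like any other fitted estimator. A minimal sketch, continuing from the grid_search object fitted above:

# The refit pipeline (imputer -> scaler -> classifier) with the best hyperparameters
best_pipeline = grid_search.best_estimator_

# Mean cross-validated accuracy of the best hyperparameter combination
print("Best CV accuracy:", grid_search.best_score_)

# The tuned pipeline predicts directly on raw features, missing values included
print("First 5 validation predictions:", best_pipeline.predict(X_val[:5]))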

Example 3: Comparing MinMaxScaler followed by StandardScaler versus StandardScaler alone

Here’s a Python script that compares two ways of preprocessing the California housing data: MinMaxScaler followed by StandardScaler versus StandardScaler alone. The comparison is done with cross-validation (CV), and the code uses scikit-learn for preprocessing, the regression model, and evaluation:

  1. Dataset: The fetch_california_housing function loads the California housing dataset.
  2. Pipelines: Two pipelines are defined. The first applies MinMaxScaler followed by StandardScaler, and the second applies StandardScaler alone. Both pipelines include a LinearRegression as the model.
  3. Cross-Validation: The cross_val_score function computes the R² scores for each fold of the cross-validation process.
  4. Comparison: The mean R² scores of both pipelines are printed for comparison.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Load California housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Define pipelines
pipeline_minmax_then_standard = Pipeline([
    ('minmax', MinMaxScaler()),
    ('standard', StandardScaler()),
    ('model', LinearRegression())
])

pipeline_standard_alone = Pipeline([
    ('standard', StandardScaler()),
    ('model', LinearRegression())
])

# Perform cross-validation
cv_scores_minmax_then_standard = cross_val_score(
    pipeline_minmax_then_standard, X, y, cv=5, scoring='r2'
)

cv_scores_standard_alone = cross_val_score(
    pipeline_standard_alone, X, y, cv=5, scoring='r2'
)

# Results
print("MinMaxScaler -> StandardScaler:")
print(f"R2 scores: {cv_scores_minmax_then_standard}")
print(f"Mean R2: {np.mean(cv_scores_minmax_then_standard)}")

print("\nStandardScaler alone:")
print(f"R2 scores: {cv_scores_standard_alone}")
print(f"Mean R2: {np.mean(cv_scores_standard_alone)}")

Output:

MinMaxScaler -> StandardScaler:
R2 scores: [0.54866323 0.46820691 0.55078434 0.53698703 0.66051406]
Mean R2: 0.5530311140279561

StandardScaler alone:
R2 scores: [0.54866323 0.46820691 0.55078434 0.53698703 0.66051406]
Mean R2: 0.5530311140279563

The two pipelines produce essentially identical R² scores: standardizing after min-max scaling yields exactly the same standardized features as standardizing the raw data (each column just undergoes an affine rescaling first), so the downstream LinearRegression sees the same inputs up to floating-point rounding.

Pipelines in R

In R, pipelines are primarily built with the %>% operator from the magrittr package, which is re-exported by dplyr (base R 4.1+ also provides a native |> pipe).

Example: Using dplyr and magrittr Pipelines

The %>% operator passes the output of one function directly as the first argument of the next function.

library(dplyr)

# Sample data frame
df <- data.frame(
  A = c(1, 2, 3, 4),
  B = c(10, 20, 30, 40)
)

# Using pipeline to process data
result <- df %>%
  mutate(C = A + B) %>%
  filter(C > 20) %>%
  arrange(desc(C))

print(result)

Advantages:

  • Simplicity: The code reads like a series of instructions.
  • Efficiency: Intermediate results are not bound to named variables, keeping the workspace uncluttered.
  • Composability: Functions can be combined easily, enhancing code reuse.
