A comic guide to train-test split + Python & R codes

Once the dataset is collected and preprocessed, one can divide it into two separate sets: the training set and the testing set. The training set is used to train the model, while the testing set is used to evaluate its performance. This division allows us to assess how well the model generalizes to new, unseen data.
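Before using any libraries, it helps to see that a split is simply a matter of shuffling the rows and reserving a fraction of them. Here is a toy sketch in plain Python (the variable names are made up for this illustration; the library-based versions below handle this for you):

import random

# Ten toy samples: (feature, label) pairs, purely illustrative
samples = [(i, i % 2) for i in range(10)]

random.seed(42)          # make the shuffle reproducible
random.shuffle(samples)  # shuffle so both sets get a mix of samples

split_point = int(0.8 * len(samples))   # reserve 80% for training
train_samples = samples[:split_point]
test_samples = samples[split_point:]

print("Training samples:", len(train_samples))  # 8
print("Testing samples:", len(test_samples))    # 2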

Python Code (scroll down for R code):

To perform a train-test split with sklearn, you can follow these steps on simulated data. I’ll show you how to generate synthetic data with numpy and then split it into training and testing sets using train_test_split from sklearn.model_selection.

import numpy as np
from sklearn.model_selection import train_test_split

# Generate synthetic data
# Features: 100 samples, each with 5 features
X = np.random.rand(100, 5)

# Labels: 100 binary labels (0 or 1)
y = np.random.randint(0, 2, 100)

# Split the data into training and testing sets
# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the shapes of the split data
print("Training feature set shape:", X_train.shape)
print("Testing feature set shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Testing labels shape:", y_test.shape)

R Code:

Set a seed to ensure reproducibility, meaning that the same sequence of random numbers will be generated each time the code is run:

set.seed(42) # For reproducibility

Generate a synthetic dataset with 100 samples and 5 features. The features are created using a uniform random distribution:

# Generate synthetic data
# 100 samples, each with 5 features
X <- matrix(runif(100 * 5), nrow = 100, ncol = 5)

Generate 100 binary labels (0 or 1) to be used as the target variable. The labels are sampled randomly with replacement:

# Generate labels: 100 binary labels (0 or 1)
y <- sample(0:1, 100, replace = TRUE)

Combine the generated features and labels into a single data frame. The labels are converted to a factor type, which is appropriate for classification tasks:

# Combine the features and labels into a data frame
data <- data.frame(X, y = as.factor(y))

Split the data into training and testing sets. 80% of the data is used for training, and 20% is used for testing:

# Split the data into training and testing sets
# 80% training, 20% testing
train_index <- sample(seq_len(nrow(data)), size = 0.8 * nrow(data))

# Training data
train_data <- data[train_index, ]

# Testing data
test_data <- data[-train_index, ]

Output the dimensions of the training and testing sets to verify the split:

# Output the dimensions of the split data
cat("Training set dimensions:", dim(train_data), "\n")
cat("Testing set dimensions:", dim(test_data), "\n")

Combined R code:

# Only base R is used, so no packages need to be loaded
set.seed(42) # For reproducibility

# Generate synthetic data
# 100 samples, each with 5 features
X <- matrix(runif(100 * 5), nrow = 100, ncol = 5)

# Generate labels: 100 binary labels (0 or 1)
y <- sample(0:1, 100, replace = TRUE)

# Combine the features and labels into a data frame
data <- data.frame(X, y = as.factor(y))

# Split the data into training and testing sets
# 80% training, 20% testing
train_index <- sample(seq_len(nrow(data)), size = 0.8 * nrow(data))

# Training data
train_data <- data[train_index, ]

# Testing data
test_data <- data[-train_index, ]

# Output the dimensions of the split data
cat("Training set dimensions:", dim(train_data), "\n")
cat("Testing set dimensions:", dim(test_data), "\n")


