In a train-validation-test split, the training set is used to fit the model, the validation set is used to tune the model’s hyperparameters, and the test set is used to evaluate the final model’s performance. Keeping the test set untouched until the very end gives an honest estimate of how well the model generalizes to new, unseen data.
Python code
Here’s a step-by-step example of how to perform a train-validation-test split on a dataset. We’ll use Python’s scikit-learn library to split the data. Splitting this way lets us tune hyperparameters on the validation set and then evaluate the model on data it has never seen, so that overfitting shows up in the results instead of going unnoticed.
Step 1: Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
Step 2: Create a sample dataset
For the sake of the example, let’s create a dummy dataset.
# Create a sample dataset
np.random.seed(42)
X = np.random.rand(1000, 10) # 1000 samples, 10 features
y = np.random.randint(0, 2, size=(1000,)) # Binary target variable (0 or 1)
Here, X is the feature matrix (1000 rows and 10 columns), and y is the target variable with binary values (0 or 1).
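Before splitting, it can help to confirm that the dummy data looks as described. The snippet below is a minimal sketch that rebuilds the same arrays and prints their shapes and the class counts; the name class_counts is only illustrative.

```python
import numpy as np

# Recreate the sample dataset from Step 2
np.random.seed(42)
X = np.random.rand(1000, 10)                 # 1000 samples, 10 features
y = np.random.randint(0, 2, size=(1000,))    # binary target (0 or 1)

# Shapes of the feature matrix and the target vector
print("X shape:", X.shape)
print("y shape:", y.shape)

# Number of samples in class 0 and class 1 (illustrative variable name)
class_counts = np.bincount(y, minlength=2)
print("Class counts:", class_counts)
```

Because the labels are drawn at random, the two classes should come out roughly balanced, but not exactly equal.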
Step 3: Split the dataset into Train + Temp (Validation + Test)
First, we split the dataset into two parts:
- Training set: 60% of the data, used to train the model.
- Temporary set (Temp): 40% of the data, which will later be split into validation and test sets.
# First split: Train and Temp (Validation + Test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
Here, 60% of the data is assigned to X_train and y_train, and 40% to X_temp and y_temp.
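A quick way to check the 60/40 split is to print the shapes of the resulting arrays. This is a small self-contained sketch that repeats the dataset creation and the first split so it can be run on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Recreate the sample dataset from Step 2
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=(1000,))

# First split: 60% train, 40% temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# With 1000 samples, this should give 600 training rows and 400 temp rows
print("Train:", X_train.shape, y_train.shape)
print("Temp: ", X_temp.shape, y_temp.shape)
```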
Step 4: Split the Temp set into Validation and Test sets
Next, split the Temp set into:
- Validation set: 20% of the original dataset (half of the Temp set), used for hyperparameter tuning.
- Test set: 20% of the original dataset (the other half of the Temp set), used for the final evaluation.
# Second split: Temp -> Validation and Test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
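The two splits above are purely random. As an optional variation that is not part of the original walkthrough, train_test_split also accepts a stratify argument that keeps the class proportions roughly the same in every subset, which can matter when one class is rare. A minimal sketch, assuming the same dummy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Recreate the sample dataset from Step 2
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=(1000,))

# First split, stratified on y so train and temp share the class ratio
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# Second split, stratified on y_temp for the same reason
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# Fraction of class 1 in each subset; these should be close to each other
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), round(float(labels.mean()), 3))
```

The subset sizes stay at 600, 200, and 200; only the way samples are assigned changes.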
Now, you have three sets:
- X_train, y_train: Training set (60% of the data), used for training the model.
- X_val, y_val: Validation set (20% of the data), used for tuning hyperparameters and model selection.
- X_test, y_test: Test set (20% of the data), used for final evaluation of the model after it has been trained and validated.
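To show how the three sets might be used together, here is a hedged sketch that tunes one hyperparameter on the validation set and then reports a single test score. The choice of LogisticRegression, the grid candidate_C, and accuracy as the metric are assumptions made for illustration; the original example does not prescribe a model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Recreate the dataset and the two splits from the steps above
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=(1000,))
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Hypothetical grid of regularization strengths to try
candidate_C = [0.01, 0.1, 1.0, 10.0]

best_C, best_val_acc = None, -1.0
for C in candidate_C:
    # Fit on the training set only
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    # Compare candidates on the validation set
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Refit with the chosen setting and touch the test set exactly once
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))

print("Best C:", best_C)
print("Validation accuracy:", best_val_acc)
print("Test accuracy:", test_acc)
```

Since the dummy labels are random, the accuracies here should hover around chance level. The point of the sketch is the workflow: fit on the training set, choose on the validation set, and report on the test set.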