Grid search and train-validation-test split for hyperparameter tuning – intro

In a train-validation-test split, the training set is used to fit the model, the validation set is used to tune the model’s hyperparameters, and the test set is used only at the end to evaluate the final model’s performance. Keeping the test set out of both fitting and tuning is what lets it indicate how well the model generalizes to new, unseen data.

Python code

Here’s a step-by-step example of how to perform a train-validation-test split on a dataset, using the train_test_split function from Python’s scikit-learn library. Holding out separate validation and test sets means hyperparameters are tuned on data the model was not fitted on, and the final evaluation uses data that played no part in either step, which helps detect overfitting.

Step 1: Import necessary libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

Step 2: Create a sample dataset

For the sake of the example, let’s create a dummy dataset.

# Create a sample dataset
np.random.seed(42)
X = np.random.rand(1000, 10)  # 1000 samples, 10 features
y = np.random.randint(0, 2, size=(1000,))  # Binary target variable (0 or 1)

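Here, X is the feature matrix (1000 rows and 10 columns), and y is the target variable with binary values (0 or 1). Because y is generated independently of X, there is no real pattern to learn; the dataset only serves to demonstrate the mechanics of splitting.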
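As a quick sanity check before splitting, the sketch below rebuilds the same dummy dataset and prints its shape and class counts. The name class_counts is introduced here only for illustration and is not part of the original example.

```python
import numpy as np

# Recreate the dummy dataset from Step 2
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=(1000,))

# Confirm the dimensions of the feature matrix and the target vector
print(X.shape)  # (1000, 10)
print(y.shape)  # (1000,)

# Count how many samples fall into each class (index 0 -> class 0, index 1 -> class 1)
class_counts = np.bincount(y, minlength=2)
print(class_counts)
```

Checking the class counts up front is useful on real data, where a strong imbalance may call for a stratified split.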

Step 3: Split the dataset into Train + Temp (Validation + Test)

First, we split the dataset into two parts:

  1. Training set: 60% of the data, used to train the model.
  2. Temporary set (Temp): 40% of the data, which will later be split into validation and test sets.

# First split: Train and Temp (Validation + Test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)

Here, 60% of the data (600 samples) is assigned to X_train and y_train, and 40% (400 samples) to X_temp and y_temp.
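To confirm the proportions of the first split, one can simply inspect the array shapes. This is a minimal, self-contained check that repeats Steps 2 and 3 so it runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Recreate the dummy dataset from Step 2
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=(1000,))

# First split: 60% train, 40% temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)

print(X_train.shape, X_temp.shape)  # (600, 10) (400, 10)
print(y_train.shape, y_temp.shape)  # (600,) (400,)
```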

Step 4: Split the Temp set into Validation and Test sets

Next, split the Temp set into:

  1. Validation set: 20% of the original dataset (half of the Temp set), used for hyperparameter tuning.
  2. Test set: 20% of the original dataset (the other half of the Temp set), used for the final evaluation.

# Second split: Temp -> Validation and Test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
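The two calls from Steps 3 and 4 can also be wrapped in a single helper. The sketch below is one possible way to do that; the function name train_val_test_split and its parameters are hypothetical (scikit-learn provides only train_test_split), and the optional stratify flag is an addition that keeps the class proportions similar across the three sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_size=0.2, test_size=0.2,
                         stratify=False, random_state=42):
    """Hypothetical helper: split X, y into train, validation, and test sets.

    val_size and test_size are fractions of the ORIGINAL dataset.
    """
    temp_size = val_size + test_size
    # First split: train vs. temp (validation + test)
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y,
        test_size=temp_size,
        random_state=random_state,
        stratify=y if stratify else None,
    )
    # Second split: temp -> validation and test.
    # The test fraction is rescaled so it is relative to the temp set.
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp,
        test_size=test_size / temp_size,
        random_state=random_state,
        stratify=y_temp if stratify else None,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Usage with the dummy dataset from Step 2
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=(1000,))

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X, y, stratify=True
)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Expressing the validation and test sizes as fractions of the original dataset avoids having to work out by hand that 20% of the whole corresponds to 50% of the Temp set.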

Now, you have three sets:

  • X_train, y_train: Training set (60% of the data), used for training the model.
  • X_val, y_val: Validation set (20% of the data), used for tuning hyperparameters and model selection.
  • X_test, y_test: Test set (20% of the data), used for final evaluation of the model after it has been trained and validated.
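To show how the three sets fit together, here is a minimal sketch of a manual grid search over one hyperparameter. Several things in it are assumptions not taken from the example above: the data comes from scikit-learn’s make_classification instead of the random dummy labels (so there is an actual pattern to learn), the model is a LogisticRegression, the tuned hyperparameter is its regularization strength C, and the candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: a synthetic dataset with learnable structure,
# used in place of the purely random labels from Step 2
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Same 60/20/20 split as in Steps 3 and 4
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Assumption: an arbitrary grid of candidate values for C
candidate_C = [0.01, 0.1, 1.0, 10.0]

best_C, best_val_acc = None, -1.0
for C in candidate_C:
    # Fit on the training set only
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    # Compare candidates on the validation set
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Refit with the selected value and touch the test set exactly once
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))

print(f"Best C: {best_C}, validation accuracy: {best_val_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

The test set is never consulted inside the loop, so the reported test accuracy is not biased by the choice of C. A common variation is to refit the final model on the training and validation sets combined before the test evaluation.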
