K-Nearest Neighbors for Time Series Prediction in Python with sktime

The K-Nearest Neighbors (KNN) Time Series Regressor is a machine learning algorithm that predicts values based on proximity among data points in historical data. By examining the ‘k’ closest training examples in the feature space, it identifies patterns and trends over time. This approach is particularly beneficial when data exhibits temporal correlations, enabling forecasting in applications such as finance, weather prediction, and inventory management. As a non-parametric method, it assumes no underlying distribution, offering flexibility for complex datasets while remaining simple to implement and understand. Tuning the ‘k’ parameter lets practitioners balance sensitivity to noise against bias in predictions, improving the overall performance of the regressor.

The K-Nearest Neighbors (KNN) Regressor is particularly effective in scenarios where the relationship between the features and the target variable is non-linear and complex. It works well when the data is highly localized, meaning that similar data points tend to have similar target values. This makes KNN suitable for datasets where the target value can be approximated by averaging the values of its nearest neighbors.
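The averaging idea above can be sketched in a few lines of NumPy. This is a toy illustration using plain Euclidean distance, not sktime's implementation; the function name `knn_regress` is made up for this example:

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Toy k-NN regression: average the targets of the k nearest
    training points under Euclidean distance."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    return y_train[nearest].mean()

# The query x=2.0 is closest to the training points at 1, 2, and 3,
# so the prediction is the average of their targets: (1 + 2 + 3) / 3 = 2.0
X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 10.0])
print(knn_regress(X, y, np.array([2.0]), k=3))  # 2.0
```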

Additionally, KNN is useful when you have a small to moderately sized dataset because it doesn’t require extensive training time like some other algorithms. It’s also beneficial when you need a simple and interpretable model, as it doesn’t involve complex transformations or assumptions about the data distribution.

However, KNN can struggle with high-dimensional data due to the “curse of dimensionality,” and it may not perform well if the data has many irrelevant features. It’s also computationally expensive for large datasets since it requires calculating distances between data points.


Python implementation

In this tutorial, we’ll use the sktime library with the ArrowHead dataset. The ArrowHead dataset is a time series classification dataset that consists of outlines of arrowhead images. The shapes of these projectile points are converted into time series data using an angle-based method.


1. Import Necessary Libraries

from sktime.datasets import load_arrow_head
from sktime.regression.distance_based import KNeighborsTimeSeriesRegressor
from sklearn.metrics import mean_squared_error
  • load_arrow_head: This function loads the “Arrow Head” dataset, a univariate time series dataset available in sktime.
  • KNeighborsTimeSeriesRegressor: A k-nearest neighbors (k-NN) algorithm specialized for time series regression tasks.
  • mean_squared_error: A metric from sklearn used to evaluate the performance of the regressor by calculating the mean squared error (MSE) between the actual and predicted values.

2. Load the Dataset

X_train, y_train = load_arrow_head(split="train", return_X_y=True)
X_test, y_test = load_arrow_head(split="test", return_X_y=True)
  • load_arrow_head(split="train", return_X_y=True):
    • Loads the training portion of the “Arrow Head” dataset.
    • The parameter return_X_y=True ensures the function returns the feature set (X_train) and target labels (y_train) separately, rather than in a composite data structure.
  • Similarly, the split="test" part loads the test dataset into X_test and y_test.

3. Convert Target Variables to Float

y_train = y_train.astype("float")
y_test = y_test.astype("float")
  • Ensures that the target values (y_train and y_test) are of type float, which is necessary for numerical calculations like mean squared error.
  • The ArrowHead labels are class labels stored as strings, so this cast is what turns the classification task into a regression task with numeric targets. For datasets whose targets are already numeric, this step may not be necessary.

4. Initialize the Regressor

regressor = KNeighborsTimeSeriesRegressor()
  • Initializes the k-nearest neighbors (k-NN) regressor for time series data.
  • This regressor uses distance-based measures (like Dynamic Time Warping) to compare time series instead of simple Euclidean distances.

5. Train the Regressor

regressor.fit(X_train, y_train)
  • Fits the regressor to the training dataset (X_train, y_train).
  • The model learns how to predict the target variable (y_train) based on the patterns in the input data (X_train).

6. Make Predictions

y_pred = regressor.predict(X_test)
  • Uses the trained model to predict the target variable for the test dataset (X_test).
  • The result, y_pred, contains the predicted values.

7. Evaluate the Model

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
  • Compares the predicted values (y_pred) to the actual values (y_test) using the Mean Squared Error (MSE) metric. Smaller MSE values indicate better performance.
  • Prints the MSE to assess how well the model performed on the test data.
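To make the metric concrete: MSE is just the average of the squared residuals, so computing it by hand on a tiny example matches sklearn's result.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

# ((1 - 1.5)^2 + (2 - 2)^2 + (3 - 2)^2) / 3 = 1.25 / 3 ≈ 0.4167
manual = np.mean((y_true - y_hat) ** 2)
print(manual)
print(mean_squared_error(y_true, y_hat))  # same value
```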

Key Notes

  • The KNeighborsTimeSeriesRegressor works well for time series data due to its ability to compare sequences based on their shape and temporal structure. The default implementation uses Dynamic Time Warping (DTW) distance for similarity calculations.
  • Dynamic Time Warping (DTW) is a powerful algorithm used for measuring similarity between two time series, allowing them to be aligned in a flexible manner even if they vary in speed or timing.
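The flexible alignment that DTW performs can be sketched with the classic dynamic-programming recurrence. This is a minimal educational implementation (quadratic time, absolute-difference local cost), not the optimized distance sktime uses internally:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D sequences,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed alignments
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A time-shifted copy of a series has DTW distance 0, even though the
# pointwise (Euclidean-style) difference between the two is large
a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
print(dtw_distance(a, b))  # 0.0
```

This is why DTW-based neighbors can match series that have the same shape but are out of phase, where a plain Euclidean comparison would consider them dissimilar.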

Combined code:

from sktime.datasets import load_arrow_head
from sktime.regression.distance_based import KNeighborsTimeSeriesRegressor
from sklearn.metrics import mean_squared_error
# Load the arrow head dataset
X_train, y_train = load_arrow_head(split="train", return_X_y=True)
X_test, y_test = load_arrow_head(split="test", return_X_y=True)
# Ensure the target variables are floats (if needed)
y_train = y_train.astype("float")
y_test = y_test.astype("float")
# Initialize and train the KNeighborsTimeSeriesRegressor
regressor = KNeighborsTimeSeriesRegressor()
regressor.fit(X_train, y_train)
# Predict and evaluate
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
