
downsampling for hyperparameter tuning

Downsampling for hyperparameter tuning reduces the dataset size to speed up model training and experimentation while preserving key data characteristics. Here’s a concise overview: Why Downsample for Hyperparameter Tuning? Key Considerations Practical Steps Pitfalls to… 
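A minimal sketch of the idea, assuming scikit-learn is available (the dataset, sizes, and fraction are illustrative, not from the post): a stratified subsample keeps class proportions intact, so hyperparameter rankings on the small set should roughly track the full set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset standing in for a much larger real one.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Stratified downsample to 20% of the rows: class proportions are
# preserved, which is the "preserving key data characteristics" part.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=42
)
print(X_small.shape)  # (200, 5)
```

Tune on `(X_small, y_small)`, then refit the best configuration on the full data.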

What’s Missing Completely At Random (MCAR) data?

Here are some more examples of MCAR (recall that Missing completely at random (MCAR) data occurs when the probability of missing data on a variable is independent of any other measured variables and the underlying… 
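A small simulation of the mechanism (the variable and numbers are illustrative): under MCAR, every value has the same fixed chance of being masked, so the observed values remain an unbiased sample.

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.normal(50_000, 10_000, size=500)

# MCAR: each value is masked with the same probability (20% here),
# regardless of the value itself or of any other variable.
mcar_mask = rng.random(500) < 0.2
income_mcar = np.where(mcar_mask, np.nan, income)

# The mean of the observed values stays close to the true mean.
```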

Expectation Maximization (EM) & implementation

Expectation Maximization (EM) is an iterative algorithm used for finding maximum likelihood estimates of parameters in statistical models, particularly when the model involves latent variables (variables that are not directly observed). The algorithm is commonly… 
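A minimal sketch of EM for a two-component 1-D Gaussian mixture, assuming unit variances and equal weights for simplicity (only the means are estimated; the component label of each point is the latent variable):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data drawn from two Gaussians centered at 0 and 5.
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])

mu = np.array([1.0, 4.0])  # initial guesses for the component means
for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    # (equal priors and unit variances assumed, so only the kernel matters).
    resp = np.exp(-0.5 * (data[:, None] - mu[None, :]) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: each mean becomes a responsibility-weighted average.
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)  # close to [0, 5]
```

Each iteration is guaranteed not to decrease the data log-likelihood, which is the core property of EM.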

A comic guide to denoising noisy data

Handling noisy data is a crucial step in data preprocessing and analysis. In general, here are some common approaches to manage noisy data: 1. Data Cleaning 2. Data Transformation 3. Statistical Techniques 4. Machine Learning… 
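One statistical technique from the list above, sketched with illustrative data (a noisy sine wave) and pandas: a centered rolling mean averages out zero-mean noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
t = np.linspace(0, 2 * np.pi, 200)
noisy = np.sin(t) + rng.normal(0, 0.3, size=200)

# Smoothing: a centered rolling mean over 11 points.
smoothed = pd.Series(noisy).rolling(window=11, center=True, min_periods=1).mean()

# The smoothed series sits closer to the true signal than the raw one.
err_noisy = np.abs(noisy - np.sin(t)).mean()
err_smooth = np.abs(smoothed.to_numpy() - np.sin(t)).mean()
```

The window size trades noise reduction against blurring of genuine features; it is a hyperparameter, not a universal constant.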

A comical guide to Missing Not At Random (MNAR)

Recall that Missing Not At Random (MNAR) is a type of missing data mechanism where the probability of missingness is related to the unobserved data itself. Here are some more examples of MNAR: In each… 
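A small simulation of the mechanism (the variable and probabilities are illustrative): under MNAR the chance of a value going missing depends on the value itself, so the observed sample is biased.

```python
import numpy as np

rng = np.random.default_rng(4)
income = rng.normal(50_000, 10_000, size=2000)

# MNAR: higher incomes are more likely to go unreported --
# the missingness probability rises with the (unobserved) value itself.
p_missing = 1 / (1 + np.exp(-(income - 50_000) / 5_000))
mnar_mask = rng.random(2000) < p_missing
income_mnar = np.where(mnar_mask, np.nan, income)

# The observed mean is biased well below the true mean,
# and no information in the dataset can correct for it.
```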

What’s Missing at Random (MAR)?

Missing at Random (MAR) is a statistical term indicating that the likelihood of data being missing is related to some of the observed data but not to the missing data itself. This means that the… 
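A small simulation of the mechanism (variables and probabilities are illustrative): under MAR, missingness in one variable depends only on another *observed* variable, so it can be corrected by conditioning on that variable.

```python
import numpy as np

rng = np.random.default_rng(5)
age = rng.integers(20, 70, size=2000)
income = rng.normal(50_000, 10_000, size=2000)

# MAR: whether income is missing depends only on observed age
# (older respondents skip the question more often), not on income itself.
p_missing = np.where(age >= 50, 0.4, 0.05)
mar_mask = rng.random(2000) < p_missing
income_mar = np.where(mar_mask, np.nan, income)
```

Because age is fully observed, methods that condition on it (e.g. regression or multiple imputation) can recover unbiased estimates.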

denoising via dimension reduction in python

Dimension reduction methods like Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) can be used for denoising data because they work by retaining the most important features (or dimensions) that capture the majority of… 
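A minimal sketch with scikit-learn's PCA on synthetic data (the low-rank signal and noise level are illustrative): keeping only the top components and projecting back discards the mostly-noise directions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# A rank-2 signal spread across 20 features, buried in Gaussian noise.
signal = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 20))
noisy = signal + rng.normal(0, 0.5, size=(300, 20))

# Project onto the top 2 components, then map back to the original space:
# the discarded directions carry mostly noise.
pca = PCA(n_components=2)
denoised = pca.inverse_transform(pca.fit_transform(noisy))

err_before = np.mean((noisy - signal) ** 2)
err_after = np.mean((denoised - signal) ** 2)
```

In practice the signal rank is unknown; the explained-variance ratio (`pca.explained_variance_ratio_`) is a common guide for choosing `n_components`.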

Missing data analysis: where’s your missing piece?

Missing data can occur for various reasons, including human error, malfunctioning equipment, or even intentional omission. Handling it matters because it can significantly impact the reliability and accuracy… 

Imputation using SoftImpute in python

SoftImpute is a matrix completion algorithm in Python that allows you to fill in missing data in your dataset. This method is based on Singular Value Decomposition (SVD) and Iterative Soft Thresholding. Here’s a basic… 
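A minimal NumPy sketch of the idea (not a production implementation; the shrinkage value and demo matrix are illustrative): repeatedly SVD the matrix, soft-threshold the singular values, and restore the observed entries.

```python
import numpy as np

def soft_impute(X, shrinkage=1.0, n_iters=100):
    """Minimal SoftImpute sketch: zero-fill the gaps, then iterate
    SVD -> soft-threshold singular values -> restore observed entries."""
    mask = np.isnan(X)
    filled = np.where(mask, 0.0, X)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)   # soft thresholding
        low_rank = (U * s) @ Vt
        filled = np.where(mask, low_rank, X)  # keep observed entries fixed
    return filled

# Demo: a rank-1 matrix with ~20% of entries removed is recovered well.
rng = np.random.default_rng(7)
truth = np.outer(rng.normal(size=30), rng.normal(size=20))
X = truth.copy()
X[rng.random(truth.shape) < 0.2] = np.nan
completed = soft_impute(X, shrinkage=0.1)
```

The `fancyimpute` package provides a tuned implementation of the same algorithm; the loop above only shows the mechanics.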

K-Nearest Neighbors (KNN) imputation in sklearn

K-Nearest Neighbors (KNN) imputation is another method to handle missing data. It uses the ‘k’ closest instances (rows) to each instance that contains any missing values to fill in those values. In sklearn, you can… 
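A minimal example with scikit-learn's `KNNImputer` on a toy matrix (the values are illustrative): each missing entry is filled with the mean of that feature over the k nearest rows, where nearness is computed from the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 4.0],   # missing value to be imputed
    [3.0, 4.0, 5.0],
    [8.0, 9.0, 10.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # 3.0: mean of the two closest rows' second column
```

By default distances use `nan_euclidean`, which rescales for the missing coordinates; `weights="distance"` makes closer neighbors count more.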

A comic guide to mean/median/mode imputation & Python codes

Handling missing data is a common preprocessing task in machine learning. In scikit-learn, you can handle missing data by using imputation techniques provided by the SimpleImputer class or by employing other strategies like dropping rows/columns with missing… 
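The three classic strategies side by side with `SimpleImputer`, on an illustrative single-column array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0], [2.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(mean_filled[2, 0], median_filled[2, 0], mode_filled[2, 0])  # 2.25 2.0 2.0
```

Mean is sensitive to outliers, median is robust, and mode (`most_frequent`) is the usual choice for categorical columns.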

SVD for dimension reduction

Singular Value Decomposition (SVD) is a powerful matrix decomposition technique that generalizes the concept of eigenvalue decomposition to non-square matrices. Eigenvalue decomposition specifically decomposes a square matrix into its constituent eigenvalues and eigenvectors. This decomposition… 
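A minimal NumPy sketch on an illustrative non-square matrix: truncating the SVD at rank k gives the best rank-k approximation in the least-squares sense (Eckart–Young), and projecting onto the top right singular vectors gives the reduced representation.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(100, 30))  # non-square: eigendecomposition doesn't apply

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation keeps the k largest singular values.
k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Dimension reduction: project each row onto the top-k right singular vectors.
A_reduced = A @ Vt[:k].T   # shape (100, 10)
```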

test for outliers in multivariate data in Python

To test for outliers in multivariate data in Python, you can use several libraries such as NumPy, SciPy, pandas, and scikit-learn. Here’s how you can do it: Mahalanobis distance using the SciPy library The Mahalanobis distance is a statistical measure used… 
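A minimal sketch on illustrative 2-D data: under multivariate normality, squared Mahalanobis distances follow a chi-squared distribution with d degrees of freedom, so points beyond a high quantile can be flagged.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2

rng = np.random.default_rng(9)
# 2-D standard-normal data with one obvious multivariate outlier appended.
X = rng.normal(size=(200, 2))
X = np.vstack([X, [8.0, 8.0]])

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

# Squared distances ~ chi2 with df=2 under normality; flag beyond 99.9%.
d2 = np.array([mahalanobis(x, mean, inv_cov) ** 2 for x in X])
threshold = chi2.ppf(0.999, df=2)
outliers = np.where(d2 > threshold)[0]
print(outliers)  # includes index 200, the appended point
```

Note that the mean and covariance here are themselves contaminated by the outlier; robust estimators such as scikit-learn's `MinCovDet` address that.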
