
downsampling for hyperparameter tuning

Downsampling for hyperparameter tuning reduces the dataset size to speed up model training and experimentation while preserving key data characteristics. Here’s a concise overview: Why Downsample for Hyperparameter Tuning? Key Considerations Practical Steps Pitfalls to… 
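A minimal sketch of the idea, assuming scikit-learn is available (the dataset, sizes, and fraction are illustrative, not from the post): a stratified subsample keeps class proportions intact, so hyperparameter rankings on the small set should roughly track the full set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset standing in for a much larger real one.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Stratified downsample to 20% of the rows: class proportions are
# preserved, which is the "preserving key data characteristics" part.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=42
)
print(X_small.shape)  # (200, 5)
```

Tune on `(X_small, y_small)`, then refit the best configuration on the full data.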

What’s Missing Completely At Random (MCAR) data?

Here are some more examples of MCAR (recall that Missing completely at random (MCAR) data occurs when the probability of missing data on a variable is independent of any other measured variables and the underlying… 
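A small simulation of the mechanism (the variable and numbers are illustrative): under MCAR, every value has the same fixed chance of being masked, so the observed values remain an unbiased sample.

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.normal(50_000, 10_000, size=500)

# MCAR: each value is masked with the same probability (20% here),
# regardless of the value itself or of any other variable.
mcar_mask = rng.random(500) < 0.2
income_mcar = np.where(mcar_mask, np.nan, income)

# The mean of the observed values stays close to the true mean.
```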

Expectation Maximization (EM) & implementation

Expectation Maximization (EM) is an iterative algorithm used for finding maximum likelihood estimates of parameters in statistical models, particularly when the model involves latent variables (variables that are not directly observed). The algorithm is commonly… 
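A minimal sketch of EM for a two-component 1-D Gaussian mixture, assuming unit variances and equal weights for simplicity (only the means are estimated; the component label of each point is the latent variable):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data drawn from two Gaussians centered at 0 and 5.
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])

mu = np.array([1.0, 4.0])  # initial guesses for the component means
for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    # (equal priors and unit variances assumed, so only the kernel matters).
    resp = np.exp(-0.5 * (data[:, None] - mu[None, :]) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: each mean becomes a responsibility-weighted average.
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)  # close to [0, 5]
```

Each iteration is guaranteed not to decrease the data log-likelihood, which is the core property of EM.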

A comic guide to denoising noisy data

Handling noisy data is a crucial step in data preprocessing and analysis. In general, here are some common approaches to manage noisy data: 1. Data Cleaning 2. Data Transformation 3. Statistical Techniques 4. Machine Learning… 
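One statistical technique from the list above, sketched with illustrative data (a noisy sine wave) and pandas: a centered rolling mean averages out zero-mean noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
t = np.linspace(0, 2 * np.pi, 200)
noisy = np.sin(t) + rng.normal(0, 0.3, size=200)

# Smoothing: a centered rolling mean over 11 points.
smoothed = pd.Series(noisy).rolling(window=11, center=True, min_periods=1).mean()

# The smoothed series sits closer to the true signal than the raw one.
err_noisy = np.abs(noisy - np.sin(t)).mean()
err_smooth = np.abs(smoothed.to_numpy() - np.sin(t)).mean()
```

The window size trades noise reduction against blurring of genuine features; it is a hyperparameter, not a universal constant.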

A comical guide to Missing Not At Random (MNAR)

Recall that Missing Not At Random (MNAR) is a type of missing data mechanism where the probability of missingness is related to the unobserved data itself. Here are some more examples of MNAR: In each… 
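A small simulation of the mechanism (the variable and probabilities are illustrative): under MNAR the chance of a value going missing depends on the value itself, so the observed sample is biased.

```python
import numpy as np

rng = np.random.default_rng(4)
income = rng.normal(50_000, 10_000, size=2000)

# MNAR: higher incomes are more likely to go unreported --
# the missingness probability rises with the (unobserved) value itself.
p_missing = 1 / (1 + np.exp(-(income - 50_000) / 5_000))
mnar_mask = rng.random(2000) < p_missing
income_mnar = np.where(mnar_mask, np.nan, income)

# The observed mean is biased well below the true mean,
# and no information in the dataset can correct for it.
```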

What’s Missing at Random (MAR)?

Missing at Random (MAR) is a statistical term indicating that the likelihood of data being missing is related to some of the observed data but not to the missing data itself. This means that the… 
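A small simulation of the mechanism (variables and probabilities are illustrative): under MAR, missingness in one variable depends only on another *observed* variable, so it can be corrected by conditioning on that variable.

```python
import numpy as np

rng = np.random.default_rng(5)
age = rng.integers(20, 70, size=2000)
income = rng.normal(50_000, 10_000, size=2000)

# MAR: whether income is missing depends only on observed age
# (older respondents skip the question more often), not on income itself.
p_missing = np.where(age >= 50, 0.4, 0.05)
mar_mask = rng.random(2000) < p_missing
income_mar = np.where(mar_mask, np.nan, income)
```

Because age is fully observed, methods that condition on it (e.g. regression or multiple imputation) can recover unbiased estimates.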

denoising via dimension reduction in python

Dimension reduction methods like Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) can be used for denoising data because they work by retaining the most important features (or dimensions) that capture the majority of… 
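A minimal sketch with scikit-learn's PCA on synthetic data (the low-rank signal and noise level are illustrative): keeping only the top components and projecting back discards the mostly-noise directions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# A rank-2 signal spread across 20 features, buried in Gaussian noise.
signal = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 20))
noisy = signal + rng.normal(0, 0.5, size=(300, 20))

# Project onto the top 2 components, then map back to the original space:
# the discarded directions carry mostly noise.
pca = PCA(n_components=2)
denoised = pca.inverse_transform(pca.fit_transform(noisy))

err_before = np.mean((noisy - signal) ** 2)
err_after = np.mean((denoised - signal) ** 2)
```

In practice the signal rank is unknown; the explained-variance ratio (`pca.explained_variance_ratio_`) is a common guide for choosing `n_components`.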

Missing data analysis: where’s your missing piece?

Missing data can occur for various reasons, including human error, malfunctioning equipment, or even intentional omission. Handling it matters because it can significantly impact the reliability and accuracy… 

Imputation using SoftImpute in python

SoftImpute is a matrix completion algorithm in Python that allows you to fill in missing data in your dataset. This method is based on Singular Value Decomposition (SVD) and Iterative Soft Thresholding. Here’s a basic… 
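A minimal NumPy sketch of the idea (not a production implementation; the shrinkage value and demo matrix are illustrative): repeatedly SVD the matrix, soft-threshold the singular values, and restore the observed entries.

```python
import numpy as np

def soft_impute(X, shrinkage=1.0, n_iters=100):
    """Minimal SoftImpute sketch: zero-fill the gaps, then iterate
    SVD -> soft-threshold singular values -> restore observed entries."""
    mask = np.isnan(X)
    filled = np.where(mask, 0.0, X)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)   # soft thresholding
        low_rank = (U * s) @ Vt
        filled = np.where(mask, low_rank, X)  # keep observed entries fixed
    return filled

# Demo: a rank-1 matrix with ~20% of entries removed is recovered well.
rng = np.random.default_rng(7)
truth = np.outer(rng.normal(size=30), rng.normal(size=20))
X = truth.copy()
X[rng.random(truth.shape) < 0.2] = np.nan
completed = soft_impute(X, shrinkage=0.1)
```

The `fancyimpute` package provides a tuned implementation of the same algorithm; the loop above only shows the mechanics.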

K-Nearest Neighbors (KNN) imputation in sklearn

K-Nearest Neighbors (KNN) imputation is another method to handle missing data. It uses the ‘k’ closest instances (rows) to each instance that contains any missing values to fill in those values. In sklearn, you can… 
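A minimal example with scikit-learn's `KNNImputer` on a toy matrix (the values are illustrative): each missing entry is filled with the mean of that feature over the k nearest rows, where nearness is computed from the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 4.0],   # missing value to be imputed
    [3.0, 4.0, 5.0],
    [8.0, 9.0, 10.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # 3.0: mean of the two closest rows' second column
```

By default distances use `nan_euclidean`, which rescales for the missing coordinates; `weights="distance"` makes closer neighbors count more.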

A comic guide to mean/median/mode imputation & Python codes

Handling missing data is a common preprocessing task in machine learning. In scikit-learn, you can handle missing data by using imputation techniques provided by the SimpleImputer class or by employing other strategies like dropping rows/columns with missing… 
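The three classic strategies side by side with `SimpleImputer`, on an illustrative single-column array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0], [2.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(mean_filled[2, 0], median_filled[2, 0], mode_filled[2, 0])  # 2.25 2.0 2.0
```

Mean is sensitive to outliers, median is robust, and mode (`most_frequent`) is the usual choice for categorical columns.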

SVD for dimension reduction

Singular Value Decomposition (SVD) is a powerful matrix decomposition technique that generalizes the concept of eigenvalue decomposition to non-square matrices. Eigenvalue decomposition specifically decomposes a square matrix into its constituent eigenvalues and eigenvectors. This decomposition… 
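A minimal NumPy sketch on an illustrative non-square matrix: truncating the SVD at rank k gives the best rank-k approximation in the least-squares sense (Eckart–Young), and projecting onto the top right singular vectors gives the reduced representation.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(100, 30))  # non-square: eigendecomposition doesn't apply

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation keeps the k largest singular values.
k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Dimension reduction: project each row onto the top-k right singular vectors.
A_reduced = A @ Vt[:k].T   # shape (100, 10)
```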

test for outliers in multivariate data in Python

To test for outliers in multivariate data in Python, you can use several libraries such as NumPy, SciPy, pandas, and scikit-learn. Here’s how you can do it: Mahalanobis distance using the SciPy library The Mahalanobis distance is a statistical measure used… 
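A minimal sketch on illustrative 2-D data: under multivariate normality, squared Mahalanobis distances follow a chi-squared distribution with d degrees of freedom, so points beyond a high quantile can be flagged.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2

rng = np.random.default_rng(9)
# 2-D standard-normal data with one obvious multivariate outlier appended.
X = rng.normal(size=(200, 2))
X = np.vstack([X, [8.0, 8.0]])

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

# Squared distances ~ chi2 with df=2 under normality; flag beyond 99.9%.
d2 = np.array([mahalanobis(x, mean, inv_cov) ** 2 for x in X])
threshold = chi2.ppf(0.999, df=2)
outliers = np.where(d2 > threshold)[0]
print(outliers)  # includes index 200, the appended point
```

Note that the mean and covariance here are themselves contaminated by the outlier; robust estimators such as scikit-learn's `MinCovDet` address that.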
