To test for outliers in multivariate data in Python, you can use several libraries such as numpy, scipy, pandas, and sklearn. Here’s how you can do it:
Mahalanobis distance using the SciPy library
The Mahalanobis distance is a statistical measure commonly used for multivariate outlier detection: it measures how many standard deviations a point lies from the mean of the dataset, taking the covariance between variables into account. Because of this, it captures the notion of “distance” accurately even when variables are correlated. Points with a large Mahalanobis distance are considered outliers.
import numpy as np
from scipy.spatial import distance

# Mean vector and inverse covariance matrix of the data
mean = np.mean(df, axis=0)
cov = np.cov(df.values.T)
inv_covmat = np.linalg.inv(cov)

# Mahalanobis distance of each row from the mean
d = [distance.mahalanobis(x, mean, inv_covmat) for x in df.values]
df['Mahalanobis'] = d

# Flag the top 2.5% of distances; change the percentile based on your requirement
outliers = df[df['Mahalanobis'] > np.percentile(df['Mahalanobis'], 97.5)]
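If the data are approximately multivariate normal, the squared Mahalanobis distance follows a chi-squared distribution with degrees of freedom equal to the number of variables, so a chi-squared quantile can serve as a more principled cutoff than a raw percentile. A minimal sketch under that normality assumption, continuing from the code above:

from scipy.stats import chi2

# k = number of original variables (exclude the 'Mahalanobis' column added above)
k = df.shape[1] - 1
# d^2 ~ chi-squared(k) under multivariate normality,
# so threshold the distance at the square root of the quantile
threshold = np.sqrt(chi2.ppf(0.975, df=k))
outliers = df[df['Mahalanobis'] > threshold]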
Using Sklearn’s DBSCAN for clustering
DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm used in machine learning that can also serve for outlier detection. Unlike many clustering algorithms, DBSCAN doesn’t require the user to specify the number of clusters; instead, it identifies high-density regions and separates out points in low-density areas (noise), which can be treated as outliers. Its ability to find arbitrarily shaped clusters and its good performance on large datasets make it a popular choice for anomaly detection.
from sklearn.cluster import DBSCAN

# eps: neighborhood radius; min_samples: minimum points to form a dense region
clustering = DBSCAN(eps=3, min_samples=2).fit(df)
labels = clustering.labels_
# DBSCAN labels noise points as -1
outliers = df[labels == -1]
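One caveat worth noting: eps is expressed in the raw units of your features, so DBSCAN is sensitive to feature scale. A minimal sketch that standardizes the data first (the eps and min_samples values here are illustrative, not tuned):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Standardize each feature to zero mean and unit variance
# so that eps means the same thing in every dimension
scaled = StandardScaler().fit_transform(df)

clustering = DBSCAN(eps=0.5, min_samples=5).fit(scaled)
outliers = df[clustering.labels_ == -1]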
Using Sklearn’s Isolation Forest
Isolation Forest is a machine learning algorithm used primarily for anomaly detection. It identifies outliers by building an ensemble of random trees that recursively partition the data; anomalous points are easier to isolate and therefore end up closer to the root of each tree, hence the name “Isolation Forest”.
from sklearn.ensemble import IsolationForest

clf = IsolationForest()
# fit_predict returns 1 for inliers and -1 for outliers
preds = clf.fit_predict(df)
outliers = df[preds == -1]
For more control, you can set the full set of parameters explicitly:
clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.12,
                      max_features=1.0, bootstrap=False, n_jobs=-1,
                      random_state=42, verbose=0)
Let’s break down what these parameters mean:
n_estimators: The number of base estimators in the ensemble, i.e., the number of trees in the forest.
max_samples: The number of samples to draw from X to train each base estimator. If 'auto', then max_samples=min(256, n_samples).
contamination: The proportion of outliers in the data set. Used when fitting to define the threshold on the decision function (see the sketch after this list).
max_features: The number of features to draw from X to train each base estimator.
bootstrap: If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
n_jobs: The number of jobs to run in parallel. fit, predict, decision_function and score_samples are all parallelized over the trees. -1 means using all processors.
random_state: Controls the pseudo-randomness of the selection of the feature and split values for each branching step and each tree in the forest. Pass an int for reproducible output across multiple function calls.
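To see how contamination sets the threshold on the decision function, here is a minimal sketch; the synthetic data and the 0.12 value are illustrative assumptions, not from the original example:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative synthetic data: 200 inliers plus 10 obvious outliers
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(8, 1, size=(10, 2))])
df = pd.DataFrame(X, columns=['x1', 'x2'])

clf = IsolationForest(n_estimators=100, contamination=0.12, random_state=42)
preds = clf.fit_predict(df)

# decision_function returns a score per sample; with contamination=0.12,
# the threshold is chosen so that about 12% of training samples score below 0
scores = clf.decision_function(df)
print((scores < 0).mean())   # roughly 0.12
print(df[preds == -1])       # the rows flagged as outliers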