Test for outliers in multivariate data in Python

To test for outliers in multivariate data in Python, you can use several libraries such as numpy, scipy, pandas, and sklearn. Here’s how you can do it:

Mahalanobis distance using the SciPy library

The Mahalanobis distance is a statistical measure used for multivariate outlier detection, measuring how many standard deviations a point is from the mean of the dataset. It considers the covariance between variables, which means it accurately captures the notion of “distance” even when variables are correlated. Points with a large Mahalanobis distance are considered to be outliers.

import numpy as np
from scipy.spatial import distance

# df is assumed to be a pandas DataFrame of numeric features
mean = np.mean(df, axis=0)           # per-feature mean
cov = np.cov(df.values.T)            # covariance matrix of the features
inv_covmat = np.linalg.inv(cov)      # inverse covariance matrix
d = [distance.mahalanobis(x, mean, inv_covmat) for x in df.values]
df['Mahalanobis'] = d
# flag points above the 97.5th percentile; change the percentile based on your requirement
outliers = df[df['Mahalanobis'] > np.percentile(df['Mahalanobis'], 97.5)]
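
If your data are roughly multivariate normal, you can also derive the cutoff from the chi-square distribution instead of picking a percentile, since the squared Mahalanobis distance then follows a chi-square distribution with as many degrees of freedom as there are features. A minimal sketch under that normality assumption, reusing cov from above:

from scipy.stats import chi2

# squared Mahalanobis distance ~ chi-square with p degrees of freedom,
# assuming the data are approximately multivariate normal
p = cov.shape[0]                            # number of features
threshold = np.sqrt(chi2.ppf(0.975, df=p))  # distance cutoff at the 97.5% quantile
outliers = df[df['Mahalanobis'] > threshold]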

Using Sklearn’s DBSCAN for clustering

DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm that can also be used for outlier detection. Unlike many clustering algorithms, DBSCAN doesn’t require the user to specify the number of clusters; instead, it identifies high-density regions and separates out points in low-density areas (noise), which can be treated as outliers. Its ability to find arbitrarily shaped clusters and its good performance on large datasets make it a popular choice for anomaly detection.

from sklearn.cluster import DBSCAN

# eps and min_samples are data-dependent and usually need tuning
clustering = DBSCAN(eps=3, min_samples=2).fit(df)
labels = clustering.labels_          # DBSCAN labels noise points as -1
outliers = df[labels == -1]
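
Because eps is an absolute distance threshold, DBSCAN is sensitive to the scale of the features, so it usually pays to standardize the data first. A minimal sketch using StandardScaler (the eps and min_samples values here are illustrative, not recommendations):

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# put all features on a comparable scale so eps means the same in every direction
X = StandardScaler().fit_transform(df)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
outliers = df[labels == -1]   # -1 marks the points DBSCAN treated as noise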

Using Sklearn’s Isolation Forest

Isolation Forest is a machine learning algorithm used primarily for anomaly detection. It identifies outliers by recursively partitioning the data with an ensemble of random trees; anomalous points are easier to isolate (they require fewer random splits), hence the name “Isolation Forest”.

from sklearn.ensemble import IsolationForest

clf = IsolationForest()
preds = clf.fit_predict(df)    # returns 1 for inliers and -1 for outliers
outliers = df[preds == -1]

The defaults work out of the box, but for finer control you can set the parameters explicitly. The full set is:

clf = IsolationForest(n_estimators=100,
                      max_samples='auto',
                      contamination=0.12,
                      max_features=1.0,
                      bootstrap=False,
                      n_jobs=-1,
                      random_state=42,
                      verbose=0)

Let’s break down what these parameters mean:

  • n_estimators: The number of base estimators in the ensemble, i.e., the number of trees in the forest.
  • max_samples: The number of samples to draw from X to train each base estimator. If ‘auto’, then max_samples=min(256, n_samples).
  • contamination: The expected proportion of outliers in the data set. Used when fitting to define the threshold on the decision function (a usage sketch follows this list).
  • max_features: The number of features to draw from X to train each base estimator (a float is interpreted as a fraction of the features).
  • bootstrap: If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
  • n_jobs: The number of jobs to run in parallel. fit, predict, decision_function and score_samples are all parallelized over the trees. -1 means using all processors.
  • random_state: Controls the pseudo-randomness of the selection of the feature and split values for each branching step and each tree in the forest. Pass an int for reproducible output across multiple function calls.
  • verbose: Controls the verbosity of the tree-building process.
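
To tie this together, here is a sketch that fits the configured forest and ranks the flagged points by the decision function, where lower scores mean more anomalous. The toy DataFrame and the 12% contamination figure are illustrative assumptions, not recommendations:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# toy data standing in for your own DataFrame (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=['a', 'b', 'c'])

clf = IsolationForest(n_estimators=100,
                      max_samples='auto',
                      contamination=0.12,   # assumed share of outliers; adjust to your data
                      max_features=1.0,
                      bootstrap=False,
                      n_jobs=-1,
                      random_state=42,
                      verbose=0)

preds = clf.fit_predict(df)           # -1 = outlier, 1 = inlier
scores = clf.decision_function(df)    # lower (more negative) = more anomalous
outliers = df[preds == -1].assign(score=scores[preds == -1])
print(outliers.sort_values('score').head())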

