Common distance measures in machine learning, their formulas, use cases, and detailed properties:
1. Euclidean Distance
- Formula: (d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2})
- Use Cases:
- Widely used in clustering (e.g., k-means) and nearest neighbor algorithms.
- Suitable for continuous numerical data.
- Properties:
- Measures straight-line distance in n-dimensional space.
- Requires data normalization when features have different scales.
- Sensitive to outliers due to squaring differences.
- Satisfies metric space properties:
- Non-negativity: (d(x, y) \geq 0)
- Symmetry: (d(x, y) = d(y, x))
- Triangle Inequality: (d(x, z) \leq d(x, y) + d(y, z))
- Identity of Indiscernibles: (d(x, y) = 0 \iff x = y)
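As a quick illustration of the formula, here is a minimal NumPy sketch (the vector values are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 8.0])

# Square the coordinate differences, sum, then take the root.
d = np.sqrt(np.sum((x - y) ** 2))
print(d)                      # 7.0710..., same as np.linalg.norm(x - y)
```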
2. Manhattan Distance (L1 Norm)
- Formula: (d(x, y) = \sum_{i=1}^{n} |x_i - y_i|)
- Use Cases:
- Common in high-dimensional spaces where differences are sparse.
- Often used in grid-like pathfinding (e.g., robotics, city-block navigation).
- Properties:
- Measures the sum of absolute differences along each dimension.
- Less sensitive to outliers compared to Euclidean distance.
- Works for features on different scales, provided the data is normalized first.
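A matching NumPy sketch, reusing the same illustrative vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 8.0])

# Sum of absolute coordinate differences (the L1 norm of x - y).
d = np.sum(np.abs(x - y))
print(d)                      # 3 + 4 + 5 = 12.0
```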
3. Minkowski Distance
- Formula: (d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p})
- Use Cases:
- A generalization of Euclidean and Manhattan distances.
- Offers flexibility through parameter (p) to adjust sensitivity to differences.
- Properties:
- (p = 1): Reduces to Manhattan distance.
- (p = 2): Reduces to Euclidean distance.
- (p \to \infty): Reduces to Chebyshev distance.
- Satisfies metric properties for (p \geq 1).
- Higher p values give more weight to larger differences.
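A small sketch showing how the special cases fall out of one formula (sample vectors are arbitrary):

```python
import numpy as np

def minkowski(x, y, p):
    """(sum_i |x_i - y_i|^p)^(1/p); a valid metric for p >= 1."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 8.0])
print(minkowski(x, y, 1))     # 12.0   -> Manhattan
print(minkowski(x, y, 2))     # 7.07.. -> Euclidean
print(minkowski(x, y, 10))    # 5.05.., approaching the max difference 5 (Chebyshev)
```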
4. Cosine Similarity
- Formula: (\cos \theta = \frac{x \cdot y}{\|x\| \, \|y\|})
- Use Cases:
- Text analysis (e.g., document similarity in NLP).
- High-dimensional vector spaces like TF-IDF or word embeddings.
- Properties:
- Measures the cosine of the angle between two vectors.
- Ranges over ([-1, 1]):
- (1): Identical orientation.
- (0): Orthogonal (no shared direction).
- (-1): Opposite orientation.
- Unaffected by vector magnitude, since the norms in the denominator cancel scale; the corresponding cosine distance is (1 - \cos \theta).
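A minimal sketch of the formula; the three test vectors are chosen to hit each endpoint of the range:

```python
import numpy as np

def cosine_similarity(x, y):
    # Dot product divided by the product of magnitudes; scale cancels.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 0.0])
b = np.array([5.0, 0.0])         # same direction, larger magnitude
c = np.array([0.0, 2.0])         # orthogonal
print(cosine_similarity(a, b))   # 1.0
print(cosine_similarity(a, c))   # 0.0
print(cosine_similarity(a, -a))  # -1.0
```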
5. Hamming Distance
- Formula: (d(x, y) = |\{ i : x_i \neq y_i \}|)
- Use Cases:
- Error detection and correction (e.g., checksum, coding theory).
- Comparison of binary or categorical data (e.g., DNA sequences, binary images).
- Properties:
- Counts positions with mismatched values.
- Applicable only to categorical, string, or binary sequences of equal length.
- Insensitive to the magnitude of differences.
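A plain-Python sketch; the strings and bit vectors are illustrative examples:

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))

print(hamming("karolin", "kathrin"))        # 3
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```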
6. Jaccard Distance
- Formula: (d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|})
- Use Cases:
- Set-based similarity and clustering tasks.
- Binary features or set-like data (e.g., recommendation systems).
- Properties:
- Measures dissimilarity as the complement of Jaccard similarity.
- Values range from 0 (identical sets) to 1 (completely disjoint sets).
- Does not capture magnitude differences in non-binary data.
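A short sketch over Python sets (the example sets are arbitrary):

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; assumes at least one set is non-empty."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 1 - 2/4 = 0.5
print(jaccard_distance({1, 2}, {3, 4}))        # 1.0 (disjoint)
```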
7. Mahalanobis Distance
- Formula: (d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}), where (S) is the covariance matrix.
- Use Cases:
- Multivariate outlier detection.
- Feature selection and reduction.
- Properties:
- Accounts for variable correlations.
- Distance is scale-invariant.
- Sensitive to multicollinearity (requires invertible covariance matrix).
- Reduces to Euclidean distance when (S) is the identity matrix.
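A NumPy sketch on synthetic data (the random sample stands in for a real dataset, and the covariance must be invertible, per the multicollinearity caveat above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                   # synthetic data, illustrative only
S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # requires an invertible covariance

def mahalanobis(x, y, S_inv):
    diff = x - y
    return np.sqrt(diff @ S_inv @ diff)

# Distance of one point from the sample mean, measured along the
# correlated axes of the data rather than raw coordinates.
print(mahalanobis(X[0], X.mean(axis=0), S_inv))
```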
8. Chebyshev Distance
- Formula: (d(x, y) = \max_i |x_i - y_i|)
- Use Cases:
- Grid-based applications (e.g., chessboard distance).
- Situations where strict thresholds are required.
- Properties:
- Measures the maximum coordinate difference.
- Suitable for evenly weighted features.
- Equivalent to (L_\infty) norm.
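The same illustrative vectors as above, in a one-line NumPy sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 8.0])

# Largest single-coordinate difference (the L_inf norm of x - y).
d = np.max(np.abs(x - y))
print(d)                      # 5.0
```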
9. Bray-Curtis Distance
- Formula: (d(p, q) = \frac{\sum_i |p_i - q_i|}{\sum_i (p_i + q_i)})
- Use Cases:
- Used in ecological studies and compositional data analysis.
- Properties:
- Normalized between 0 and 1 for non-negative data.
- Sensitive to relative abundance rather than absolute differences.
- Undefined when (p + q = 0).
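A sketch with made-up abundance counts of the kind used in ecology:

```python
import numpy as np

def bray_curtis(p, q):
    # Undefined when both vectors are all zeros (denominator is 0).
    return np.sum(np.abs(p - q)) / np.sum(p + q)

p = np.array([6.0, 7.0, 4.0])      # e.g., species counts at site 1
q = np.array([10.0, 0.0, 6.0])     # e.g., species counts at site 2
print(bray_curtis(p, q))           # 13 / 33 ≈ 0.394
```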
10. Wasserstein Distance (Earth Mover’s Distance)
- Formula: (W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]), where (\Gamma(P, Q)) is the set of couplings of (P) and (Q).
- Measures the minimum cost of transforming one probability distribution into another.
- Use Cases:
- Comparing probability distributions.
- Generative models (e.g., Wasserstein GANs).
- Properties:
- Captures differences in both location and spread.
- Suitable for 1D and higher-dimensional distributions.
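For the 1D case, SciPy's `scipy.stats.wasserstein_distance` computes (W_1) directly from samples; the Gaussians below are illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=10_000)
b = rng.normal(loc=0.5, scale=1.0, size=10_000)

# For two Gaussians differing only in mean, W1 is roughly the mean shift (~0.5).
print(wasserstein_distance(a, b))
```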
11. Canberra Distance
- Formula: (d(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|})
- Use Cases:
- High-dimensional data with wide variability.
- Data with significant feature scaling differences.
- Properties:
- Sensitive to small values near zero.
- Emphasizes relative differences over absolute values.
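A sketch that guards the 0/0 case by convention (the guard, and the sample values, are choices made here for illustration):

```python
import numpy as np

def canberra(x, y):
    num = np.abs(x - y)
    den = np.abs(x) + np.abs(y)
    out = np.zeros_like(num)
    np.divide(num, den, out=out, where=den > 0)   # 0/0 terms contribute 0
    return out.sum()

# A 0.1 absolute change near zero weighs as much as a 1.0 change near 1.
print(canberra(np.array([1.0, 0.1]), np.array([2.0, 0.2])))  # 1/3 + 1/3 ≈ 0.667
```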
12. Dynamic Time Warping (DTW)
- Formula: (D(i, j) = d(x_i, y_j) + \min\{ D(i-1, j), D(i, j-1), D(i-1, j-1) \})
- Computes the optimal alignment between two temporal sequences.
- Use Cases:
- Speech recognition, time series analysis.
- Comparing sequences of different lengths.
- Properties:
- Aligns sequences non-linearly in time.
- Allows for local stretching and compression of sequences.
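A minimal dynamic-programming sketch of the recurrence above, using absolute difference as the local cost (one common choice among several):

```python
import numpy as np

def dtw(s, t):
    """DTW via the standard O(len(s) * len(t)) dynamic program."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of the three admissible alignment steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A stretched copy of the same shape aligns perfectly despite length mismatch.
print(dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 2, 1, 0]))   # 0.0
```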
| Distance Measure | Use Cases | Key Properties |
|---|---|---|
| Euclidean | Clustering, regression | Sensitive to scale, outliers |
| Manhattan | Sparse data | Robust to outliers, measures along axes |
| Minkowski | Generalization | Flexible sensitivity |
| Cosine Similarity | Text, vectors | Angle-based, scale-invariant |
| Hamming | Categorical, binary data | Mismatch counting |
| Jaccard | Sets, binary data | Focuses on overlap |
| Mahalanobis | Multivariate analysis | Accounts for correlations |
| Chebyshev | Grid-based distances | Maximum coordinate difference |
| Bray-Curtis | Ecological data | Sensitive to proportional differences |
| Wasserstein | Probability distributions | Captures both location and spread |
| Canberra | High variability datasets | Emphasizes relative differences |
| DTW | Time series, speech | Non-linear sequence alignment |