
Common distance measures in machine learning and their properties

Below are common distance measures used in machine learning, along with their formulas, typical use cases, and key properties:

1. Euclidean Distance

  • Formula:
    d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
  • Use Cases:
  • Widely used in clustering (e.g., k-means) and nearest neighbor algorithms.
  • Suitable for continuous numerical data.
  • Properties:
  • Measures straight-line distance in n-dimensional space.
  • Requires data normalization when features have different scales.
  • Sensitive to outliers due to squaring differences.
  • Satisfies metric space properties:
    • Non-negativity: d(p, q) \geq 0
    • Symmetry: d(p, q) = d(q, p)
    • Triangle Inequality: d(p, q) + d(q, r) \geq d(p, r)
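The formula above can be sketched in a few lines of pure Python (standard library only):

```python
import math

def euclidean(p, q):
    # Straight-line (L2) distance between two equal-length vectors.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean([0, 0], [3, 4]))  # 3-4-5 right triangle -> 5.0
```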

2. Manhattan Distance (L1 Norm)

  • Formula:
    d(p, q) = \sum_{i=1}^{n} |p_i - q_i|
  • Use Cases:
  • Common in high-dimensional spaces where differences are sparse.
  • Often used in grid-like pathfinding algorithms (e.g., robotics, city-block navigation).
  • Properties:
  • Measures the sum of absolute differences along each dimension.
  • Less sensitive to outliers compared to Euclidean distance.
  • Features on different scales should still be normalized before use.
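A minimal pure-Python sketch of the L1 formula:

```python
def manhattan(p, q):
    # Sum of absolute per-coordinate differences (L1 norm).
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(manhattan([1, 2], [4, 6]))  # 3 + 4 = 7
```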

3. Minkowski Distance

  • Formula:
    d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{1/p}
  • Use Cases:
  • A generalization of Euclidean and Manhattan distances.
  • Offers flexibility through the order parameter p, which controls sensitivity to large differences.
  • Properties:
  • p=1: Reduces to Manhattan distance.
  • p=2: Reduces to Euclidean distance.
  • p=\infty: Reduces to Chebyshev distance.
  • Satisfies metric properties for (p \geq 1).
  • Higher p values give more weight to larger differences.
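A short sketch of the general formula; the order parameter is named `r` here only to avoid clashing with the point `p`:

```python
def minkowski(p, q, r=2):
    # Order-r Minkowski distance; r=1 is Manhattan, r=2 is Euclidean.
    return sum(abs(pi - qi) ** r for pi, qi in zip(p, q)) ** (1 / r)

print(minkowski([0, 0], [3, 4], r=1))  # Manhattan: 7.0
print(minkowski([0, 0], [3, 4], r=2))  # Euclidean: 5.0
```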

4. Cosine Similarity

  • Formula:
    \text{similarity}(p, q) = \frac{p \cdot q}{\|p\| \, \|q\|}
  • Use Cases:
  • Text analysis (e.g., document similarity in NLP).
  • High-dimensional vector spaces like TF-IDF or word embeddings.
  • Properties:
  • Measures the cosine of the angle between two vectors.
  • Ranges from ([-1, 1]):
    • (1): Identical orientation.
    • (0): Orthogonal (no directional similarity).
    • (-1): Opposite orientation.
  • Unaffected by vector magnitude; only the direction of the vectors matters.
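The dot-product-over-norms formula translates directly to pure Python:

```python
import math

def cosine_similarity(p, q):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(x * x for x in q))
    return dot / (norm_p * norm_q)

# Parallel vectors score 1 regardless of magnitude.
print(cosine_similarity([1, 2], [2, 4]))
```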

5. Hamming Distance

  • Formula:
    d(p, q) = \sum_{i=1}^{n} \mathbf{1}(p_i \neq q_i)
  • Use Cases:
  • Error detection and correction (e.g., checksum, coding theory).
  • Comparison of binary or categorical data (e.g., DNA sequences, binary images).
  • Properties:
  • Counts positions with mismatched values.
  • Defined only for equal-length sequences of categorical, string, or binary data.
  • Insensitive to the magnitude of differences.
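Counting mismatched positions is a one-liner in Python:

```python
def hamming(p, q):
    # Number of positions at which two equal-length sequences differ.
    if len(p) != len(q):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(p, q))

print(hamming("karolin", "kathrin"))  # differs at 3 positions
```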

6. Jaccard Distance

  • Formula:
    d(p, q) = 1 - \frac{|p \cap q|}{|p \cup q|}
  • Use Cases:
  • Set-based similarity and clustering tasks.
  • Binary features or set-like data (e.g., recommendation systems).
  • Properties:
  • Measures dissimilarity as the complement of Jaccard similarity.
  • Values range from 0 (identical sets) to 1 (completely disjoint sets).
  • Does not capture magnitude differences in non-binary data.
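A minimal sketch using Python's built-in set operations (the empty-set convention here is an assumption, matching the "identical sets give 0" property):

```python
def jaccard_distance(a, b):
    # 1 minus intersection-over-union of the two sets.
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty sets are identical
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # overlap 2 of 4 -> 0.5
```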

7. Mahalanobis Distance

  • Formula:
    d(p, q) = \sqrt{(p - q)^T S^{-1} (p - q)}
    where (S) is the covariance matrix.
  • Use Cases:
  • Multivariate outlier detection.
  • Feature selection and reduction.
  • Properties:
  • Accounts for variable correlations.
  • Distance is scale-invariant.
  • Sensitive to multicollinearity (requires invertible covariance matrix).
  • Reduces to Euclidean distance when (S) is the identity matrix.
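A sketch with NumPy (assumed available), directly transcribing the formula; the identity-covariance call illustrates the reduction to Euclidean distance:

```python
import numpy as np

def mahalanobis(p, q, S):
    # S is the covariance matrix of the data; it must be invertible.
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

# With S = identity, the result reduces to plain Euclidean distance.
print(mahalanobis([0, 0], [3, 4], np.eye(2)))
```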

8. Chebyshev Distance

  • Formula:
    d(p, q) = \max_{i} |p_i - q_i|
  • Use Cases:
  • Grid-based applications (e.g., chessboard distance).
  • Situations where the largest single-coordinate difference is the limiting factor.
  • Properties:
  • Measures the maximum coordinate difference.
  • Suitable for evenly weighted features.
  • Equivalent to (L_\infty) norm.
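The max-coordinate formula in pure Python:

```python
def chebyshev(p, q):
    # Largest absolute difference across all coordinates (L-infinity norm).
    return max(abs(pi - qi) for pi, qi in zip(p, q))

print(chebyshev([1, 2], [4, 6]))  # max(3, 4) = 4
```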

9. Bray-Curtis Distance

  • Formula:
    d(p, q) = \frac{\sum_{i=1}^{n} |p_i - q_i|}{\sum_{i=1}^{n} (p_i + q_i)}
  • Use Cases:
  • Used in ecological studies and compositional data analysis.
  • Properties:
  • Normalized between 0 and 1.
  • Sensitive to relative abundance rather than absolute differences.
  • Undefined when both vectors are all zeros (the denominator \sum (p_i + q_i) is zero).
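A minimal sketch, assuming non-negative inputs (as in abundance or count data):

```python
def bray_curtis(p, q):
    # Assumes non-negative components (e.g., species counts or abundances).
    denom = sum(pi + qi for pi, qi in zip(p, q))
    if denom == 0:
        raise ValueError("undefined when all components are zero")
    return sum(abs(pi - qi) for pi, qi in zip(p, q)) / denom

print(bray_curtis([6, 7, 4], [10, 0, 6]))  # 13 / 33
```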

10. Wasserstein Distance (Earth Mover’s Distance)

  • Formula (1-D case):
    W_1(P, Q) = \int_{-\infty}^{\infty} |F_P(x) - F_Q(x)| \, dx
    where F_P and F_Q are the cumulative distribution functions.
  • Measures the minimum cost of transforming one probability distribution into another.
  • Use Cases:
  • Comparing probability distributions.
  • Generative models (e.g., Wasserstein GANs).
  • Properties:
  • Captures differences in both location and spread.
  • Suitable for 1D and higher-dimensional distributions.
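The general case requires solving an optimal-transport problem (SciPy's `scipy.stats.wasserstein_distance` covers 1-D distributions). For the special case of two equal-size 1-D samples with uniform weights, it reduces to comparing sorted samples; a minimal sketch under that assumption:

```python
def wasserstein_1d(p, q):
    # W1 distance between two equal-size, equal-weight 1-D samples:
    # average absolute difference between the sorted samples.
    if len(p) != len(q):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(p), sorted(q))) / len(p)

print(wasserstein_1d([0, 1], [1, 2]))  # each unit of mass moves by 1 -> 1.0
```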

11. Canberra Distance

  • Formula:
    d(p, q) = \sum_{i=1}^{n} \frac{|p_i - q_i|}{|p_i| + |q_i|}
  • Use Cases:
  • High-dimensional data with wide variability.
  • Data with significant feature scaling differences.
  • Properties:
  • Sensitive to small values near zero.
  • Emphasizes relative differences over absolute values.
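A pure-Python sketch; skipping terms where both coordinates are zero follows the common convention (also used by SciPy) of treating those terms as zero:

```python
def canberra(p, q):
    # Terms where both coordinates are zero are treated as zero.
    return sum(abs(pi - qi) / (abs(pi) + abs(qi))
               for pi, qi in zip(p, q) if pi != 0 or qi != 0)

print(canberra([1, 0], [0, 1]))  # 1/1 + 1/1 = 2.0
```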

12. Dynamic Time Warping (DTW)

  • Formula (recurrence over cumulative cost D):
    D(i, j) = \text{cost}(a_i, b_j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}
  • Computes the optimal alignment between two temporal sequences.
  • Use Cases:
  • Speech recognition, time series analysis.
  • Comparing sequences of different lengths.
  • Properties:
  • Aligns sequences non-linearly in time.
  • Allows for local stretching and compression of sequences.
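The classic dynamic program can be sketched in pure Python, using the absolute difference as the local cost (other local costs are possible):

```python
def dtw(a, b):
    # Classic O(n*m) dynamic program; D[i][j] holds the minimal cumulative
    # cost of aligning a[:i] with b[:j].
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# The repeated 2 is absorbed by the warping, so the distance stays 0.
print(dtw([1, 2, 3], [1, 2, 2, 3]))
```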
| Distance Measure  | Use Cases                 | Key Properties                          |
|-------------------|---------------------------|-----------------------------------------|
| Euclidean         | Clustering, regression    | Sensitive to scale, outliers            |
| Manhattan         | Sparse data               | Robust to outliers, measures along axes |
| Minkowski         | Generalization            | Flexible sensitivity                    |
| Cosine Similarity | Text, vectors             | Angle-based, scale-invariant            |
| Hamming           | Categorical, binary data  | Mismatch counting                       |
| Jaccard           | Sets, binary data         | Focuses on overlap                      |
| Mahalanobis       | Multivariate analysis     | Accounts for correlations               |
| Chebyshev         | Grid-based distances      | Maximum coordinate difference           |
| Bray-Curtis       | Ecological data           | Sensitive to proportional differences   |
| Wasserstein       | Probability distributions | Captures both location and spread       |
| Canberra          | High variability datasets | Emphasizes relative differences         |
| DTW               | Time series, speech       | Non-linear sequence alignment           |
