
Common distance measures in machine learning and their properties

Below are common distance measures used in machine learning, with their formulas, typical use cases, and key properties:

1. Euclidean Distance

  • Formula:
    d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
  • Use Cases:
  • Widely used in clustering (e.g., k-means) and nearest neighbor algorithms.
  • Suitable for continuous numerical data.
  • Properties:
  • Measures straight-line distance in n-dimensional space.
  • Requires data normalization when features have different scales.
  • Sensitive to outliers due to squaring differences.
  • Satisfies metric space properties:
    • Non-negativity: d(p, q) \geq 0
    • Symmetry: d(p, q) = d(q, p)
    • Triangle Inequality: d(p, q) + d(q, r) \geq d(p, r)
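The formula translates directly into Python; a minimal stdlib-only sketch (not a library implementation):

```python
import math

def euclidean(p, q):
    # Straight-line (L2) distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean([0, 0], [3, 4]))  # 5.0 (the classic 3-4-5 triangle)
```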

2. Manhattan Distance (L1 Norm)

  • Formula:
    d(p, q) = \sum_{i=1}^{n} |p_i - q_i|
  • Use Cases:
  • Common in high-dimensional spaces where differences are sparse.
  • Often used in grid-like pathfinding algorithms (e.g., robotics, city-block navigation).
  • Properties:
  • Measures the sum of absolute differences along each dimension.
  • Less sensitive to outliers compared to Euclidean distance.
  • Works best when features are on comparable scales; normalize first if they are not.
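Again a minimal stdlib-only sketch of the formula above:

```python
def manhattan(p, q):
    # Sum of absolute coordinate differences (L1 / city-block distance).
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan([0, 0], [3, 4]))  # 7 (walk 3 blocks east, 4 blocks north)
```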

3. Minkowski Distance

  • Formula:
    d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^r \right)^{1/r}
    where (r \geq 1) is the order parameter (often written (p); renamed here to avoid clashing with the vector (p)).
  • Use Cases:
  • A generalization of Euclidean and Manhattan distances.
  • Offers flexibility through the order parameter (r) to adjust sensitivity to differences.
  • Properties:
  • r=1: Reduces to Manhattan distance.
  • r=2: Reduces to Euclidean distance.
  • r=\infty: Reduces to Chebyshev distance.
  • Satisfies metric properties for (r \geq 1).
  • Higher r values give more weight to larger coordinate differences.
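A sketch showing how the order parameter recovers the special cases:

```python
def minkowski(p, q, r):
    # General Lr distance; r=1 is Manhattan, r=2 is Euclidean.
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

print(minkowski([0, 0], [3, 4], 1))  # 7.0  (Manhattan)
print(minkowski([0, 0], [3, 4], 2))  # 5.0  (Euclidean)
```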

4. Cosine Similarity

  • Formula:
    \text{similarity}(p, q) = \frac{p \cdot q}{\|p\| \, \|q\|}
  • Use Cases:
  • Text analysis (e.g., document similarity in NLP).
  • High-dimensional vector spaces like TF-IDF or word embeddings.
  • Properties:
  • Measures the cosine of the angle between two vectors.
  • Ranges over ([-1, 1]):
    • (1): Identical orientation.
    • (0): Orthogonal (no similarity).
    • (-1): Opposite orientation.
  • Unaffected by vector magnitude: only the angle between vectors matters, so no length normalization is required.
  • Often converted into a distance as (1 - \text{similarity}).
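A stdlib-only sketch; note that scaling a vector leaves the result unchanged:

```python
import math

def cosine_similarity(p, q):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

print(cosine_similarity([2, 0], [5, 0]))  # 1.0 (same direction, any magnitude)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
```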

5. Hamming Distance

  • Formula:
    d(p, q) = \sum_{i=1}^{n} \mathbf{1}(p_i \neq q_i)
  • Use Cases:
  • Error detection and correction (e.g., checksum, coding theory).
  • Comparison of binary or categorical data (e.g., DNA sequences, binary images).
  • Properties:
  • Counts positions with mismatched values.
  • Applicable to categorical, string, or binary data; the sequences must be of equal length.
  • Insensitive to the magnitude of differences.
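The mismatch count is a one-liner in Python:

```python
def hamming(p, q):
    # Count positions where two equal-length sequences differ.
    return sum(1 for a, b in zip(p, q) if a != b)

print(hamming("karolin", "kathrin"))  # 3
```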

6. Jaccard Distance

  • Formula:
    d(p, q) = 1 - \frac{|p \cap q|}{|p \cup q|}
  • Use Cases:
  • Set-based similarity and clustering tasks.
  • Binary features or set-like data (e.g., recommendation systems).
  • Properties:
  • Measures dissimilarity as the complement of Jaccard similarity.
  • Values range from 0 (identical sets) to 1 (completely disjoint sets).
  • Does not capture magnitude differences in non-binary data.
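A sketch on Python sets, with the empty-set edge case handled by convention:

```python
def jaccard_distance(p, q):
    # 1 minus the ratio of intersection size to union size.
    p, q = set(p), set(q)
    if not p and not q:
        return 0.0  # convention: two empty sets are identical
    return 1 - len(p & q) / len(p | q)

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5 (overlap 2, union 4)
```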

7. Mahalanobis Distance

  • Formula:
    d(p, q) = \sqrt{(p - q)^T S^{-1} (p - q)}
    where (S) is the covariance matrix.
  • Use Cases:
  • Multivariate outlier detection.
  • Feature selection and reduction.
  • Properties:
  • Accounts for variable correlations.
  • Distance is scale-invariant.
  • Sensitive to multicollinearity (requires invertible covariance matrix).
  • Reduces to Euclidean distance when (S) is the identity matrix.
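A hand-rolled 2-D sketch to make the formula concrete (in practice one would use scipy.spatial.distance.mahalanobis with a precomputed inverse covariance); the identity-matrix case below checks the reduction to Euclidean distance:

```python
import math

def mahalanobis_2d(p, q, S):
    # Invert the 2x2 covariance matrix S = [[a, b], [c, d]] by hand.
    (a, b), (c, d) = S
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [p[0] - q[0], p[1] - q[1]]
    # Compute (p - q)^T S^{-1} (p - q).
    t = [inv[0][0] * dx[0] + inv[0][1] * dx[1],
         inv[1][0] * dx[0] + inv[1][1] * dx[1]]
    return math.sqrt(dx[0] * t[0] + dx[1] * t[1])

# With S = identity, Mahalanobis reduces to Euclidean distance:
print(mahalanobis_2d([0, 0], [3, 4], [[1, 0], [0, 1]]))  # 5.0
```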

8. Chebyshev Distance

  • Formula:
    d(p, q) = \max_{i} |p_i - q_i|
  • Use Cases:
  • Grid-based applications (e.g., chessboard distance).
  • Situations where strict thresholds are required.
  • Properties:
  • Measures the maximum coordinate difference.
  • Suitable for evenly weighted features.
  • Equivalent to (L_\infty) norm.
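A stdlib-only sketch; this is the number of moves a chess king needs between two squares:

```python
def chebyshev(p, q):
    # Largest absolute coordinate difference (L-infinity norm).
    return max(abs(a - b) for a, b in zip(p, q))

print(chebyshev([1, 1], [4, 5]))  # 4 (a king covers the diagonal for free)
```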

9. Bray-Curtis Distance

  • Formula:
    d(p, q) = \frac{\sum_{i=1}^{n} |p_i - q_i|}{\sum_{i=1}^{n} (p_i + q_i)}
  • Use Cases:
  • Used in ecological studies and compositional data analysis.
  • Properties:
  • Normalized between 0 and 1.
  • Sensitive to relative abundance rather than absolute differences.
  • Undefined when (\sum_i (p_i + q_i) = 0), i.e., when both vectors are all zeros.
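A direct sketch of the formula (non-negative abundance vectors assumed, as is typical in ecological data):

```python
def bray_curtis(p, q):
    # Ratio of total absolute difference to total abundance.
    num = sum(abs(a - b) for a, b in zip(p, q))
    den = sum(a + b for a, b in zip(p, q))
    return num / den  # raises ZeroDivisionError when both vectors are all zeros

print(bray_curtis([1, 0], [0, 1]))  # 1.0 (completely disjoint)
print(bray_curtis([1, 2], [1, 2]))  # 0.0 (identical)
```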

10. Wasserstein Distance (Earth Mover’s Distance)

  • Formula:
    W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]
    where (\Gamma(P, Q)) is the set of all couplings of (P) and (Q).
  • Intuitively: the minimum cost of transforming one probability distribution into another.
  • Use Cases:
  • Comparing probability distributions.
  • Generative models (e.g., Wasserstein GANs).
  • Properties:
  • Captures differences in both location and spread.
  • Suitable for 1D and higher-dimensional distributions.
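For the special case of two 1-D samples of equal size, the 1-Wasserstein distance reduces to the mean absolute difference between the sorted samples; a sketch of that special case (for general inputs, scipy.stats.wasserstein_distance handles unequal sizes and weights):

```python
def wasserstein_1d(p, q):
    # 1-Wasserstein distance between two equal-size 1-D samples:
    # sort both, then average the pairwise absolute differences.
    p, q = sorted(p), sorted(q)
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

print(wasserstein_1d([0, 1], [1, 2]))  # 1.0 (shift the whole sample by 1)
```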

11. Canberra Distance

  • Formula:
    d(p, q) = \sum_{i=1}^{n} \frac{|p_i - q_i|}{|p_i| + |q_i|}
  • Use Cases:
  • High-dimensional data with wide variability.
  • Data with significant feature scaling differences.
  • Properties:
  • Sensitive to small values near zero.
  • Emphasizes relative differences over absolute values.
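A sketch of the formula; the zero-denominator terms are skipped by convention (the behavior of, e.g., scipy's implementation for all-zero coordinates):

```python
def canberra(p, q):
    # Sum of per-coordinate relative differences.
    total = 0.0
    for a, b in zip(p, q):
        den = abs(a) + abs(b)
        if den:  # convention: a term where both coordinates are 0 contributes 0
            total += abs(a - b) / den
    return total

print(canberra([1, 0], [0, 1]))  # 2.0 (each coordinate contributes 1.0)
```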

12. Dynamic Time Warping (DTW)

  • Formula (dynamic-programming recurrence):
    D(i, j) = \text{cost}(s_i, t_j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}
  • Computes the optimal alignment between two temporal sequences.
  • Use Cases:
  • Speech recognition, time series analysis.
  • Comparing sequences of different lengths.
  • Properties:
  • Aligns sequences non-linearly in time.
  • Allows for local stretching and compression of sequences.
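The classic dynamic-programming formulation, sketched with absolute difference as the local cost (production code would use an optimized library and typically a warping-window constraint):

```python
def dtw(s, t):
    # O(len(s) * len(t)) DTW: D[i][j] holds the cheapest alignment cost
    # of the first i elements of s with the first j elements of t.
    n, m = len(s), len(t)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the best of: match, stretch s, or stretch t.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Sequences of different lengths can still align perfectly:
print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0 (the repeated 2 is absorbed)
```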
| Distance Measure  | Use Cases                 | Key Properties                         |
|-------------------|---------------------------|----------------------------------------|
| Euclidean         | Clustering, regression    | Sensitive to scale, outliers           |
| Manhattan         | Sparse data               | Robust to outliers, measures along axes|
| Minkowski         | Generalization            | Flexible sensitivity                   |
| Cosine Similarity | Text, vectors             | Angle-based, scale-invariant           |
| Hamming           | Categorical, binary data  | Mismatch counting                      |
| Jaccard           | Sets, binary data         | Focuses on overlap                     |
| Mahalanobis       | Multivariate analysis     | Accounts for correlations              |
| Chebyshev         | Grid-based distances      | Maximum coordinate difference          |
| Bray-Curtis       | Ecological data           | Sensitive to proportional differences  |
| Wasserstein       | Probability distributions | Captures both location and spread      |
| Canberra          | High variability datasets | Emphasizes relative differences        |
| DTW               | Time series, speech       | Non-linear sequence alignment          |
