K-Means Clustering Method & Python Codes

K-Means Clustering is a popular unsupervised machine learning algorithm used for clustering data into groups. It is widely used in various fields such as image processing, market segmentation, and document clustering. The algorithm works by iteratively assigning data points to the nearest cluster centroid and then recalculating the centroid based on the mean of the points assigned to it.

One of the key advantages of K-Means Clustering is its computational efficiency, which makes it suitable for large datasets. However, the algorithm’s results can be sensitive to the initial placement of the cluster centroids, and a poor initialization can lead to suboptimal clusterings, especially for high-dimensional or non-linearly separable data. To address this, techniques such as K-Means++ initialization, or running the algorithm several times with different initializations and keeping the best result, are commonly used. It is also important to choose the number of clusters carefully, since an inappropriate choice can hurt both the quality and the interpretability of the results. Applied thoughtfully and with attention to these limitations, K-Means Clustering remains a valuable tool for exploratory data analysis and pattern recognition in applications ranging from image processing and market segmentation to anomaly detection.
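
As a rough illustration of these remedies, the sketch below (a minimal example, assuming scikit-learn is installed and using a synthetic dataset generated purely for demonstration) requests K-Means++ initialization, reruns the algorithm with several random initializations, and prints the inertia for a range of cluster counts so that an “elbow” can be spotted:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data, used here only to illustrate the idea.
X, _ = make_blobs(n_samples=200, centers=4, random_state=0, cluster_std=0.6)

for k in range(1, 8):
    # init='k-means++' spreads out the initial centroids;
    # n_init=10 reruns K-Means with different initializations and keeps the best result.
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    km.fit(X)
    print(k, km.inertia_)  # inertia stops dropping sharply near a reasonable number of clusters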

Toy example

First, let’s walk through how K-Means clustering works on a small toy dataset, so that the algorithm is clear before we dive into the code. We’ll group a few 2D data points into clusters based on their proximity.

Let’s use the following 2D points:

  • (1, 2)
  • (1.5, 1.8)
  • (5, 8)
  • (8, 8)
  • (1, 0.6)
  • (9, 11)

We’ll group these points into two clusters (k = 2) and walk through, step by step, how the algorithm proceeds.

1. Initialize Centroids

The K-means algorithm randomly selects two initial centroids (cluster centers) from the data points. Let’s assume:

  • Centroid 1: (1, 2)
  • Centroid 2: (8, 8)
2. Assign Points to Clusters

The algorithm assigns each point to the nearest centroid. To calculate “nearest”, we use the Euclidean distance formula. Based on the distances to these centroids, we can assign each point to a cluster:
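
For two 2D points, the Euclidean distance is sqrt((x1 - x2)^2 + (y1 - y2)^2). For example, take the point (5, 8): its distance to Centroid 1 at (1, 2) is sqrt((5 - 1)^2 + (8 - 2)^2) = sqrt(52) ≈ 7.21, while its distance to Centroid 2 at (8, 8) is sqrt((5 - 8)^2 + (8 - 8)^2) = 3. Since 3 < 7.21, the point (5, 8) is assigned to Cluster 2, and the same comparison is made for every point.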

So after the first assignment, we have:

  • Cluster 1: (1, 2), (1.5, 1.8), (1, 0.6)
  • Cluster 2: (5, 8), (8, 8), (9, 11)
3. Recompute Centroids

Now, the centroids are updated by computing the mean of the points assigned to each cluster:

  • New Centroid 1 = Mean of points in Cluster 1: ((1 + 1.5 + 1)/3, (2 + 1.8 + 0.6)/3) = (1.17, 1.47)
  • New Centroid 2 = Mean of points in Cluster 2: ((5 + 8 + 9)/3, (8 + 8 + 11)/3) = (7.33, 9)
4. Reassign Points

The points are reassigned to the nearest centroid, and the process repeats. Over iterations, the clusters stabilize when the assignments no longer change.
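
To make this loop concrete, here is a minimal from-scratch sketch of the four steps above applied to the toy points, using NumPy (which the scikit-learn example later in this post does not require):

import numpy as np

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
centroids = np.array([[1.0, 2.0], [8.0, 8.0]])  # the initial centroids from step 1

for _ in range(10):  # a handful of iterations is plenty for this tiny dataset
    # Steps 2 and 4: assign each point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):  # assignments have stabilized
        break
    centroids = new_centroids

print(labels)     # [0 0 1 1 0 1]: the two clusters found above
print(centroids)  # approximately (1.17, 1.47) and (7.33, 9.0)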

Implementation of KMeans in Python

1. Importing necessary libraries and creating the dataset

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=4, random_state=0, cluster_std=0.6)

This chunk of code does two things: it imports the necessary libraries and creates a sample dataset. The make_blobs function generates isotropic Gaussian blobs, which are ideal for demonstrating clustering.
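
As an optional sanity check of what make_blobs returns (X holds the 2D coordinates, while y holds the ground-truth blob labels, which K-Means itself never sees):

print(X.shape)  # (200, 2): 200 points with 2 features each
print(y[:10])   # true blob labels (values 0 to 3), used here only for inspection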

2. Creating the KMeans model and fitting the data

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)

Here, we’re creating a KMeans instance with 4 clusters (as defined by n_clusters=4). This matches the number of centers in our dataset. The fit(X) function then applies the KMeans algorithm to our dataset.
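
After fitting, the model exposes a few attributes that are worth a quick look; an optional inspection might be:

print(kmeans.cluster_centers_)  # coordinates of the 4 learned centroids
print(kmeans.inertia_)          # sum of squared distances from each point to its nearest centroid
print(kmeans.n_iter_)           # number of iterations the algorithm actually ran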

3. Predicting the clusters

y_kmeans = kmeans.predict(X)

This line uses the predict(X) function to assign each data point in X to one of the clusters. The predictions are stored in y_kmeans.
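
As a side note, scikit-learn’s KMeans also provides fit_predict, which performs the fitting and the label assignment in a single call:

y_kmeans = kmeans.fit_predict(X)  # equivalent to fit(X) followed by predict(X) on the same data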

4. Plotting the data points and cluster centers

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

plt.show()

This chunk of code creates a scatter plot with the data points color-coded according to their cluster assignment; the cluster centers are drawn as larger black dots. The plt.show() function displays the plot.
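
If a numeric check of cluster quality is wanted alongside the plot, one common option is the silhouette score; a minimal sketch, assuming the variables defined above:

from sklearn.metrics import silhouette_score

# Ranges from -1 to 1; values closer to 1 indicate compact, well-separated clusters.
print(silhouette_score(X, y_kmeans))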
