

K-Means Clustering is a popular unsupervised machine learning algorithm used for clustering data into groups. It is widely used in various fields such as image processing, market segmentation, and document clustering. The algorithm works by iteratively assigning data points to the nearest cluster centroid and then recalculating the centroid based on the mean of the points assigned to it.
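To make that assign-and-update cycle concrete, here is a minimal NumPy sketch of a single K-Means iteration (the helper name kmeans_step is ours, not from any library); repeating it until the labels stop changing is essentially the whole algorithm:
import numpy as np

def kmeans_step(X, centroids):
    # Assignment step: label each point with the index of its nearest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of the points assigned to it
    # (assumes every centroid keeps at least one point).
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids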
One of the key advantages of K-Means Clustering is its computational efficiency, which makes it suitable for large datasets. However, the algorithm’s results can be sensitive to the initial placement of the cluster centroids: a poor initialization can lead to suboptimal clustering, especially for high-dimensional or non-linear data. To mitigate this, techniques such as K-Means++ initialization, or simply running the algorithm several times with different initializations, are commonly used. It is also crucial to choose the number of clusters carefully, since an inappropriate choice can hurt the interpretability of the results. Despite these caveats, when applied thoughtfully and with attention to its limitations, K-Means Clustering remains a valuable tool for exploratory data analysis and pattern recognition, from market segmentation to anomaly detection.
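As a rough illustration of those mitigations (the toy dataset mirrors the walkthrough below, and the range of cluster counts is our own choice, not a prescription), the sketch below uses scikit-learn’s K-Means++ seeding with several restarts and prints the inertia for each candidate number of clusters so an “elbow” can be spotted by eye:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0, cluster_std=0.6)

# K-Means++ seeding (scikit-learn's default) plus several restarts reduces the
# risk of landing in a poor local optimum.
km = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0).fit(X)

# Rough "elbow" check for the number of clusters: look for where the
# within-cluster sum of squares (inertia_) stops dropping sharply.
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, model.inertia_)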
Implementation
1. Importing necessary libraries and creating the dataset
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=200, centers=4, random_state=0, cluster_std=0.6)
This chunk of code is doing two things: importing the necessary libraries and creating a sample dataset. The make_blobs function generates isotropic Gaussian blobs, which are ideal for clustering.
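As a quick sanity check that isn’t part of the original snippet, you can inspect what make_blobs returned:
print(X.shape)  # (200, 2): 200 two-dimensional points, convenient for plotting
print(y[:10])   # the true blob memberships; K-Means itself never sees these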
2. Creating the KMeans model and fitting the data
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
Here, we’re creating a KMeans instance with 4 clusters (as defined by n_clusters=4). This matches the number of centers in our dataset. The fit(X) function then applies the KMeans algorithm to our dataset.
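If you want to peek at what fitting produced (a small addition to the walkthrough), the fitted estimator exposes a few handy attributes:
print(kmeans.cluster_centers_)  # coordinates of the four learned centroids
print(kmeans.inertia_)          # within-cluster sum of squared distances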
3. Predicting the clusters
y_kmeans = kmeans.predict(X)
This line uses the predict(X) function to assign each data point in X to one of the clusters. The predictions are stored in y_kmeans.
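A small aside, not in the original post: because the model was fitted on X, the same assignments are also stored on the estimator, and fit_predict offers a one-call shortcut; both are standard scikit-learn API:
# labels_ holds the assignments computed during fit; for the training data
# they correspond to what predict(X) returns.
print(kmeans.labels_[:10])

# fit_predict fits the model and returns the assignments in a single call.
y_kmeans_alt = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)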
4. Plotting the data points and cluster centers
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()
This chunk of code is creating a scatter plot with the data points color-coded according to their cluster assignment. The cluster centers are represented by the black dots. The plt.show() function displays the plot.