CodexBloom - Programming Q&A Platform

Challenges with Implementing K-Means Clustering in Python - Unexpected Cluster Assignments

👀 Views: 6 💬 Answers: 1 📅 Created: 2025-06-04
python numpy k-means machine-learning

I'm implementing the K-Means clustering algorithm with Python 3.10 and NumPy, and I've been banging my head against this for hours while writing unit tests for my implementation: I keep getting cluster assignments that don't make sense given my input data. After initializing the centroids randomly, I expected the algorithm to converge to sensible clusters, but instead some clusters contain points that are clearly far from their corresponding centroid.

Here's a simplified version of my implementation:

```python
import numpy as np

# Sample data points: two well-separated groups, around x=1 and x=10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Number of clusters
k = 2

# Randomly initialize centroids by picking k distinct data points
np.random.seed(42)
centroids = X[np.random.choice(X.shape[0], k, replace=False)]

for _ in range(10):
    # Assign each point to its closest centroid
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    labels = np.argmin(distances, axis=1)

    # Update each centroid to the mean of the points assigned to it
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    if np.all(centroids == new_centroids):
        break
    centroids = new_centroids

print('Final centroids:', centroids)
print('Cluster labels:', labels)
```

When I run this, the final centroids sometimes end up in regions where there are no data points at all, so points from one dense group get assigned to the other cluster. I've tried switching the initialization to K-Means++ centroid seeding, but that hasn't resolved the issue either.

Here's an example of the output I get:

```
Final centroids: [[ 1.  2.]
 [10.  2.]]
Cluster labels: [0 0 0 1 1 1]
```

In this case the result looks correct, but when I change the data points slightly the labels can swap, which leaves me with less meaningful clusters. How can I make the clusters more stable across different initializations? Could there be an issue with the way I'm calculating distances or updating the centroids? Any insights would be greatly appreciated!
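Update: for reference, here's roughly the K-Means++ seeding I tried, in case I got that part wrong. This is a simplified sketch; `kmeans_pp_init` is just my local helper name, and `X` and `k` are the same as above:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # K-Means++ seeding: take the first centroid uniformly at random, then
    # pick each subsequent centroid with probability proportional to its
    # squared distance from the nearest centroid chosen so far.
    centroids = [X[rng.integers(X.shape[0])]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, np.newaxis] - np.array(centroids), axis=2) ** 2,
            axis=1,
        )
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(X.shape[0], p=probs)])
    return np.array(centroids)

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
k = 2
rng = np.random.default_rng(42)
centroids = kmeans_pp_init(X, k, rng)  # replaces the random init above
```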
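And to make concrete what I mean by "stable across initializations": the only workaround I can think of is running several restarts and keeping the run with the lowest inertia (sum of squared distances from each point to its assigned centroid), along the lines of the sketch below. `run_kmeans` is a hypothetical wrapper around my loop above (it doesn't handle clusters that end up empty), and I swapped the exact float equality check for `np.allclose`. Is this the right approach, or am I papering over a real bug?

```python
import numpy as np

def run_kmeans(X, k, rng, n_iter=10):
    # One K-Means run: random init, then alternate assignment/update steps.
    # Returns (centroids, labels, inertia). Does not handle empty clusters.
    centroids = X[rng.choice(X.shape[0], k, replace=False)]
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iter):
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    # Inertia: sum of squared distances from each point to its centroid
    inertia = np.sum((X - centroids[labels]) ** 2)
    return centroids, labels, inertia

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
k = 2
rng = np.random.default_rng(0)

# Run several restarts and keep the result with the lowest inertia
best = min((run_kmeans(X, k, rng) for _ in range(10)), key=lambda r: r[2])
centroids, labels, inertia = best
print('Best centroids:', centroids, 'inertia:', inertia)
```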