
Inconsistent Results from K-Means Clustering Implementation in Python - Centroid Initialization Issues

👀 Views: 0 💬 Answers: 1 📅 Created: 2025-06-14
k-means scikit-learn clustering

I'm facing an issue with my K-Means clustering implementation using Scikit-learn in Python. Despite using the same dataset and parameters, I get inconsistent clustering results each time I run the algorithm. I suspect it's related to centroid initialization, since I'm using the default `'k-means++'` method. My dataset is relatively small: 500 samples with 5 features, which I'm trying to cluster into 3 distinct groups. Here's a snippet of my code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data (5 features, to match the real dataset)
X, y = make_blobs(n_samples=500, n_features=5, centers=3,
                  cluster_std=0.60, random_state=0)

# K-Means clustering with k-means++ initialization
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10,
                max_iter=300, random_state=42)

# Fit the model
kmeans.fit(X)

# Get the cluster labels
labels = kmeans.labels_
print(labels)
```

When I run this code, the clusters don't always cover the same data points across executions. Sometimes points that should belong to the same cluster end up assigned to different clusters, even though I'm passing `random_state=42` to ensure reproducibility.

I've tried increasing the `n_init` parameter to 20 in hopes of getting more stable results, but that hasn't resolved the issue. I've also tried the `random` initialization method by setting `init='random'`, which produced even more variable results.

I want to understand the underlying reason for this inconsistency and how it can be mitigated. Are there specific best practices for centroid initialization in K-Means, or should I look into alternative methods or libraries for better stability? Any help would be greatly appreciated!
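For what it's worth, one check I'm planning to run (a sketch, not yet verified against my real pipeline) is to confirm whether two runs differ only in how the cluster integers are numbered, rather than in the actual grouping. `adjusted_rand_score` is invariant to label permutation, so it should distinguish the two cases:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same synthetic data as above
X, _ = make_blobs(n_samples=500, n_features=5, centers=3,
                  cluster_std=0.60, random_state=0)

# Two fits that differ only in their random seed
labels_a = KMeans(n_clusters=3, init='k-means++', n_init=10,
                  random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, init='k-means++', n_init=10,
                  random_state=1).fit_predict(X)

# ARI is 1.0 when the two partitions agree, regardless of which
# integer each cluster happens to be named
print(adjusted_rand_score(labels_a, labels_b))
```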
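I've also read that `KMeans` accepts an explicit array of starting centroids via `init`, which (if I understand the docs correctly) removes initialization as a source of variation entirely. A minimal sketch, assuming three arbitrary rows of `X` as the hand-picked starting points:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=5, centers=3,
                  cluster_std=0.60, random_state=0)

# Fixed starting centroids (here: three arbitrary data points);
# with an explicit array, scikit-learn expects n_init=1
initial_centers = X[[0, 100, 200]]

kmeans = KMeans(n_clusters=3, init=initial_centers, n_init=1,
                max_iter=300)
labels = kmeans.fit_predict(X)
print(labels[:20])
```

Is pinning the centroids like this considered reasonable practice, or does it just hide a deeper problem with my setup?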