Inconsistent Results from K-Means Clustering Implementation in Python - Centroid Initialization Issues
Hey everyone, I'm stuck on something that should probably be simple. I'm working on a personal project and have run into a problem with my K-Means clustering implementation using Scikit-learn in Python. Despite using the same dataset and parameters, I get inconsistent clustering results each time I run the algorithm. I suspect it is related to centroid initialization, since I'm using the default 'k-means++' method. My dataset is relatively small (500 samples with 5 features), and I'm trying to cluster it into 3 distinct groups.

Here's a snippet of my code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data (500 samples, 5 features, 3 true centers)
X, y = make_blobs(n_samples=500, n_features=5, centers=3,
                  cluster_std=0.60, random_state=0)

# K-Means clustering with a fixed seed
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10,
                max_iter=300, random_state=42)

# Fit the model
kmeans.fit(X)

# Get the cluster labels
labels = kmeans.labels_
print(labels)
```

When I execute this code multiple times, the clusters don't always contain the same data points. Sometimes points that should belong to the same cluster are assigned to different clusters, even though I'm setting `random_state=42` for reproducibility.

I've tried increasing `n_init` from 10 to 20 in the hope of getting more stable results, but that didn't resolve the issue. I also tried `init='random'`, which made the results vary even more.

I want to understand the underlying reason for this inconsistency and how to mitigate it. Are there specific best practices for centroid initialization in K-Means, or should I look into alternative methods or libraries for better stability? Has anyone else encountered this?
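To narrow this down, a check along the following lines might help tell two cases apart: runs that produce the same partition under different label numbers, and runs that actually group the points differently. This is only a minimal sketch under some assumptions: the helper name `fit_labels` is hypothetical, the second seed (7) is arbitrary, and the data uses `n_features=5` to match the description above. It relies on `adjusted_rand_score`, which compares two partitions without caring which integer each cluster was given.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same synthetic setup as in the question (n_features=5 is an assumption)
X, _ = make_blobs(n_samples=500, n_features=5, centers=3,
                  cluster_std=0.60, random_state=0)


def fit_labels(seed):
    """Hypothetical helper: fit K-Means with the given seed, return labels."""
    km = KMeans(n_clusters=3, init="k-means++", n_init=10,
                max_iter=300, random_state=seed)
    return km.fit_predict(X)


same_a = fit_labels(42)   # first run, seed 42
same_b = fit_labels(42)   # second run, same seed
other = fit_labels(7)     # run with a different, arbitrary seed

# Exact equality: are the label arrays identical element by element?
identical_same_seed = np.array_equal(same_a, same_b)

# Permutation-invariant agreement between the two partitions
ari_other_seed = adjusted_rand_score(same_a, other)

print("Same seed, identical labels:", identical_same_seed)
print("Different seed, adjusted Rand index:", ari_other_seed)
```

If the adjusted Rand index is close to 1.0 while the raw label arrays differ, the groupings agree and only the cluster numbering changed between runs. A value well below 1.0 would point to genuinely different partitions.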
My environment is Python running in a Docker container on Windows 11, and this clustering step is part of a larger REST API for a web app I'm building. Any ideas what could be causing this? Am I missing something obvious? Thanks in advance!