Unexpected Runtime Increase in K-Means Clustering with Large Datasets

πŸ‘€ Views: 2 πŸ’¬ Answers: 1 πŸ“… Created: 2025-06-11
scikit-learn k-means machine-learning Python

I've been struggling with this for a few days on a collaborative project and could really use some help. I'm implementing K-Means clustering in Python with `scikit-learn` (version 0.24.2) for a project involving large datasets (millions of data points), and I'm running into performance problems that lead to unacceptably long execution times as the dataset grows. Initially, I used the following code to fit the model:

```python
from sklearn.cluster import KMeans
import numpy as np

# Simulated large dataset: 1 million points in a 10-dimensional space
data = np.random.rand(1000000, 10)

kmeans = KMeans(n_clusters=5, random_state=42)

# Fit the model and assign each point to a cluster
groups = kmeans.fit_predict(data)
```

This works fine for smaller datasets, but once I scale up to about 10 million data points, `fit_predict` takes significantly longer than expected, and I get the warning `ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (5). Possibly due to duplicate points in X.`

I've tried increasing the `max_iter` parameter, but it barely makes a difference, and I suspect the problem lies in how I'm generating the data or handling the distances. I also checked whether the data has many duplicates or is very sparse, but that doesn't seem to be the case. I also considered using the `n_init` parameter to run multiple initializations, but that only increases the computation time further.

What can I do to optimize my K-Means setup for datasets this large without sacrificing cluster quality? Are there alternative algorithms or techniques I should consider for better scalability? Is there a better approach? (In case it matters, I recently upgraded my Python version.) For context, sketches of what I tried are below.
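This is roughly the tuning I tried; the specific values here are illustrative, not the exact ones from my runs:

```python
from sklearn.cluster import KMeans

# More iterations plus extra restarts; both mainly added runtime
kmeans = KMeans(
    n_clusters=5,
    max_iter=600,   # up from the default of 300
    n_init=20,      # run 20 centroid seedings (default is 10), keep the best
    random_state=42,
)
groups = kmeans.fit_predict(data)
```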
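And this is roughly how I checked for duplicates. A full row-wise `np.unique` is slow at 10 million points, so I ran it on a random sample; the sampling step is just for illustration:

```python
import numpy as np

# Sample rows to keep the uniqueness check tractable at this scale
rng = np.random.default_rng(0)
sample = data[rng.choice(len(data), size=100000, replace=False)]

# Count exact duplicate rows in the sample
unique_rows = np.unique(sample, axis=0)
print(f"{len(sample) - len(unique_rows)} duplicate rows "
      f"in a {len(sample)}-row sample")
```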
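One alternative I've been eyeing is `MiniBatchKMeans`, which updates centroids from small random batches instead of full passes over the data. I haven't validated how much cluster quality it gives up at this scale, so treat this as a sketch of what I'd try, with a guessed batch size:

```python
from sklearn.cluster import MiniBatchKMeans

# Mini-batch K-Means: each update touches only batch_size points,
# so per-iteration cost no longer scales with the full dataset
mbk = MiniBatchKMeans(
    n_clusters=5,
    batch_size=10000,  # batch size is a guess; tune for memory/speed
    random_state=42,
)
groups = mbk.fit_predict(data)
```

If anyone knows whether the inertia gap between this and full `KMeans` is typically acceptable at tens of millions of points, that would help me decide too.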