Performance Bottleneck in K-Means Clustering Implementation in Python - Runtime Exceeds Expected Limits
I'm working on a personal project and have tried several approaches, but none seem to work. I'm implementing K-Means clustering with Python and scikit-learn, and I'm hitting a significant performance bottleneck on large datasets. The dataset contains about 1 million samples with 50 features, and the algorithm takes an unexpectedly long time to converge, often exceeding 15 minutes.

I've tried tuning the number of clusters and lowering the `n_init` parameter to reduce the number of initializations, but I still see long runtimes. Here's a simplified version of my implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

# Generate synthetic data
X = np.random.rand(1000000, 50)

# K-Means configuration
kmeans = KMeans(n_clusters=10, n_init=10, max_iter=300, random_state=42)

# Fit model
kmeans.fit(X)
```

I also attempted the `init='k-means++'` option to improve initialization, but it didn't help much. In addition, I've noticed that memory usage spikes while the algorithm is running, which makes me suspect it isn't scaling well with the data size.

Are there other configuration parameters or techniques I could use to speed up convergence? And are there any best practices for managing memory usage with datasets this large?
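
For reference, the initialization variant I experimented with looked roughly like the sketch below; the `n_init=3` value is just one of the lower settings I tried, and the synthetic data is the same as above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Same synthetic data as in the snippet above
X = np.random.rand(1000000, 50)

# Variant with explicit k-means++ initialization and fewer restarts
# (n_init=3 is an illustrative value, not a recommendation)
kmeans = KMeans(
    n_clusters=10,
    init='k-means++',
    n_init=3,
    max_iter=300,
    random_state=42,
)
kmeans.fit(X)
```

Even with this configuration, the runtime and memory behavior were essentially the same as with my original setup.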