CodexBloom - Programming Q&A Platform

Difficulty Implementing K-Means Clustering with Custom Distance Metric in Python - Getting Inconsistent Results

👀 Views: 285 💬 Answers: 1 📅 Created: 2025-06-12
machine-learning clustering k-means python

I'm currently implementing the K-means clustering algorithm in Python with a custom distance metric. My initial implementation using the standard Euclidean distance worked fine, but after switching to Manhattan distance the clustering results became inconsistent and sometimes produced empty clusters, particularly when the data is sparse.

Here's the relevant code snippet for my K-means implementation:

```python
import numpy as np

class KMeans:
    def __init__(self, n_clusters=3, max_iters=100):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.centroids = None

    def fit(self, X):
        n_samples, n_features = X.shape
        # Initialize centroids from randomly chosen data points
        random_indices = np.random.choice(n_samples, self.n_clusters, replace=False)
        self.centroids = X[random_indices]
        for i in range(self.max_iters):
            labels = self._assign_labels(X)
            # Note: mean() over an empty cluster yields NaN centroids
            new_centroids = np.array([X[labels == j].mean(axis=0)
                                      for j in range(self.n_clusters)])
            if np.all(new_centroids == self.centroids):
                break
            self.centroids = new_centroids

    def _assign_labels(self, X):
        distances = np.array([self._manhattan_distance(X, centroid)
                              for centroid in self.centroids])
        return np.argmin(distances, axis=0)

    def _manhattan_distance(self, X, centroid):
        return np.sum(np.abs(X - centroid), axis=1)
```

I suspect the problem arises when distances between points and centroids are computed with Manhattan distance, especially for points that are far apart. I've tried initializing the centroids from random data points and fixing a seed for reproducibility, but I still run into cases where some clusters end up empty and the final centroids don't represent any actual data points. While debugging, I printed the distances calculated for each point and noticed that some points seemed to be assigned to the wrong clusters, particularly when they were equidistant from two centroids. Is there a best practice for handling this situation?
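To rule out ties as a source of randomness, I ran a tiny sanity check (standalone, not part of my class) confirming that `np.argmin` breaks an exact tie deterministically by picking the lowest-index centroid:

```python
import numpy as np

# A point exactly equidistant (Manhattan) from two centroids
point = np.array([[5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Manhattan distance from the point to each centroid: both are 10.0
distances = np.array([np.sum(np.abs(point - c), axis=1) for c in centroids])

# np.argmin resolves the tie by taking the first (lowest-index) centroid
label = np.argmin(distances, axis=0)
# label[0] == 0, every run
```

So ties are resolved the same way every run, which makes me think the inconsistency comes from somewhere else (probably the random initialization or the empty-cluster case).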
Should I adjust how I handle empty clusters or reinitialize centroids when this occurs? Any insights on improving the robustness of my clustering would be greatly appreciated! I'm on macOS using the latest version of Python. I'm open to any suggestions.
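One idea I've been sketching for the empty-cluster case is to reseed an empty cluster's centroid at the point farthest from its currently assigned centroid. This is just a rough standalone sketch (the function name and heuristic are my own, not from any library):

```python
import numpy as np

def reseed_empty_clusters(X, labels, centroids):
    """If a cluster has no members, move its centroid to the point
    farthest (Manhattan distance) from that point's current centroid,
    and reassign the point. A heuristic sketch, not a tested fix."""
    n_clusters = centroids.shape[0]
    labels = labels.copy()
    centroids = centroids.copy()
    for j in range(n_clusters):
        if not np.any(labels == j):
            # Manhattan distance of every point to its assigned centroid
            dists = np.sum(np.abs(X - centroids[labels]), axis=1)
            farthest = np.argmax(dists)
            centroids[j] = X[farthest]
            labels[farthest] = j
    return labels, centroids
```

The intent is to call this inside `fit` right after `_assign_labels`, before recomputing the means, so no cluster is ever empty when `mean(axis=0)` runs. I'm not sure whether this is the standard approach or whether full reinitialization is preferred.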