How to Implement K-Means Clustering in R with a Custom Distance Metric?

👀 Views: 1751 💬 Answers: 1 📅 Created: 2025-06-03

I'm trying to implement the K-Means clustering algorithm in R for a dataset containing geographical coordinates (latitude and longitude) of various locations. I need to adapt the algorithm to use a custom distance metric, the great-circle distance, instead of the standard Euclidean distance. I've been using the `stats` package for the initial implementation, but I'm running into issues with how to efficiently replace the distance calculation. Here’s a simplified version of my code:

```r
set.seed(123)

# Sample data: latitude and longitude
locations <- data.frame(
  lat = c(34.0522, 36.7783, 40.7128, 37.7749),
  lon = c(-118.2437, -119.4179, -74.0060, -122.4194)
)

# Initialize centroids randomly
k <- 2  # Number of clusters
centroids <- locations[sample(nrow(locations), k), ]

# Helper: base R has no built-in deg2rad(), so it is defined here
deg2rad <- function(deg) deg * pi / 180

# Custom distance function using the haversine formula
haversine <- function(lat1, lon1, lat2, lon2) {
  R <- 6371  # Radius of the Earth in km
  delta_lat <- deg2rad(lat2 - lat1)
  delta_lon <- deg2rad(lon2 - lon1)
  a <- sin(delta_lat / 2)^2 +
    cos(deg2rad(lat1)) * cos(deg2rad(lat2)) * sin(delta_lon / 2)^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  R * c  # Distance in km
}

# Clustering process
for (i in 1:10) {  # Iterate 10 times
  clusters <- apply(locations, 1, function(row) {
    dists <- apply(centroids, 1, function(centroid) {
      haversine(row[1], row[2], centroid[1], centroid[2])
    })
    which.min(dists)  # Assign to nearest centroid
  })

  # Recalculate centroids as the arithmetic mean of lat/lon per cluster
  centroids <- aggregate(locations, by = list(cluster = clusters), FUN = mean)[, -1]
}
```

With this implementation, the cluster assignments stop changing after a few iterations, and I suspect that the custom distance metric is not being applied correctly. I also receive warnings from the mean function when calculating centroids, specifically `In mean.default(...) : argument is not numeric or logical: returning NA`.
Has anyone successfully implemented K-Means with a custom distance metric in R? Why might the clusters not be converging as I expect, and how can I correctly compute the new centroids so that they are consistent with the great-circle distance? I'm using R version 4.1.0 with the `stats` package that ships with it, running in a Docker container on Ubuntu 22.04. I'm coming from a different tech stack and still learning R, so any pointers in the right direction would be greatly appreciated.
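One possible direction, offered as a minimal sketch rather than a definitive fix: `stats::kmeans` only supports Euclidean distance, so the loop has to be written by hand. The sketch below assigns points with a vectorized haversine distance and updates each centroid as a spherical mean (average of the points' 3D unit vectors, projected back to latitude/longitude) instead of an arithmetic mean of degrees, which can go wrong near the ±180° meridian. The names `deg2rad`, `spherical_mean`, and `haversine_kmeans` are hypothetical helpers introduced here for illustration; they are not part of base R or `stats`.

```r
# Hypothetical helper: base R has no deg2rad()
deg2rad <- function(deg) deg * pi / 180

# Vectorized haversine distance in km; inputs in degrees
haversine <- function(lat1, lon1, lat2, lon2, R = 6371) {
  dlat <- deg2rad(lat2 - lat1)
  dlon <- deg2rad(lon2 - lon1)
  a <- sin(dlat / 2)^2 +
    cos(deg2rad(lat1)) * cos(deg2rad(lat2)) * sin(dlon / 2)^2
  a <- pmin(1, pmax(0, a))  # guard against floating-point drift outside [0, 1]
  2 * R * asin(sqrt(a))
}

# Hypothetical helper: mean position on the sphere via averaged unit vectors
spherical_mean <- function(lat, lon) {
  phi <- deg2rad(lat)
  lambda <- deg2rad(lon)
  x <- mean(cos(phi) * cos(lambda))
  y <- mean(cos(phi) * sin(lambda))
  z <- mean(sin(phi))
  c(lat = atan2(z, sqrt(x^2 + y^2)) * 180 / pi,
    lon = atan2(y, x) * 180 / pi)
}

# Hypothetical helper: Lloyd-style iterations with haversine assignment
haversine_kmeans <- function(locations, k, max_iter = 100) {
  n <- nrow(locations)
  centroids <- as.matrix(locations[sample(n, k), c("lat", "lon")])
  clusters <- rep(0L, n)

  for (iter in seq_len(max_iter)) {
    # n x k matrix: distance from every point to every centroid
    d <- vapply(seq_len(k), function(j) {
      haversine(locations$lat, locations$lon,
                centroids[j, "lat"], centroids[j, "lon"])
    }, numeric(n))
    dim(d) <- c(n, k)

    new_clusters <- max.col(-d, ties.method = "first")  # index of row minimum
    if (all(new_clusters == clusters)) break  # assignments stable: converged
    clusters <- new_clusters

    # Update step; an empty cluster keeps its previous centroid
    for (j in seq_len(k)) {
      members <- clusters == j
      if (any(members)) {
        centroids[j, ] <- spherical_mean(locations$lat[members],
                                         locations$lon[members])
      }
    }
  }
  list(cluster = clusters, centers = centroids, iterations = iter)
}

# Usage with the sample data from the question
set.seed(123)
locations <- data.frame(
  lat = c(34.0522, 36.7783, 40.7128, 37.7749),
  lon = c(-118.2437, -119.4179, -74.0060, -122.4194)
)
fit <- haversine_kmeans(locations, k = 2)
print(fit$cluster)
print(fit$centers)
```

Stopping when assignments no longer change is the usual convergence check, so clusters that stop moving after a few iterations on a four-point dataset are expected rather than a sign of a bug. If the cluster centers must be actual data points, k-medoids over a precomputed haversine distance matrix (for example `cluster::pam` with a `dist` object) may be a better fit than a hand-written K-Means.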