How to efficiently compute pairwise Euclidean distances with large datasets using NumPy?
I'm probably missing something obvious here, but I'm trying to compute the pairwise Euclidean distances between two large datasets using NumPy and I'm running into performance issues. I have two arrays, `A` and `B`, each with shape `(10000, 128)`. My naive approach with a double for loop takes far too long, and I want to leverage NumPy for better performance.

Here's the simple implementation I tried first:

```python
import numpy as np

A = np.random.rand(10000, 128)
B = np.random.rand(10000, 128)

# Naive implementation - too slow (10^8 Python-level calls to np.linalg.norm)
pairwise_distances = np.zeros((A.shape[0], B.shape[0]))
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        pairwise_distances[i, j] = np.linalg.norm(A[i] - B[j])
```

This takes several minutes to complete, which isn't feasible for my application.

I also tried a vectorized version based on the expansion `||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b`, but it is still quite slow:

```python
A_expanded = A[:, np.newaxis, :]  # shape (10000, 1, 128)
B_expanded = B[np.newaxis, :, :]  # shape (1, 10000, 128)

# Vectorized distance computation.
# The squared norms have shapes (10000, 1) and (1, 10000), so they broadcast
# to (10000, 10000); B's norms must not be transposed here.
pairwise_distances = np.sqrt(
    np.sum(A_expanded**2, axis=2)
    + np.sum(B_expanded**2, axis=2)
    - 2 * np.dot(A, B.T)
)
```

While this is faster than the double loop, it still takes a considerable amount of time and memory. When I run it, I end up using almost 2.4 GB of RAM just for the intermediate arrays (each `(10000, 10000)` float64 array is about 800 MB), which isn't ideal. I've read that avoiding the explicit use of `np.linalg.norm` can improve performance, but I'm not sure how to do that without losing accuracy.

Could someone suggest a more efficient way to compute pairwise Euclidean distances for large datasets using NumPy? I'm looking for a method that reduces both runtime and memory usage while maintaining precision. What am I doing wrong? Any examples would be super helpful.

This is part of a larger CLI tool I'm building. I'm working with Python in a Docker container on Ubuntu 20.04.
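
For reference, here is a minimal sketch of one way to reduce the intermediate memory: apply the same `||a||^2 + ||b||^2 - 2 * a.b` expansion one block of rows at a time and write each block straight into a preallocated output. The function name `pairwise_euclidean` and the `block_size` default are hypothetical choices for illustration, not something taken from NumPy, and this has not been benchmarked on the setup described above.

```python
import numpy as np


def pairwise_euclidean(A, B, block_size=1024, dtype=np.float64):
    """Pairwise Euclidean distances between the rows of A and the rows of B.

    Hypothetical helper: the name and the block_size default are illustrative.
    """
    A = np.asarray(A, dtype=dtype)
    B = np.asarray(B, dtype=dtype)

    # Squared row norms, computed once: shapes (n,) and (m,)
    a_sq = np.einsum("ij,ij->i", A, A)
    b_sq = np.einsum("ij,ij->i", B, B)

    out = np.empty((A.shape[0], B.shape[0]), dtype=dtype)
    for start in range(0, A.shape[0], block_size):
        stop = min(start + block_size, A.shape[0])
        block = out[start:stop]  # a view, so updates land in `out`

        # Cross term for this block of rows, written directly into the output
        np.matmul(A[start:stop], B.T, out=block)
        block *= -2.0
        block += a_sq[start:stop, np.newaxis]
        block += b_sq[np.newaxis, :]

        # Rounding can leave tiny negative values; clamp before the sqrt
        np.maximum(block, 0.0, out=block)
        np.sqrt(block, out=block)
    return out
```

Called as `D = pairwise_euclidean(A, B)`, the only `(10000, 10000)` array allocated is the result itself, which is still about 800 MB in float64; passing `dtype=np.float32` would halve that at the cost of precision. One caveat on accuracy: the expansion can lose digits through cancellation when two points nearly coincide, so the clamp prevents NaNs but does not recover the lost precision. If the full matrix is not actually needed (for example, only the nearest neighbours are), reducing each block inside the loop instead of storing it would avoid the large output entirely.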