CodexBloom - Programming Q&A Platform

Inconsistent results with np.random.choice when using replace=False on large datasets in NumPy 1.24.2

👀 Views: 11 💬 Answers: 1 📅 Created: 2025-06-09
numpy random sampling Python

I'm migrating some code and I'm working through a tutorial and I've spent hours debugging this and I'm experiencing inconsistent results when using `np.random.choice` with the `replace=False` argument on large datasets. Specifically, when I try to sample without replacement from an array of size 1,000,000, I occasionally receive duplicate values in the output, which shouldn't happen. Here's the code I'm using: ```python import numpy as np # Create a large array of unique integers large_array = np.arange(1, 1000001) # Sample 10 values without replacement sampled_values = np.random.choice(large_array, size=10, replace=False) print(sampled_values) ``` I expected the `sampled_values` to always contain unique integers from `large_array`. However, sometimes I find duplicate values in the output when I run the code multiple times. I've ensured that the seed is set for reproducibility: ```python np.random.seed(42) ``` Additionally, I've tried using `np.random.choice` with a smaller array (e.g., 100 elements), and it works correctly, always giving unique values. It seems the scenario only arises with larger datasets. I've also considered potential edge cases with the sampling size being larger than the array size, but that's not the case here. Is there any known bug or limitation with `np.random.choice` in NumPy 1.24.2 when dealing with large arrays and sampling without replacement? Any insights would be appreciated! Thanks, I really appreciate it! This is my first time working with Python stable. Am I missing something obvious?