CodexBloom - Programming Q&A Platform

Performance issues with nested loops when filtering large datasets in Pandas

šŸ‘€ Views: 81 šŸ’¬ Answers: 1 šŸ“… Created: 2025-07-17
pandas performance dataframe optimization Python

I'm experiencing significant performance issues when filtering a large DataFrame with nested loops in Pandas, specifically version 1.5.0. The DataFrame contains millions of rows, and my current nested-loop approach is extremely slow. Here is the code snippet:

```python
import pandas as pd

# Sample DataFrame creation for demonstration
# Assume 'data' is a large DataFrame with over a million rows
data = pd.DataFrame({
    'A': range(1000000),
    'B': range(1000000, 2000000),
    'C': range(2000000, 3000000)
})

# Nested loops to filter rows based on multiple conditions
filtered_data = []
for i in range(len(data)):
    for j in range(len(data)):
        if data.iloc[i]['A'] < 500000 and data.iloc[j]['B'] > 1500000:
            filtered_data.append(data.iloc[i])

# Convert the list back to a DataFrame
filtered_data = pd.DataFrame(filtered_data)
```

This takes an extraordinarily long time to execute, and I get the impression the approach is far from optimal, especially since it has quadratic time complexity and every `.iloc` access goes through slow Python-level indexing. I tried switching to vectorized operations, but I'm unsure how to properly replace the nested loops. Can anyone suggest a more efficient method for filtering the DataFrame without resorting to them?

I attempted `data[(data['A'] < 500000) & (data['B'] > 1500000)]`, but it doesn't capture the same logic I intended with the nested loops, leading to unexpected results. Any insights on optimizing this would be greatly appreciated!
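
To make the intended semantics concrete, here is a minimal sketch of what I believe my nested loops actually compute (thresholds and column names are from the snippet above; `equivalent_filter` is just a name I made up for this post). Since the inner loop checks `B` on row `j` but never uses `j` for anything else, each matching row `i` gets appended once for every row whose `B` exceeds 1500000:

```python
import pandas as pd

def equivalent_filter(data: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical vectorized rewrite of the nested loops above.
    # Row i (with A < 500000) is appended once per row j with
    # B > 1500000, so the output is the A-mask rows, each repeated
    # count_b times -- a cross-join-style blowup, not a per-row filter.
    count_b = int((data['B'] > 1500000).sum())  # number of matching j's
    a_rows = data[data['A'] < 500000]           # the matching i's
    if count_b == 0:
        return a_rows.iloc[0:0]                 # no j matched: empty result
    # index.repeat keeps the loop's ordering: row 0 count_b times, then row 1, ...
    return a_rows.loc[a_rows.index.repeat(count_b)]
```

If that repetition is not actually what I want (I suspect it isn't), then the single-mask version above is presumably the right target, and it runs in linear rather than quadratic time.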
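
And here is a tiny toy run I used to convince myself that the two behaviours diverge (sizes and values are made up for illustration, shrunk so the quadratic loop finishes instantly):

```python
import pandas as pd

# Toy frame small enough for the quadratic loop to finish
toy = pd.DataFrame({'A': [1, 2, 9], 'B': [5, 9, 9]})

# Loop version: appends row i once per row j with B > 8
loop_rows = []
for i in range(len(toy)):
    for j in range(len(toy)):
        if toy.iloc[i]['A'] < 3 and toy.iloc[j]['B'] > 8:
            loop_rows.append(toy.iloc[i])
loop_result = pd.DataFrame(loop_rows)

# Single-mask version: both conditions tested on the same row
mask_result = toy[(toy['A'] < 3) & (toy['B'] > 8)]

print(len(loop_result))  # 4 -- rows 0 and 1 each appear twice
print(len(mask_result))  # 1 -- only row 1 satisfies both conditions itself
```

So the loop version multiplies matches across rows while the mask filters row by row, which would explain the "unexpected results" I mentioned. Is the single mask the idiomatic replacement if a per-row filter is what I'm after?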