Performance of Pandas DataFrame Iteration with Custom Row Manipulation
I've searched everywhere and can't find a clear answer. I'm trying to optimize iteration over a large Pandas DataFrame (approximately 1 million rows) to apply custom row manipulations based on conditions. My current approach uses `.iterrows()`, which I've found to be quite slow. Here's a simplified version of the code I'm using:

```python
import pandas as pd

# Sample DataFrame creation
rows = 10**6
data = {'A': range(rows), 'B': range(rows, 2 * rows)}
df = pd.DataFrame(data)

# Custom function to manipulate rows
def custom_row_manipulation(row):
    if row['A'] % 2 == 0:
        return row['B'] + 10
    return row['B'] - 10

# Iterating over the DataFrame and applying the function row by row
for index, row in df.iterrows():
    df.at[index, 'B'] = custom_row_manipulation(row)
```

This code works but takes an unacceptably long time to execute. I realize that `.iterrows()` is not the most efficient way to handle this, especially for larger DataFrames. I've tried using `apply()` instead, but I still hit a performance bottleneck:

```python
df['B'] = df.apply(custom_row_manipulation, axis=1)
```

I also attempted to vectorize the operation using NumPy, but I got stuck trying to preserve the conditional logic (my rough attempt is at the end of this post). I'm not sure how to apply the row manipulations efficiently without this kind of bottleneck. Is there a better approach? Any advice on optimizing my DataFrame manipulation would be greatly appreciated.

I'm using pandas 1.3.0 on Ubuntu, and I recently upgraded from Python 3.8 to 3.9. Could this slowdown be a known issue with either version?
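For reference, here's roughly what my NumPy attempt looked like. It's just a minimal sketch using `np.where`, translating the even/odd condition from `custom_row_manipulation` into a single array expression over the same columns as above; it seems to give the right values on a small sample, but I don't know whether it's the right pattern for more involved logic:

```python
import numpy as np

# Vectorized version of custom_row_manipulation:
# add 10 to 'B' where 'A' is even, subtract 10 where 'A' is odd
df['B'] = np.where(df['A'] % 2 == 0, df['B'] + 10, df['B'] - 10)
```

In particular, I'm unsure how to extend this once the real manipulation involves several chained conditions rather than a single even/odd check.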