CodexBloom - Programming Q&A Platform

Pandas: Trouble with groupby on large DataFrame causing MemoryError

👀 Views: 274 💬 Answers: 1 📅 Created: 2025-06-11
pandas groupby memory-error dataframe Python

Hey everyone, I'm running into an issue that's driving me crazy. I'm working with a large DataFrame of around 10 million rows and need to group by multiple columns to calculate the average of a specific numeric column. However, when I try to execute the `groupby` operation, I hit a `MemoryError`.

Here is the code snippet I am using:

```python
import pandas as pd

# Simulating a large DataFrame (~10 million rows; a multiple of 3 so the
# repeated category lists line up with the 'value' column)
rows = 9_999_999

data = {
    'category': ['A', 'B', 'C'] * (rows // 3),
    'sub_category': ['X', 'Y', 'Z'] * (rows // 3),
    'value': range(rows)
}
df = pd.DataFrame(data)

# Grouping and aggregating
result = df.groupby(['category', 'sub_category'])['value'].mean().reset_index()
```

When I run this code, I get the following error:

```
MemoryError: Unable to allocate 10.5 GiB for an array with shape (3, 3333333) and data type float64
```

I have tried using the `chunksize` parameter in `read_csv()` while importing the data, but since the data is already loaded into a DataFrame, I am not sure how to proceed. Additionally, I attempted to optimize my DataFrame by using `astype()` to convert the `value` column to `float32`, but it did not reduce memory usage significantly. (I've pasted rough sketches of both attempts at the bottom of this post.)

Is there a more memory-efficient way to perform this `groupby` operation on such a large dataset? Any help would be greatly appreciated! My development environment is macOS.
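For reference, here is roughly what my `chunksize` attempt looked like (the file name `data.csv` is just a placeholder for wherever the data actually comes from, and I may be misremembering the exact call). The problem is that I end up concatenating the chunks back into one big DataFrame before grouping, so it doesn't really help:

```python
import pandas as pd

# Rough sketch of my chunked-read attempt ('data.csv' is a placeholder path).
# Each chunk is small on its own, but concatenating them rebuilds the full
# 10-million-row DataFrame before the groupby, so memory blows up anyway.
chunks = pd.read_csv('data.csv', chunksize=1_000_000)
df = pd.concat(chunks, ignore_index=True)

result = df.groupby(['category', 'sub_category'])['value'].mean().reset_index()
```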
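And this is the dtype downcast I mentioned; the `memory_usage(deep=True)` calls are just how I checked the footprint before and after:

```python
# Downcasting the numeric column, as described above. This shrinks the
# per-element size of 'value', but overall memory usage barely moved for me.
print(df.memory_usage(deep=True))   # before

df['value'] = df['value'].astype('float32')

print(df.memory_usage(deep=True))   # after
```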