CodexBloom - Programming Q&A Platform

OCI Data Science: How to Handle Large Datasets with Python SDK Version 2.12.0

👀 Views: 280 đŸ’Ŧ Answers: 1 📅 Created: 2025-06-11
oci data-science pandas Python

Hey everyone, I'm running into an issue that's driving me crazy. I'm working on a project where I need to train a machine learning model on a large dataset stored in OCI Object Storage. I've been using the Python SDK version 2.12.0 to interact with Object Storage, but I'm hitting memory problems when trying to read the data. The dataset is around 10 GB, and when I attempt to load it directly into a pandas DataFrame, I get a memory error: `MemoryError: Unable to allocate array with shape (1000000, 10) and data type float64`.

To address this, I've tried using the `oci.object_storage` client to download the file in chunks, but I'm not sure how to implement this effectively. Here's the code I've been using to download the file:

```python
import oci
import pandas as pd

config = oci.config.from_file()  # Load config from ~/.oci/config
object_storage_client = oci.object_storage.ObjectStorageClient(config)

bucket_name = 'my_bucket'
object_name = 'large_dataset.csv'
namespace = object_storage_client.get_namespace().data

response = object_storage_client.get_object(namespace, bucket_name, object_name)

# Stream the object to disk in 1 MiB chunks instead of holding it all in memory
with open('large_dataset.csv', 'wb') as f:
    for chunk in response.data.raw.stream(1024 * 1024, decode_content=False):
        f.write(chunk)
```

While this code successfully downloads the file, loading it into pandas still causes memory issues. I've also explored using `chunksize` in `pd.read_csv`, but I couldn't get it to work with the local file after downloading (rough sketch of what I tried is below). How can I efficiently handle this large dataset and avoid memory errors when loading it into pandas for processing? Are there best practices for dealing with large datasets in OCI Data Science that I should consider? I'm using Python 3.11 in this project. What am I doing wrong?
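For reference, this is roughly the `chunksize` pattern I attempted — the chunk size and the per-chunk aggregation are just placeholders, not my real processing:

```python
import pandas as pd

# Iterate over the downloaded CSV in 100k-row chunks instead of loading it all at once
chunk_means = []
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    # Placeholder step: my real code would do feature prep / partial model fits here
    chunk_means.append(chunk.mean(numeric_only=True))

# Combine the per-chunk statistics into a single per-column summary
summary = pd.concat(chunk_means, axis=1).mean(axis=1)
print(summary)
```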