OCI Data Pipeline: How to Efficiently Transfer Large Data Sets from Object Storage to Autonomous Database Using OCI Data Flow
As part of converting an old personal project, I'm building a data pipeline to move large datasets (up to 10 GB) from OCI Object Storage to an Autonomous Database using OCI Data Flow, and I'm running into performance problems. I've spent hours debugging this and tried several approaches, but none of them work.

The data is in CSV format and requires some transformations before loading into the database. I've set up a Spark job in OCI Data Flow to handle this, but the job times out and fails with this error: `Job 1234567890 failed. Reason: Spark Task failed with exception: org.apache.spark.SparkException: Job aborted due to stage failure`. I've tried increasing the number of driver and executor nodes, but that didn't help.

Here is a simplified version of the code I'm using in my Spark job:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DataTransfer').getOrCreate()

# Read CSV from Object Storage
input_path = 'oci://bucket-name/path/to/data.csv'
df = spark.read.csv(input_path, header=True, inferSchema=True)

# Perform some transformations
transformed_df = df.withColumn('new_column', df['existing_column'] * 2)

# Write to Autonomous Database
output_path = 'jdbc:oracle:thin:@hostname:port/service_name'
transformed_df.write.format('jdbc') \
    .option('url', output_path) \
    .option('dbtable', 'schema.table_name') \
    .option('user', 'username') \
    .option('password', 'password') \
    .mode('append') \
    .save()
```

I've confirmed that the connection to the Autonomous Database works, and I can manually load smaller files without issues. The problem only seems to appear with larger datasets. I've also considered partitioning the data before writing it to the database, but I'm unsure how to implement that effectively within my current setup. Any suggestions on optimizing this process, or specific configurations to try, would be greatly appreciated!
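To make the partitioning idea concrete, this is roughly what I had in mind. It is only a sketch and I have not verified that it fixes the failure. The helper names `jdbc_write_options` and `write_partitioned` are my own, the driver class is what I assume the Oracle JDBC driver exposes, and the values 8 and 10000 are untuned guesses. It relies on Spark's JDBC `numPartitions` option (which caps the number of concurrent connections) and `batchsize` option (rows per insert round trip).

```python
def jdbc_write_options(url, table, user, password,
                       num_partitions=8, batch_size=10000):
    """Build Spark JDBC writer options (hypothetical helper).

    num_partitions and batch_size are guesses to tune, not recommendations.
    """
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumed driver class for the Oracle thin driver.
        "driver": "oracle.jdbc.OracleDriver",
        # Upper bound on parallel JDBC connections used for the write.
        "numPartitions": str(num_partitions),
        # Rows sent per insert batch (Spark's default is 1000).
        "batchsize": str(batch_size),
    }


def write_partitioned(df, options):
    """Repartition a Spark DataFrame, then append it over JDBC."""
    n = int(options["numPartitions"])
    (df.repartition(n)
       .write.format("jdbc")
       .options(**options)
       .mode("append")
       .save())


# Placeholder connection details, same as in my job above.
opts = jdbc_write_options(
    url="jdbc:oracle:thin:@hostname:port/service_name",
    table="schema.table_name",
    user="username",
    password="password",
)

# In the Spark job this would replace the existing write call:
# write_partitioned(transformed_df, opts)
```

Is this the right direction, or is there a better way to control write parallelism from Data Flow?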
For context: this is my first time working with a recent stable Python release. What's the best practice here?