CodexBloom - Programming Q&A Platform

Slow Performance When Using Spark SQL Over Large Parquet Files

πŸ‘€ Views: 0 πŸ’¬ Answers: 1 πŸ“… Created: 2025-06-14
apache-spark spark-sql performance python

I've been struggling with this for a few days and could really use some help. I'm hitting a performance problem while running Spark SQL queries on large Parquet files in Spark 3.4.0. My dataset consists of several large Parquet files (approximately 100 GB total) stored in S3, and I'm trying to perform a simple aggregation over one of the columns. The query looks something like this:

```sql
SELECT category, SUM(sales) as total_sales
FROM sales_data
GROUP BY category
```

I've tried adjusting `spark.sql.shuffle.partitions`, but it doesn't seem to have a noticeable impact. Initially I set it to 200, then increased it to 400 hoping it would help with the repartitioning during the aggregation, but the query still takes upwards of 15 minutes to complete. I also attempted to take advantage of predicate pushdown by adding filters directly to my DataFrame before running the SQL query, but that didn't yield any improvement either.

Here's a snippet of how I'm loading the data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('SalesDataAnalysis') \
    .getOrCreate()

sales_df = spark.read.parquet('s3://my-bucket/sales_data/')
```

I'm not seeing any significant skew in the data; the categories appear to be fairly evenly distributed. I also checked the size of the DataFrame after loading:

```python
sales_df.count()  # returns approximately 1 million rows
```

In the Spark UI I can see that the tasks are taking a long time to execute, mainly due to high shuffle read and write times, which suggests the data isn't being processed as efficiently as it could be.

Could there be specific configurations or optimizations that I'm overlooking? What would be the recommended way to handle this?

For context: this is a Python job developed on macOS, and it feeds a REST API running on Linux as part of a larger web app.
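In case it helps, here's a condensed sketch of how the pieces fit together: the shuffle-partition setting, the S3 load, the pre-filter I added hoping for predicate pushdown, and the aggregation query. The filter condition below is just a placeholder (my real filters are on different columns), and the bucket path is the same stand-in as above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName('SalesDataAnalysis')
    # Tried 200 (the default) and then 400; neither changed the runtime noticeably
    .config('spark.sql.shuffle.partitions', '400')
    .getOrCreate()
)

# Load the Parquet dataset from S3 (placeholder path)
sales_df = spark.read.parquet('s3://my-bucket/sales_data/')

# Pre-filter before running the SQL query, hoping Spark would push the
# predicate down to the Parquet scan (placeholder condition, not my actual filter)
filtered_df = sales_df.filter(F.col('category').isNotNull())
filtered_df.createOrReplaceTempView('sales_data')

# The aggregation that currently takes upwards of 15 minutes
result = spark.sql("""
    SELECT category, SUM(sales) as total_sales
    FROM sales_data
    GROUP BY category
""")
result.show()
```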