Apache Spark 3.4.0 - Unresponsive Behavior When Writing Streaming DataFrames to Kafka with Multiple Partitions
I've hit a wall using Apache Spark 3.4.0 to write streaming data to Kafka: the job becomes unresponsive whenever the DataFrame being written contains multiple partitions. The streaming query seems to hang indefinitely without any apparent errors in the logs, and I need to figure out what could be causing this. I've already tried increasing the batch interval and reducing the number of partitions, but the problem persists.

Here's a simplified version of my code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("Kafka Streaming")
  .getOrCreate()

// Read lines from a local socket; this source yields a single string column named "value",
// which is exactly the column the Kafka sink expects
val inputDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// Write each micro-batch to the Kafka topic "my-topic"
val query = inputDF.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "my-topic")
  .option("checkpointLocation", "/tmp/kafka-to-kafka-checkpoint")
  .outputMode("append")
  .start()

query.awaitTermination()
```

When I check the Kafka topic, no records are being produced, and the job just seems to hang. The input data being read from the socket is valid, and I have verified that Kafka itself is running properly. I also made sure the Kafka connector dependency is present in my build.sbt:

```sbt
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.4.0"
```

I monitored the Spark UI as well, but it shows no tasks being executed. I suspect there might be an issue with how the DataFrame is being partitioned or with how the streaming write is handled, but I'm not sure how to diagnose it further.

Has anyone faced similar issues with Kafka and Spark Structured Streaming, and what steps can I take to troubleshoot or resolve this? I'm on Ubuntu 22.04 using the latest version of Scala.
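For context, here is the kind of isolation test I was planning to run next: swapping the Kafka sink for the console sink should tell me whether the hang is in the Kafka write path or in the source itself. This is just a rough sketch that reuses `spark` and `inputDF` from the snippet above; the checkpoint path and the 30-second timeout are arbitrary choices on my part, not values from my real job:

```scala
// Isolation test: replace the Kafka sink with the console sink.
// If rows print here, the socket source is fine and the hang is on the
// Kafka side; if nothing prints, the source is not delivering data at all.
// (Assumes the same `spark` and `inputDF` as in the snippet above;
// the checkpoint path below is a hypothetical scratch location.)
val debugQuery = inputDF.writeStream
  .format("console")
  .option("truncate", "false")
  .option("checkpointLocation", "/tmp/console-debug-checkpoint")
  .outputMode("append")
  .start()

// Wait up to 30 seconds, then inspect the query's state instead of blocking forever
debugQuery.awaitTermination(30000)
println(debugQuery.status)        // current status, e.g. whether it is waiting for data
println(debugQuery.lastProgress)  // null until the first micro-batch completes
```

My thinking is that if `status` reports it is still waiting for data and `lastProgress` stays null, the problem is upstream of the Kafka sink, but I'd appreciate confirmation that this is a sound way to narrow it down.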