CodexBloom - Programming Q&A Platform

GCP Dataflow Pipeline Failing with 'java.lang.IllegalArgumentException: Invalid input data' Error on Avro Files

👀 Views: 64 💬 Answers: 1 📅 Created: 2025-09-06
gcp dataflow avro bigquery Java

As part of converting an old project, I'm working on a Google Cloud Dataflow pipeline that reads Avro files from a Google Cloud Storage bucket. The intention is to process these files and write the output to BigQuery. However, the pipeline fails consistently with `java.lang.IllegalArgumentException: Invalid input data` when I run it, and I've tried everything I can think of. I've cross-checked the schema of the Avro files, and it appears to match the schema expected by my pipeline configuration.

Here's a snippet of how I'm defining the pipeline:

```java
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

// Read Avro records from GCS, transform them, and append the rows to BigQuery.
p.apply("ReadAvroFiles", AvroIO.read(MyAvroClass.class)
        .from("gs://my-bucket/data/*.avro"))
 .apply("TransformData", ParDo.of(new MyTransformFn()))
 .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withSchema(myBQSchema)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

p.run().waitUntilFinish();
```

What I've tried so far:

- Validated the Avro files with the Avro tools library; they are correctly formatted, but the error persists when the Dataflow job runs.
- Specified the Avro schema explicitly in the `AvroIO.read` call; this didn't resolve the issue.
- Confirmed that the GCP service account has the necessary permissions to access both Google Cloud Storage and BigQuery.

Could this issue be related to the specific Avro file version, or to some incompatibility between my Avro files and the Dataflow pipeline? Any insights or suggestions on how to debug this would be greatly appreciated.

This is part of a larger service I'm building, a REST API running on Ubuntu 22.04. Has anyone else encountered this, or could someone point me to the right documentation? Any feedback is welcome!
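Since the stack trace is not shown here, it is unclear whether the exception comes from the Avro read step or from `MyTransformFn`. One way to narrow it down, if the transform turns out to be the source, is to catch the exception per element and keep the offending input instead of letting the whole job fail. The following is only a minimal plain-Java sketch of that dead-letter idea, not Beam code: `DeadLetterSketch`, `convertAll`, and `toRow` are hypothetical names, and `toRow` is an assumed stand-in for whatever `MyTransformFn` actually does. In a real pipeline the same try/catch would presumably sit inside the `DoFn`, with failures routed to a second output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Plain-Java sketch of a dead-letter pattern; this is not Beam code.
final class DeadLetterSketch {

    /** Holds converted elements plus the inputs that failed conversion. */
    static final class Result<I, O> {
        final List<O> converted = new ArrayList<>();
        final List<I> failedInputs = new ArrayList<>();
        final List<String> failureReasons = new ArrayList<>();
    }

    /**
     * Applies convert to each input. An IllegalArgumentException raised for one
     * element is recorded together with that element instead of aborting the run.
     */
    static <I, O> Result<I, O> convertAll(List<I> inputs, Function<I, O> convert) {
        Result<I, O> result = new Result<>();
        for (I input : inputs) {
            try {
                result.converted.add(convert.apply(input));
            } catch (IllegalArgumentException e) {
                result.failedInputs.add(input);
                result.failureReasons.add(String.valueOf(e.getMessage()));
            }
        }
        return result;
    }

    /** Hypothetical stand-in for the conversion logic inside MyTransformFn. */
    static String toRow(String rawValue) {
        if (rawValue == null || rawValue.isEmpty()) {
            throw new IllegalArgumentException("Invalid input data");
        }
        return "row:" + rawValue;
    }

    public static void main(String[] args) {
        // Assumed sample inputs; the empty string plays the role of a bad record.
        List<String> inputs = Arrays.asList("a", "", "b");
        Result<String, String> result = convertAll(inputs, DeadLetterSketch::toRow);
        System.out.println("converted=" + result.converted);
        System.out.println("failed count=" + result.failedInputs.size()
                + " reasons=" + result.failureReasons);
    }
}
```

Logging the failed inputs alongside their reasons should show whether a specific field or record shape triggers the exception, which is harder to see from a job-level failure alone.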