CodexBloom - Programming Q&A Platform

GCP Dataflow Pipeline Fails with 'Input Split Not Found' Error When Using Apache Beam 2.30.0

👀 Views: 4481 💬 Answers: 1 📅 Created: 2025-06-20
google-cloud-dataflow apache-beam gcp Python

I'm stuck on something that should probably be simple. My Google Cloud Dataflow pipeline fails during execution with an error stating `Input split not found for <file-path>`. I'm using Apache Beam version 2.30.0 and have set up a pipeline to read from a large dataset stored in Google Cloud Storage. The pipeline is relatively straightforward: it reads from a text file, processes the lines, and writes the output to another file in GCS.

Here's a simplified version of the code:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ProcessLine(beam.DoFn):
    def process(self, element):
        # Simulate some processing
        yield element.upper()


options = PipelineOptions(
    project='my-gcp-project',
    runner='DataflowRunner',
    temp_location='gs://my-bucket/temp',
    region='us-central1'
)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read From GCS' >> beam.io.ReadFromText('gs://my-bucket/input.txt')
     | 'Process Lines' >> beam.ParDo(ProcessLine())
     | 'Write To GCS' >> beam.io.WriteToText('gs://my-bucket/output')
    )
```

I have double-checked the file path and permissions, and the input file exists. When I run the pipeline, it consistently throws this error after a few seconds, even though smaller datasets seem to process without issue. I've tried switching the runner to `DirectRunner` for testing, and it works fine, but the `DataflowRunner` fails with the same error every time.

Is this an issue with the size of the dataset, or is it perhaps related to how Dataflow handles input splits? Any insights on how to troubleshoot this, or workarounds, would be greatly appreciated.

For context: I'm using Python on Linux, and this issue appeared after updating Python to the latest stable release. Any ideas what could be causing this? Thanks in advance!
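For anyone trying to reproduce this, here is a rough sketch of how the input could be checked through Beam's own filesystem layer rather than with `gsutil`, to see what the SDK resolves the path to and how large it thinks the file is. It uses `FileSystems.match` from `apache_beam.io.filesystems`. The helper names `match_input` and `summarize` are made up for this example, and it assumes `apache-beam[gcp]` is installed and credentials are configured. It only confirms what the submitting machine sees, not what the Dataflow workers see.

```python
def match_input(pattern):
    """Resolve `pattern` through Beam's filesystem layer and return file metadata.

    Hypothetical helper for illustration; not part of the original pipeline.
    """
    # Imported lazily so summarize() below can be used without Beam installed.
    from apache_beam.io.filesystems import FileSystems

    # FileSystems.match takes a list of patterns and returns one MatchResult
    # per pattern; each MatchResult carries a metadata_list of matched files.
    match_result = FileSystems.match([pattern])[0]
    return match_result.metadata_list


def summarize(metadata_list):
    """Return (file_count, total_bytes) for objects exposing `size_in_bytes`."""
    sizes = [m.size_in_bytes for m in metadata_list]
    return len(sizes), sum(sizes)
```

Calling `match_input('gs://my-bucket/input.txt')` and printing each entry's `path` and `size_in_bytes`, followed by `summarize(...)`, should show whether the pattern matches exactly one object of the expected size. An empty result would point at the path or permissions rather than at splitting.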