GCP Dataflow Job Fails with 'Job Validation Error' on Aggregate Transform with Apache Beam 2.35.0
I'm relatively new to this, so bear with me. I'm running a Dataflow job using Apache Beam 2.35.0 to aggregate some streaming data from Pub/Sub, but I'm hitting a 'Job Validation Error' when the pipeline tries to execute an `Aggregate` transform. I've tried several approaches but none seem to work. Here's a simplified version of my code:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class AggregateFn(beam.CombineFn):
    def create_accumulator(self):
        return 0

    def add_input(self, accumulator, input):
        return accumulator + input

    def merge_accumulators(self, accumulators):
        return sum(accumulators)

    def extract_output(self, accumulator):
        return accumulator


options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (p
     | 'Read from PubSub' >> beam.io.ReadFromPubSub(subscription='projects/my-project/subscriptions/my-subscription')
     | 'Parse JSON' >> beam.Map(lambda x: json.loads(x))
     | 'Extract Value' >> beam.Map(lambda x: x['value'])
     | 'Aggregate Values' >> beam.CombineGlobally(AggregateFn()))
```

When I run this, I see the following error in the Dataflow logs:

```
Error: Job Validation Error: The transform 'Aggregate Values' is not supported for the given input.
```

I've checked the input schema and ensured that the data being aggregated is numeric. The error arises specifically with the `CombineGlobally` transform. I tried using `CombinePerKey` instead to see if it made any difference (sketch in the edit below), but I encountered similar issues. I've also verified that the Pub/Sub messages are being consumed correctly by adding logging after the `Parse JSON` step, and I can see valid data coming through.

I've looked through the documentation but couldn't find any clear guidance on what might be causing this validation error. Are there any specific requirements for using `CombineGlobally` with streaming sources in Dataflow that I might be overlooking? Am I missing something obvious? Any help would be appreciated!
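
**Edit:** for completeness, here is roughly what my `CombinePerKey` attempt looked like; it hit the same validation error. The `'sensor_id'` key field is just a placeholder for the actual key field in my messages:

```python
# Same pipeline as above, but keyed: pair each value with a field from the
# message before combining. 'sensor_id' stands in for my real key field.
(p
 | 'Read from PubSub' >> beam.io.ReadFromPubSub(subscription='projects/my-project/subscriptions/my-subscription')
 | 'Parse JSON' >> beam.Map(lambda x: json.loads(x))
 | 'Key by ID' >> beam.Map(lambda x: (x['sensor_id'], x['value']))
 | 'Aggregate Per Key' >> beam.CombinePerKey(AggregateFn()))
```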
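
**Edit 2:** reading around, I suspect the issue could be that `CombineGlobally` over an unbounded (streaming) PCollection needs explicit windowing, since there is no "end of input" at which to emit a single global result. This is what I was planning to try next, though I haven't confirmed it's the right fix:

```python
# Window the stream into fixed 60-second windows before combining, and use
# .without_defaults() so empty windows don't emit a default value.
(p
 | 'Read from PubSub' >> beam.io.ReadFromPubSub(subscription='projects/my-project/subscriptions/my-subscription')
 | 'Parse JSON' >> beam.Map(lambda x: json.loads(x))
 | 'Extract Value' >> beam.Map(lambda x: x['value'])
 | 'Window' >> beam.WindowInto(beam.window.FixedWindows(60))
 | 'Aggregate Values' >> beam.CombineGlobally(AggregateFn()).without_defaults())
```

Is windowing actually required here, or is something else going on?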