GCP AI Platform Training Job Fails with 'ResourceExhaustedError' (OOM) When Using TensorFlow 2.5 and a Custom Container
I'm running into an issue when training a model on Google Cloud AI Platform (Vertex AI) with TensorFlow 2.5 in a custom container. The training job fails with the error message `ResourceExhaustedError: OOM when allocating tensor of shape`, even though I believe I've specified sufficient resources in my configuration.

I've defined the training job like this:

```python
from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')


def train_model():
    # CustomJob takes explicit worker pool specs: machine/accelerator,
    # replica count, and the Python package to run in the custom image.
    job = aiplatform.CustomJob(
        display_name='my-training-job',
        worker_pool_specs=[
            {
                'machine_spec': {
                    'machine_type': 'n1-standard-8',
                    'accelerator_type': 'NVIDIA_TESLA_V100',
                    'accelerator_count': 1,
                },
                'replica_count': 1,
                'python_package_spec': {
                    'executor_image_uri': 'gcr.io/my-project/my-custom-image:latest',
                    'package_uris': ['gs://my-bucket/my_package-0.1.tar.gz'],
                    'python_module': 'my_package.my_module',
                },
            }
        ],
    )
    job.run(sync=True)


train_model()
```

I've tried increasing `machine_type` to `n1-highmem-8` and even to `n1-highcpu-16`, but the error persists. I've also verified that the custom container runs the training script locally with the same dataset and parameters without any memory issues, and the failure occurs in both my development and production projects (I develop and submit the jobs from macOS). The training data is stored in a BigQuery table, and I'm using the BigQuery Dataflow connector to stream it directly into the training job.

Is there a specific configuration or best practice related to memory management on AI Platform that I might be missing? Any insights would be greatly appreciated.
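For reference, the training module (`my_package.my_module`) is structured roughly like the sketch below. This is a heavily simplified, hypothetical version rather than my actual code: the random in-memory dataset stands in for the BigQuery stream, and the model, batch size, and GPU memory-growth call are illustrative rather than exact.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 256  # placeholder; the real value comes from the job's arguments


def stand_in_for_bigquery_stream() -> tf.data.Dataset:
    """Random rows standing in for the streamed BigQuery data."""
    features = np.random.rand(10_000, 20).astype("float32")
    labels = np.random.rand(10_000, 1).astype("float32")
    return tf.data.Dataset.from_tensor_slices((features, labels))


def build_input_pipeline(rows: tf.data.Dataset) -> tf.data.Dataset:
    """Shuffle, batch, and prefetch so only a few batches are resident at once."""
    return rows.shuffle(10_000).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)


def main():
    # Let TensorFlow grow GPU memory on demand instead of reserving it all up front.
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)

    train_ds = build_input_pipeline(stand_in_for_bigquery_stream())

    # Placeholder model; the real one is larger.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(train_ds, epochs=5)


if __name__ == "__main__":
    main()
```

The intent of the batch/prefetch structure is that only a handful of batches should ever be resident in memory at a time.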