Unexpected NaN Values in TensorFlow 2.12 While Using tf.function with Custom Training Loop
I'm updating my dependencies and doing some performance testing, and I've hit an issue where the loss becomes NaN during training when I use a custom training loop with `tf.function` in TensorFlow 2.12. I have implemented a simple feedforward neural network, and I'm using the Adam optimizer. Here's the relevant snippet of my code:

```python
import tensorflow as tf
import numpy as np

# Sample data
X_train = np.random.rand(1000, 10).astype(np.float32)
Y_train = np.random.rand(1000, 1).astype(np.float32)

# Define a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Custom training step
@tf.function
def train_step(X, Y):
    with tf.GradientTape() as tape:
        predictions = model(X, training=True)
        # mean_squared_error returns one value per sample, so reduce to a scalar
        loss = tf.reduce_mean(tf.keras.losses.mean_squared_error(Y, predictions))
    # Check for NaN/Inf in loss; use the returned tensor so the check stays in the graph
    loss = tf.debugging.check_numerics(loss, 'Loss is NaN')
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

# Training loop (one full-batch step per epoch)
for epoch in range(10):
    loss = train_step(X_train, Y_train)
    print(f'Epoch {epoch}, Loss: {loss.numpy()}')
```

The training loop seems to work fine initially, but after a few epochs the loss turns into NaN. I've tried gradient clipping using `tf.clip_by_value` on the gradients, but it doesn't seem to help. I've also checked the data for NaN and Inf values, and everything seems fine.

Here's the debug output I receive when the loss turns NaN:

```
Tensor: tf.Tensor(nan, shape=(), dtype=float32) % Loss is NaN
```

Could this issue be related to the specific operations within `train_step`, or is there something else I should be checking? What's the best practice here? Has anyone else encountered this? Any insights would be greatly appreciated!

For context: I'm using Python on Windows 10.
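
The data check mentioned above can be sketched roughly as follows. This is a minimal illustration only: `count_non_finite` is a hypothetical helper name made up for this post (not a NumPy or TensorFlow API), and the arrays are regenerated the same way as the sample data in the snippet.

```python
import numpy as np


def count_non_finite(arr):
    """Return (nan_count, inf_count) for a numeric array."""
    arr = np.asarray(arr)
    nan_count = int(np.isnan(arr).sum())
    inf_count = int(np.isinf(arr).sum())
    return nan_count, inf_count


# Same shapes and dtype as the sample data in the question.
X_train = np.random.rand(1000, 10).astype(np.float32)
Y_train = np.random.rand(1000, 1).astype(np.float32)

for name, arr in (("X_train", X_train), ("Y_train", Y_train)):
    nan_count, inf_count = count_non_finite(arr)
    # Also print the value range, since very large inputs can overflow later.
    print(f"{name}: NaN={nan_count}, Inf={inf_count}, "
          f"min={arr.min():.4f}, max={arr.max():.4f}")
```

Both counts come back as zero for this random data, which is why the inputs themselves look fine to me.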