Unexpected NaN Values During Training with TensorFlow 2.8.0 When Using Custom Loss Function
I'm maintaining some legacy training code and have run into an issue that's been bugging me. I'm training a neural network with TensorFlow 2.8.0, and the loss sometimes becomes NaN during training. This only seems to happen when I use a custom loss function that computes the mean squared error between the predicted and true values.

Here's my implementation of the custom loss function:

```python
import tensorflow as tf

def custom_mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))
```

I compile my model with this loss function as follows:

```python
model.compile(optimizer='adam', loss=custom_mse)
```

In my training loop, I'm using the following code:

```python
history = model.fit(x_train, y_train, epochs=50, validation_data=(x_val, y_val))
```

The training data is normalized, and I double-checked `x_train` and `y_train` for infinite or NaN values; they appear to be fine. However, during training the loss sometimes spikes to NaN, which completely halts progress. I've tried adding gradient clipping with `tf.keras.optimizers.Adam(clipnorm=1.0)`, but that hasn't resolved the problem.

Could this be related to how gradients are computed in my custom loss function, or is there something wrong with my data preprocessing steps? Any insights or debugging tips would be greatly appreciated!
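For reference, this is roughly how I checked the data for bad values (a minimal sketch; it assumes `x_train` and `y_train` are NumPy float arrays):

```python
import numpy as np

# Sanity check: look for NaN/Inf values and extreme ranges in inputs and targets.
for name, arr in [("x_train", x_train), ("y_train", y_train)]:
    print(name,
          "NaN:", np.isnan(arr).any(),
          "Inf:", np.isinf(arr).any(),
          "min:", arr.min(),
          "max:", arr.max())
```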
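And this is how the clipped optimizer is wired in (again a sketch, not my exact script; the learning rate shown is just the Keras default, and I've included a `tf.keras.callbacks.TerminateOnNaN` callback so the run stops at the first NaN loss, which makes the failing epoch easier to isolate):

```python
# Adam with gradient clipping by global norm.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer, loss=custom_mse)

# TerminateOnNaN stops training as soon as the loss becomes NaN.
history = model.fit(
    x_train, y_train,
    epochs=50,
    validation_data=(x_val, y_val),
    callbacks=[tf.keras.callbacks.TerminateOnNaN()],
)
```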