Unexpected NaN values in TensorFlow model loss during training with custom dataset
Hey everyone, I'm running into an issue that's driving me crazy, and I'm a bit lost. I'm maintaining legacy code and, as far as I can tell, following best practices, and I've tried everything I can think of.

I'm training a TensorFlow (v2.11) model for image classification, and the loss suddenly becomes NaN during training. I've verified that my dataset is properly loaded, normalized, and split into training and validation sets. I'm using `ImageDataGenerator` for preprocessing, but I suspect there might be an issue with how my labels are being handled. Here's a snippet of my data loading code:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1] and hold out 20% of the images for validation.
datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2,
)

# Both generators read from the same directory; `subset` selects the split.
train_generator = datagen.flow_from_directory(
    'data/train/',
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical',
    subset='training'
)

validation_generator = datagen.flow_from_directory(
    'data/train/',
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical',
    subset='validation'
)
```

I've ensured that the images are all in .jpg format and the labels are consistent. However, after a few epochs, the loss becomes NaN, and I see warnings like this in my logs:

```
WARNING:tensorflow:Gradients do not exist for variables [<tf.Variable 'conv2d/kernel:0' shape=(3, 3, 3, 32) dtype=float32>, ...] when minimizing the loss.
```

I've also tried different optimizers (Adam and SGD), but the problem persists. I've checked the data for NaN values and outliers, and everything seems fine. Could this be related to the learning rate being too high, or possibly an overflow during calculations? Any suggestions on how to debug this would be greatly appreciated!

I'm also concerned about the overall architecture, since I'm using a custom CNN model with three convolutional layers and dropout after each one.
Here's a simplified version of my model:

```python
from tensorflow.keras import layers, models

# One output unit per class folder found by flow_from_directory.
num_classes = train_generator.num_classes

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    layers.MaxPooling2D(2, 2),
    layers.Dropout(0.2),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Dropout(0.2),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dense(num_classes, activation='softmax')
])
```

Any help on how to avoid NaN losses or properly diagnose this would be greatly appreciated. I'm working with Python in a Docker container on Ubuntu 20.04. Cheers for any assistance!
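
Since I suspect the label handling, here is a minimal sketch of the kind of folder check that could help rule it out. With `flow_from_directory`, labels come from the subdirectory names, so the sketch simply counts image files per class folder and flags folders with too few. The helper names `count_images_per_class` and `find_suspect_classes` and the suffix list are my own assumptions for illustration, not part of Keras, and the check says nothing about whether individual files are readable.

```python
from pathlib import Path

# Assumption: which file suffixes count as images for this check.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(root):
    """Return {class_name: image_count} for each immediate subdirectory of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # files sitting directly under root belong to no class
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def find_suspect_classes(counts, min_images=1):
    """Return the names of class folders holding fewer than min_images images."""
    return sorted(name for name, n in counts.items() if n < min_images)


if __name__ == "__main__":
    root = Path("data/train")  # same directory the generators read from
    if root.is_dir():
        counts = count_images_per_class(root)
        for name, n in counts.items():
            print(f"{name}: {n} images")
        print("suspect classes:", find_suspect_classes(counts))
```

If the number of folders reported here differs from the `num_classes` passed to the final `Dense` layer, or a folder turns out to be empty, that would be the first thing I'd fix before looking at the learning rate.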