CodexBloom - Programming Q&A Platform

Unexpected NaN values during training with TensorFlow 2.8 on custom dataset

👀 Views: 97 💬 Answers: 1 📅 Created: 2025-06-10
tensorflow machine-learning cnn Python

I'm training a model with TensorFlow 2.8 to classify images from a custom dataset, but I've run into an issue where the loss suddenly becomes NaN during training. I've tried several approaches, but none seem to work. The dataset is properly normalized and preprocessed, yet I still can't figure out what's going wrong. The model is a simple CNN with two convolutional layers followed by a couple of dense layers. Here's the code that initializes and compiles the model:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

For training, I use the following code:

```python
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).batch(32)
model.fit(train_dataset, epochs=10)
```

After a few epochs, the loss suddenly jumps to NaN, and the training process halts with the following error:

```
ValueError: Tensor conversion failed due to incompatible shapes: Expected shape: (None, 10) but got array with shape (32, 1)
```

I've double-checked the shape of my training labels and confirmed they are integers corresponding to the classes. I also tried a different optimizer (SGD), but the issue persists. I suspect it may be related to the weight initialization or the learning rate being too high, but reducing the learning rate to 0.0001 didn't resolve it either.

Any insights on what could be causing this NaN behavior and how to fix it would be greatly appreciated. My development environment is Linux, and my team is using Python for this service. Has anyone else encountered this? Is there a better approach, or am I approaching this the right way?
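In case it helps, here's roughly how I checked the labels and what I did when lowering the learning rate. This is a sketch of my debugging steps, not the full pipeline; it assumes `train_images` and `train_labels` are the NumPy arrays referenced in the snippets above, and the exact preprocessing is not shown:

```python
import numpy as np
import tensorflow as tf

# Sanity checks on the labels: they should be integer class ids in [0, 10)
print(train_labels.shape, train_labels.dtype)   # e.g. (N,) or (N, 1), int32/int64
print(np.unique(train_labels))                  # expected values 0..9

# Make sure the normalized images themselves contain no NaNs
assert not np.isnan(train_images).any()

# Recompile with a lower learning rate (I also tried SGD separately) -- neither helped
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

All of these checks passed on my side, which is why I'm unsure whether the problem is in the data pipeline or the model itself.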