TensorFlow 2.12 - Model Fine-tuning with Pre-trained BERT Results in NaN Loss During Training
I've been struggling with this for a few days now and could really use some help. I'm fine-tuning a pre-trained BERT model with TensorFlow 2.12 for a text classification task, but the loss becomes NaN after a few epochs. I've followed the standard practices for loading the model and preparing the dataset, yet despite multiple attempts I can't seem to get stable training.

Here's what I've tried:

1. I load the pre-trained BERT model with the `transformers` library and make sure the `trainable` attribute is set to `True` for all layers during fine-tuning:

```python
from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.trainable = True
```

2. I'm using the Adam optimizer with a learning rate of 5e-5, which is common for BERT fine-tuning, but the loss still diverges:

```python
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=5e-5)
```

3. I've added gradient clipping to prevent exploding gradients:

```python
import tensorflow as tf

def gradient_clipping(optimizer, loss, model):
    # `loss` is the scalar loss from the current forward pass
    grads = optimizer.get_gradients(loss, model.trainable_weights)
    grads, _ = tf.clip_by_global_norm(grads, 1.0)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
```

4. I've also checked the input data for NaNs and infinite values, and the pre-processing looks fine. I'm using a batch size of 16 with a `tf.data.Dataset`:

```python
dataset = dataset.batch(16).prefetch(tf.data.AUTOTUNE)
```

5. I logged the loss values and noticed that they go from reasonable numbers to NaN after only a few updates, which makes me suspect either that the learning rate is too high or that there's some instability in the model itself.

Could this be related to the model architecture, or is there an issue with my setup? Any insights on how to stabilize the training process would be greatly appreciated. Has anyone else encountered this? For extra context, below I've added how I check the data for NaNs, how I log the loss, and a sketch of how everything is wired together.
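The NaN/Inf check from point 4 looks roughly like this. It's a simplified illustration rather than code copied straight from my project, and it assumes the dataset yields `(features_dict, labels)` batches; the `check_dataset_for_nans` name and the `max_batches` limit are just for the sketch:

```python
import tensorflow as tf

def check_dataset_for_nans(dataset, max_batches=100):
    """Scan a few batches and report any non-finite values in features or labels."""
    for i, (features, labels) in enumerate(dataset.take(max_batches)):
        for name, tensor in features.items():
            values = tf.cast(tensor, tf.float32)
            if not bool(tf.reduce_all(tf.math.is_finite(values))):
                print(f"Non-finite value in feature '{name}' at batch {i}")
        if not bool(tf.reduce_all(tf.math.is_finite(tf.cast(labels, tf.float32)))):
            print(f"Non-finite value in labels at batch {i}")

check_dataset_for_nans(dataset)
```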
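For the loss logging in point 5, I use a small Keras callback along the lines of the one below, together with the built-in `tf.keras.callbacks.TerminateOnNaN` so training stops as soon as the loss blows up (the `BatchLossLogger` class here is a simplified stand-in for my actual logging):

```python
import tensorflow as tf

class BatchLossLogger(tf.keras.callbacks.Callback):
    """Print the training loss after every batch so the exact point of divergence is visible."""
    def on_train_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None:
            print(f"batch {batch}: loss = {loss:.4f}")

callbacks = [BatchLossLogger(), tf.keras.callbacks.TerminateOnNaN()]
```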
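Finally, a minimal end-to-end sketch of how the pieces are wired together. The tokenizer call, the placeholder data, and the `SparseCategoricalCrossentropy(from_logits=True)` loss are assumptions I've added to make the sketch self-contained, not exact code from my project, but the model, optimizer, and batching match the snippets above:

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

texts = ["a great movie", "a terrible plot"]  # placeholder data for the sketch
labels = [1, 0]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.trainable = True

encodings = tokenizer(texts, truncation=True, padding=True, max_length=128, return_tensors='tf')
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels))
dataset = dataset.shuffle(1000).batch(16).prefetch(tf.data.AUTOTUNE)

# The model outputs raw logits, so from_logits=True is needed for the loss.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# clipnorm could be passed here instead of clipping gradients manually.
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5, clipnorm=1.0)

model.compile(optimizer=optimizer, loss=loss_fn)
model.fit(dataset, epochs=3, callbacks=[tf.keras.callbacks.TerminateOnNaN()])
```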