Issues with TensorFlow's ModelCheckpoint not saving the best model during training
I'm training a deep learning model with TensorFlow 2.9.1 and have set up `ModelCheckpoint` to save the best model based on validation loss. However, I noticed that it sometimes saves a model that isn't the best one according to the validation metrics reported at the end of each epoch.

Here's the code snippet I used for setting up `ModelCheckpoint`:

```python
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    'best_model.h5',
    monitor='val_loss',
    save_best_only=True,
    mode='min',
    verbose=1
)
```

I have also included this callback in the `fit` call:

```python
model.fit(
    train_data,
    train_labels,
    epochs=50,
    validation_data=(val_data, val_labels),
    callbacks=[checkpoint_callback]
)
```

Despite this, I often find that the last saved model is not the one with the lowest validation loss observed during training. For example, the logs show:

```
Epoch 10: val_loss improved from 0.1234 to 0.1210, saving model to best_model.h5
Epoch 11: val_loss did not improve from 0.1210
Epoch 12: val_loss did not improve from 0.1210
Epoch 13: val_loss improved from 0.1210 to 0.1150, saving model to best_model.h5
Epoch 14: val_loss did not improve from 0.1150
```

I checked the file system and `best_model.h5` was last written at epoch 13, but the saved model doesn't appear to match the best validation loss I observed during training. I've also confirmed that `save_best_only` is set to `True`.

Is there something I might be missing, or is there a known issue with this callback that could lead to such behavior? For reference, I'm on Ubuntu 22.04. Any help would be appreciated!
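
**Edit:** For completeness, this is how I'm checking the saved checkpoint after training finishes, so I can compare it against the `val_loss` values in the log. It's a minimal sketch; `val_data` and `val_labels` are the same arrays I pass to `fit` above:

```python
from tensorflow.keras.models import load_model

# Load the checkpoint written by ModelCheckpoint during training.
best_model = load_model('best_model.h5')

# Re-evaluate on the same validation data used during training.
# return_dict=True gives {'loss': ..., <metric names>: ...}.
results = best_model.evaluate(val_data, val_labels, verbose=0, return_dict=True)
print(f"val_loss of saved best_model.h5: {results['loss']:.4f}")
```

The loss printed here is what I'm comparing against the best `val_loss` reported in the training log.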