'save_total_limit' not respected

#14
by enricoburi - opened

Hello,

I am fine-tuning this model for a classification task using the Trainer class.

Among the trainer arguments, I am setting `save_total_limit=1`.
I am using the exact same code I have been using for `bert-base-uncased` and `ModernBERT-base`, with which everything works as expected.
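For reference, here is a minimal sketch of the kind of setup I mean; the model id, output directory, save strategy, and datasets are placeholders rather than my exact code, but `save_total_limit=1` is the setting in question:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder values below (model id, output dir, strategy, datasets)
# stand in for my actual configuration.
model_id = "answerdotai/ModernBERT-base"  # one of the models where this works fine
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    save_strategy="epoch",
    save_total_limit=1,  # I expect only the most recent checkpoint to be kept
)

trainer = Trainer(
    model=model,
    args=training_args,
    # train_dataset=..., eval_dataset=...  (omitted here)
)
# trainer.train()
```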

However, in this case, the parameter in question seems to be completely ignored: as training progresses, no checkpoint is ever deleted. Given the size of the model, this becomes problematic pretty quickly.
Any idea what this might be related to?

I am working in JupyterLab on AWS SageMaker Studio and using transformers 4.51.3.

Keep up the great work!
