ketos train repeats validation in a loop if early stopping comes before min_epochs #402
Can you tell me which pytorch-lightning version you're running? I can't reproduce the error on my side, and all that logic is handled by ptl.
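For reference, the installed version can be checked directly from Python; this uses the package's standard version attribute, nothing kraken-specific:

```python
# Print the installed pytorch-lightning version.
import pytorch_lightning
print(pytorch_lightning.__version__)
```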
I used a fresh venv with …
Pretrain also repeats the validation step many times (more than 100?), but continues later. The test was started with …
We just had this happen during stage 11. I'm wondering if it could be a CPU-only thing (the machine this was on has no GPU). But I'm assuming, @stweil, that you were on GPU?
I'm on vacation right now but will have a look first thing in the new year. It looks like a pytorch-lightning bug, or at least some weird interaction with some of the kraken-custom callbacks.
Yes, ketos was started with …
Sorry for the lack of updates. It is indeed a pytorch-lightning bug (Lightning-AI/pytorch-lightning#16363). Until they push a fix upstream there is nothing to be done except not using the --min-epochs option.
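For context, here is a minimal sketch of the trainer configuration that triggers the upstream bug: an EarlyStopping callback whose patience elapses long before min_epochs is reached. The toy module, data, and metric name are invented for illustration; this is not kraken's actual training code, which wires up the same pieces through ketos options (--min-epochs, --lag).

```python
# Illustrative sketch only (toy module, invented metric name), mirroring the
# configuration reported in this issue: early stopping with a small patience
# combined with a much larger min_epochs.
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from torch.utils.data import DataLoader, TensorDataset


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        # A constant metric never improves, so EarlyStopping deterministically
        # runs out of patience after ~20 epochs, far before min_epochs=200.
        self.log("val_metric", 0.0)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


data = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
loader = DataLoader(data, batch_size=8)

trainer = pl.Trainer(
    min_epochs=200,  # corresponds to ketos --min-epochs 200
    max_epochs=-1,   # no fixed upper bound for this sketch
    callbacks=[EarlyStopping(monitor="val_metric", mode="max", patience=20)],  # --lag 20
    enable_checkpointing=False,
    logger=False,
)
# In the affected pytorch-lightning versions, once EarlyStopping fires before
# min_epochs, the trainer reportedly re-runs validation at the same epoch
# instead of advancing (see Lightning-AI/pytorch-lightning#16363).
trainer.fit(ToyModule(), loader, loader)
```

The constant validation metric makes the early stop deterministic: the stop request arrives after roughly patience-many epochs, well inside the min_epochs window, which is exactly the interaction described in the upstream report.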
The training was started with a minimum of 200 epochs and an early-stopping patience of 20 tries to get a better model:
ketos train -f page -t list.train -e list.eval -o Juristische_Konsilien_Tuebingen+256 -d cuda:0 --augment --workers 24 -r 0.0001 -B 1 --min-epochs 200 --lag 20 -w 0 -s '[256,64,0,1 Cr4,2,8,4,2 Cr4,2,32,1,1 Mp4,2,4,2 Cr3,3,64,1,1 Mp1,2,1,2 S1(1x0)1,3 Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5 Cr255,1,85,1,1]'
Early stopping would have stopped after stage 111, but training continues because at least 200 epochs were requested. However, instead of producing stages 112, 113, 114, ..., it stays at stage 112 and repeats the validation step again and again.