
ketos train repeats validation in a loop if early stopping comes too early #402

Open

stweil opened this issue Nov 26, 2022 · 7 comments

@stweil
Contributor

stweil commented Nov 26, 2022

The training was started with at least 200 epochs and 20 tries to get a better model:

ketos train -f page -t list.train -e list.eval -o Juristische_Konsilien_Tuebingen+256 -d cuda:0 --augment --workers 24 -r 0.0001 -B 1 --min-epochs 200 --lag 20 -w 0 -s '[256,64,0,1 Cr4,2,8,4,2 Cr4,2,32,1,1 Mp4,2,4,2 Cr3,3,64,1,1 Mp1,2,1,2 S1(1x0)1,3 Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5 Cr255,1,85,1,1]'

Early stopping would have stopped after stage 111, but training continues because at least 200 epochs were requested.
Instead of proceeding to stage 112, 113, 114, ..., it stays at stage 112 and repeats the validation step again and again:

stage 109/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7366/7366 0:00:00 0:05:14 val_accuracy: 0.87676  early_stopping: 18/20 0.87974
stage 110/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7366/7366 0:00:00 0:05:18 val_accuracy: 0.87542  early_stopping: 19/20 0.87974
stage 111/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7366/7366 0:00:00 0:05:18 val_accuracy: 0.87760  early_stopping: 20/20 0.87974
stage 112/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/7366 -:--:-- 0:00:00  early_stopping: 20/20 0.87974Trainer was signaled to stop but the required `min_epochs=200` or `min_steps=None` has not been met. Training will continue...
stage 112/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11802/7366 0:00:00 0:08:02 val_accuracy: 0.87345  early_stopping: 20/20 0.87974
Validation  ━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 223/826    0:00:40 0:00:16                                                     
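
For context, the relevant combination here is an open-ended trainer with min_epochs set and an early-stopping callback. A minimal, self-contained pytorch-lightning sketch of that configuration (illustrative only, not kraken's actual training code; ToyModel and the random data are made up) looks roughly like this:

# Hypothetical minimal setup mirroring --min-epochs 200 and --lag 20 above.
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from torch.utils.data import DataLoader, TensorDataset

class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # kraken logs a character accuracy; a dummy stand-in metric is used here
        self.log("val_accuracy", -torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)

train = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=1)
val = DataLoader(TensorDataset(torch.randn(16, 8), torch.randn(16, 1)), batch_size=1)

trainer = pl.Trainer(
    max_epochs=-1,   # open-ended training, like the "stage N/∞" display above
    min_epochs=200,  # --min-epochs 200
    callbacks=[EarlyStopping(monitor="val_accuracy", mode="max", patience=20)],  # --lag 20
)
trainer.fit(ToyModel(), train, val)

Once the early-stopping callback signals the trainer to stop before epoch 200, min_epochs forces training to continue, which is where the behaviour shown in the log above appears.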
stweil changed the title from "ketos train repeats validation in an endless (?) loop if early stopping comes before min_epochs" to "ketos train repeats validation in a loop if early stopping comes before min_epochs" on Nov 26, 2022
stweil changed the title from "ketos train repeats validation in a loop if early stopping comes before min_epochs" to "ketos train repeats validation in a loop if early stopping comes too early" on Nov 26, 2022
@mittagessen
Owner

Can you tell me which pytorch-lightning version you're running? I can't reproduce the error on my side and all that logic is handled by ptl.
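
For example, something like this should print the installed version:

import pytorch_lightning
print(pytorch_lightning.__version__)  # prints the installed pytorch-lightning version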

@stweil
Contributor Author

stweil commented Nov 26, 2022

I used a fresh venv with pip install kraken and got pytorch-lightning 1.8.3.post1.

@stweil
Contributor Author

stweil commented Nov 27, 2022

ketos pretrain also repeats the validation step many times (more than 100?), but eventually continues. The test was started with
ketos pretrain -f page -t list.train -e list.eval -o pretrain -d cuda:0 --workers 24. A complete log output is here.

[...]
Adjusting learning rate of group 0 to 1.0000e-06.
stage 38/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116/116 0:00:00 0:04:21 loss: 325  early_stopping: 4/5 1843.54395
Adjusting learning rate of group 0 to 1.0000e-06.
stage 39/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116/116 0:00:00 0:04:20 loss: 320  early_stopping: 5/5 1843.54395
stage 40/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/116 -:--:-- 0:00:00  early_stopping: 5/5 1843.54395Trainer was signaled to stop but the required `min_epochs=100` or `min_steps=None` has not been met. Training will continue...
Adjusting learning rate of group 0 to 1.0000e-06.
stage 40/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1442/116 0:00:00 0:09:04 loss: 306  early_stopping: 0/5 1837.04565
Adjusting learning rate of group 0 to 1.0000e-06.
stage 41/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1442/116 0:00:00 0:09:05 loss: 301  early_stopping: 0/5 1836.88599
Adjusting learning rate of group 0 to 1.0000e-06.
stage 42/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1442/116 0:00:00 0:09:05 loss: 299  early_stopping: 1/5 1836.88599
Adjusting learning rate of group 0 to 1.0000e-06.
stage 43/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1442/116 0:00:00 0:09:05 loss: 292  early_stopping: 2/5 1836.88599
stage 44/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1266/116 0:00:00 0:09:03 loss: 296  early_stopping: 2/5 1836.88599
Validation ━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━ 5/13     0:00:48 0:00:50                                        

@sixtyfive
Contributor

We just had this happen during a stage 11. I'm wondering if it could be a CPU-only thing (the machine this was on has no GPU). But I'm assuming, @stweil, that you were on GPU?

@mittagessen
Owner

I'm on vacation right now but will have a first look in the new year. It looks like a pytorch-lightning bug, or at least some weird interaction with some of the kraken-custom callbacks.

@stweil
Contributor Author

stweil commented Dec 20, 2022

But I'm assuming, @stweil, that you were on GPU?

Yes, ketos was started with -d cuda:0.

@mittagessen
Owner

Sorry for the lack of updates. It is indeed a pytorch-lightning bug (Lightning-AI/pytorch-lightning#16363). Until they push a fix upstream, there is nothing to be done except not using the --min-epochs option.
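
For example, the training from the initial report can be run with the same options minus --min-epochs 200:

ketos train -f page -t list.train -e list.eval -o Juristische_Konsilien_Tuebingen+256 -d cuda:0 --augment --workers 24 -r 0.0001 -B 1 --lag 20 -w 0 -s '[256,64,0,1 Cr4,2,8,4,2 Cr4,2,32,1,1 Mp4,2,4,2 Cr3,3,64,1,1 Mp1,2,1,2 S1(1x0)1,3 Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5 Cr255,1,85,1,1]'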
