
ketos train repeats validation in a loop if early stopping comes too early #402

Open

stweil opened this issue Nov 26, 2022 · 7 comments

@stweil
Contributor

stweil commented Nov 26, 2022

The training was started with at least 200 epochs and 20 tries to get a better model:

ketos train -f page -t list.train -e list.eval -o Juristische_Konsilien_Tuebingen+256 -d cuda:0 --augment --workers 24 -r 0.0001 -B 1 --min-epochs 200 --lag 20 -w 0 -s '[256,64,0,1 Cr4,2,8,4,2 Cr4,2,32,1,1 Mp4,2,4,2 Cr3,3,64,1,1 Mp1,2,1,2 S1(1x0)1,3 Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5 Cr255,1,85,1,1]'

Early stopping would have stopped after stage 111, but training continues because at least 200 epochs were requested.
Instead of proceeding to stage 112, 113, 114, ..., it stays at stage 112 and repeats the validation step again and again:

stage 109/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7366/7366 0:00:00 0:05:14 val_accuracy: 0.87676  early_stopping: 18/20 0.87974
stage 110/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7366/7366 0:00:00 0:05:18 val_accuracy: 0.87542  early_stopping: 19/20 0.87974
stage 111/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7366/7366 0:00:00 0:05:18 val_accuracy: 0.87760  early_stopping: 20/20 0.87974
stage 112/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/7366 -:--:-- 0:00:00  early_stopping: 20/20 0.87974Trainer was signaled to stop but the required `min_epochs=200` or `min_steps=None` has not been met. Training will continue...
stage 112/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11802/7366 0:00:00 0:08:02 val_accuracy: 0.87345  early_stopping: 20/20 0.87974
Validation  ━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 223/826    0:00:40 0:00:16                                                     
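
For context, the relevant combination here is an open-ended trainer with min_epochs set and an early-stopping callback. A minimal, self-contained pytorch-lightning sketch of that configuration (illustrative only, not kraken's actual training code; ToyModel and the random data are made up) looks roughly like this:

# Hypothetical minimal setup mirroring --min-epochs 200 and --lag 20 above.
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from torch.utils.data import DataLoader, TensorDataset

class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # kraken logs a character accuracy; a dummy stand-in metric is used here
        self.log("val_accuracy", -torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)

train = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=1)
val = DataLoader(TensorDataset(torch.randn(16, 8), torch.randn(16, 1)), batch_size=1)

trainer = pl.Trainer(
    max_epochs=-1,   # open-ended training, like the "stage N/∞" display above
    min_epochs=200,  # --min-epochs 200
    callbacks=[EarlyStopping(monitor="val_accuracy", mode="max", patience=20)],  # --lag 20
)
trainer.fit(ToyModel(), train, val)

Once the early-stopping callback signals the trainer to stop before epoch 200, min_epochs forces training to continue, which is where the behaviour shown in the log above appears.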
stweil changed the title from "ketos train repeats validation in an endless (?) loop if early stopping comes before min_epochs" to "ketos train repeats validation in a loop if early stopping comes before min_epochs" on Nov 26, 2022
stweil changed the title from "ketos train repeats validation in a loop if early stopping comes before min_epochs" to "ketos train repeats validation in a loop if early stopping comes too early" on Nov 26, 2022
@mittagessen
Owner

Can you tell me which pytorch-lightning version you're running? I can't reproduce the error on my side and all that logic is handled by ptl.
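
For example, something like this should print the installed version:

import pytorch_lightning
print(pytorch_lightning.__version__)  # prints the installed pytorch-lightning version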

@stweil
Contributor Author

stweil commented Nov 26, 2022

I used a fresh venv with pip install kraken and got pytorch-lightning 1.8.3.post1.

@stweil
Contributor Author

stweil commented Nov 27, 2022

ketos pretrain also repeats the validation step many times (more than 100?), but eventually continues. The test was started with
ketos pretrain -f page -t list.train -e list.eval -o pretrain -d cuda:0 --workers 24. A complete log output is here.

[...]
Adjusting learning rate of group 0 to 1.0000e-06.
stage 38/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116/116 0:00:00 0:04:21 loss: 325  early_stopping: 4/5 1843.54395
Adjusting learning rate of group 0 to 1.0000e-06.
stage 39/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116/116 0:00:00 0:04:20 loss: 320  early_stopping: 5/5 1843.54395
stage 40/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/116 -:--:-- 0:00:00  early_stopping: 5/5 1843.54395Trainer was signaled to stop but the required `min_epochs=100` or `min_steps=None` has not been met. Training will continue...
Adjusting learning rate of group 0 to 1.0000e-06.
stage 40/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1442/116 0:00:00 0:09:04 loss: 306  early_stopping: 0/5 1837.04565
Adjusting learning rate of group 0 to 1.0000e-06.
stage 41/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1442/116 0:00:00 0:09:05 loss: 301  early_stopping: 0/5 1836.88599
Adjusting learning rate of group 0 to 1.0000e-06.
stage 42/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1442/116 0:00:00 0:09:05 loss: 299  early_stopping: 1/5 1836.88599
Adjusting learning rate of group 0 to 1.0000e-06.
stage 43/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1442/116 0:00:00 0:09:05 loss: 292  early_stopping: 2/5 1836.88599
stage 44/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1266/116 0:00:00 0:09:03 loss: 296  early_stopping: 2/5 1836.88599
Validation ━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━ 5/13     0:00:48 0:00:50                                        

@sixtyfive
Contributor

We just had this happen during a stage 11. I'm wondering if it could be a CPU-only thing (the machine this was on has no GPU). But I'm assuming, @stweil, that you were on GPU?

@mittagessen
Owner

I'm on vacation right now but will have a first look in the new year. It looks like a pytorch-lightning bug, or at least some weird interaction with some of the kraken-custom callbacks.

@stweil
Contributor Author

stweil commented Dec 20, 2022

But I'm assuming, @stweil, that you were on GPU?

Yes, ketos was started with -d cuda:0.

@mittagessen
Owner

Sorry for the lack of updates. It is indeed a pytorch-lightning bug (Lightning-AI/pytorch-lightning#16363). Until they push a fix upstream, there is nothing to be done except not using the --min-epochs option.
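
For example, the training from the initial report can be run with the same options minus --min-epochs 200:

ketos train -f page -t list.train -e list.eval -o Juristische_Konsilien_Tuebingen+256 -d cuda:0 --augment --workers 24 -r 0.0001 -B 1 --lag 20 -w 0 -s '[256,64,0,1 Cr4,2,8,4,2 Cr4,2,32,1,1 Mp4,2,4,2 Cr3,3,64,1,1 Mp1,2,1,2 S1(1x0)1,3 Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5 Cr255,1,85,1,1]'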
