
[Bug]: Continuing training assumes step 0 as start for cosine curve #297

Open
ejektaflex opened this issue May 8, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@ejektaflex

What happened?

I just trained a model with a constant scheduler and the Prodigy optimizer. Now I'd like to finish it off with some fine detail by switching the scheduler to cosine_with_hard_restarts, since Prodigy is meant to be used with constant or cosine schedulers. However, upon continuing, the scheduler assumes that the peak of the cosine curve was at step 0, which is incorrect. This means that my LoRA fine-tune comes out underbaked and does not learn details appropriately.

Here is an image to demonstrate:
[image: learning rate curve of the continued run]

This means that if someone trains for 90 epochs at an LR of, for example, 1.0, and then switches to a cosine scheduler for the last 10 epochs, the last 10 epochs will sweep from roughly 0.1 down to 0.0 instead of from 1.0 down to 0.0. I don't think most users want this - it means that the smaller the fraction of total training time you spend fine-tuning with a cosine scheduler, the less impactful that fine-tuning is. In other words, the later you push off the fine-tuning, the worse it will be (since the LR will be too low to have any impact).
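
For illustration (this is not OneTrainer's actual scheduler code, and the step counts are hypothetical), here is a minimal sketch of a plain cosine decay showing the difference between the two interpretations at the point where the scheduler is switched:

```python
import math

def cosine_lr(step, total_steps, base_lr=1.0):
    # Plain cosine decay: peak at step 0, decaying to 0 at total_steps.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(step / total_steps, 1.0)))

total_steps = 1000   # hypothetical: 100 epochs worth of steps
switch_step = 900    # scheduler switched after 90 epochs of CONSTANT

# Current behavior: the cosine curve assumes its peak was at global step 0,
# so at the switch point the LR is already almost fully decayed.
print(cosine_lr(switch_step, total_steps))        # small fraction of base_lr

# Expected behavior: treat the switch point as the start of the curve,
# so the remaining 10 epochs sweep from base_lr down to 0.
print(cosine_lr(0, total_steps - switch_step))    # = base_lr
```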

What did you expect would happen?

Since 50% of my steps were done with CONSTANT, the remaining 50% with COSINE_WITH_HARD_RESTARTS starts about half as high as I would expect. If I continue training with COSINE_WITH_HARD_RESTARTS, I'd expect to see something closer to this red line:
[image: expected learning rate curve, shown as a red line]

Note that Prodigy itself realized the LR was too low and bumped it up a bit, for this very reason.
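
One way to get that red-line behavior (a minimal sketch, not OneTrainer's implementation; the cycle count and step numbers are made up) is to build the restart schedule over only the steps that remain after the switch:

```python
import math
import torch

def make_restarted_cosine(optimizer, remaining_steps, num_cycles=2):
    # Cosine with hard restarts, counted from the scheduler switch onward
    # rather than from global step 0.
    def lr_lambda(step):
        progress = step / max(1, remaining_steps)
        if progress >= 1.0:
            return 0.0
        return 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage with a dummy model and hypothetical step counts
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0)
scheduler = make_restarted_cosine(optimizer, remaining_steps=100, num_cycles=2)

for _ in range(100):
    optimizer.step()
    scheduler.step()
```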

Relevant log output

No response

Output of pip freeze

No response

ejektaflex added the bug (Something isn't working) label on May 8, 2024
@Nerogar
Owner

Nerogar commented May 8, 2024

The most common use case for backups is continuing a training run without changing any settings. The meta.json file in the backup directory saves the exact step number. If you stop in the middle of a cosine-scheduled run and then continue from that backup, the learning rate will not climb back to 100%. The same happens when you change the scheduler: it assumes that all previous steps were already trained with the same scheduler.

You can manually edit the meta.json file and set all values back to 0. That will restart the schedule.
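
Roughly something like this (the path below is a placeholder and the field names are not spelled out here; check the actual meta.json in your backup directory before editing):

```python
import json
from pathlib import Path

# Placeholder path; point this at the meta.json inside your backup directory.
meta_path = Path("backups/2024-05-08_12-00-00/meta.json")

meta = json.loads(meta_path.read_text())

# Reset every integer counter (step/epoch style fields) to 0 so the scheduler
# starts its curve from the beginning when training is continued.
for key, value in meta.items():
    if isinstance(value, int) and not isinstance(value, bool):
        meta[key] = 0

meta_path.write_text(json.dumps(meta, indent=4))
```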

@ejektaflex
Author

Yes, I understand. Is there a good reason why we assume that previous steps were trained with the same scheduler?

I understand that this is intentional for backup purposes, and while that might be the most common case, I imagine some users will also want to continue training the way I do.
