
[Bug]: Continuing training assumes step 0 as start for cosine curve #297

Open
ejektaflex opened this issue May 8, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@ejektaflex

What happened?

I just trained a model with a constant scheduler and the Prodigy optimizer. Now I'd like to finish it off with some fine detail by switching the scheduler to cosine_with_hard_restarts, since Prodigy is meant to be used with constant or cosine schedulers. However, upon continuing, the scheduler assumes that the peak of the cosine curve was at step 0, which is incorrect. This means that my LoRA fine-tune comes out underbaked and does not learn details appropriately.

Here is an image to demonstrate:
[image: learning rate curve of the continued run]

This means that if someone trains for 90 epochs at an LR of, for example, 1.0, and then switches to a cosine scheduler for the last 10 epochs, the last 10 epochs will sweep from roughly 0.1 down to 0.0 instead of from 1.0 down to 0.0. I don't think most users want this - it means that the smaller the fraction of total training time you spend fine-tuning with a cosine scheduler, the less impactful that fine-tuning is. In other words, the later you push off the fine-tuning, the worse it will be (since the LR will be too low to have any impact).
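
For illustration (this is not OneTrainer's actual scheduler code, and the step counts are hypothetical), here is a minimal sketch of a plain cosine decay showing the difference between the two interpretations at the point where the scheduler is switched:

```python
import math

def cosine_lr(step, total_steps, base_lr=1.0):
    # Plain cosine decay: peak at step 0, decaying to 0 at total_steps.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(step / total_steps, 1.0)))

total_steps = 1000   # hypothetical: 100 epochs worth of steps
switch_step = 900    # scheduler switched after 90 epochs of CONSTANT

# Current behavior: the cosine curve assumes its peak was at global step 0,
# so at the switch point the LR is already almost fully decayed.
print(cosine_lr(switch_step, total_steps))        # small fraction of base_lr

# Expected behavior: treat the switch point as the start of the curve,
# so the remaining 10 epochs sweep from base_lr down to 0.
print(cosine_lr(0, total_steps - switch_step))    # = base_lr
```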

What did you expect would happen?

Since 50% of my steps were done with CONSTANT, the remaining 50% with COSINE_WITH_HARD_RESTARTS starts about half as high as I would expect. If I continue training with COSINE_WITH_HARD_RESTARTS, I'd expect to see something closer to this red line:
[image: expected learning rate curve, shown as a red line]

Note that Prodigy itself realized the LR was too low and bumped it up a bit, for this very reason.
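
One way to get that red-line behavior (a minimal sketch, not OneTrainer's implementation; the cycle count and step numbers are made up) is to build the restart schedule over only the steps that remain after the switch:

```python
import math
import torch

def make_restarted_cosine(optimizer, remaining_steps, num_cycles=2):
    # Cosine with hard restarts, counted from the scheduler switch onward
    # rather than from global step 0.
    def lr_lambda(step):
        progress = step / max(1, remaining_steps)
        if progress >= 1.0:
            return 0.0
        return 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage with a dummy model and hypothetical step counts
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0)
scheduler = make_restarted_cosine(optimizer, remaining_steps=100, num_cycles=2)

for _ in range(100):
    optimizer.step()
    scheduler.step()
```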

Relevant log output

No response

Output of pip freeze

No response

ejektaflex added the bug (Something isn't working) label on May 8, 2024
@Nerogar
Owner

Nerogar commented May 8, 2024

The most common use case for backups is continuing a training run without changing any settings. The meta.json file in the backup directory saves the exact step number. If you stop in the middle of a cosine-scheduled run and then continue from that backup, the learning rate will not climb back to 100%. The same happens when you change the scheduler: it assumes that all previous steps were already trained with the same scheduler.

You can manually edit the meta.json file and set all values back to 0. That will restart the schedule.
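
Roughly something like this (the path below is a placeholder and the field names are not spelled out here; check the actual meta.json in your backup directory before editing):

```python
import json
from pathlib import Path

# Placeholder path; point this at the meta.json inside your backup directory.
meta_path = Path("backups/2024-05-08_12-00-00/meta.json")

meta = json.loads(meta_path.read_text())

# Reset every integer counter (step/epoch style fields) to 0 so the scheduler
# starts its curve from the beginning when training is continued.
for key, value in meta.items():
    if isinstance(value, int) and not isinstance(value, bool):
        meta[key] = 0

meta_path.write_text(json.dumps(meta, indent=4))
```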

@ejektaflex
Author

Yes, I understand. Is there a good reason why we assume that previous steps were trained with the same scheduler?

I understand that this is intentional for backup purposes, and while that might be the most common case, I imagine some users will also want to continue training the way I do.
