Sinusoidal Learning rate when increasing both max_epochs and batch size #107
Hi @JJrodny and @gvoysey, thanks for the well-detailed issue, as always 😉
Yes, you are correct. To be 100% clear, because someone just posted a related question (#108): your effective batch size (i.e. the number of subgraphs on which your weights will be updated) will be `sample_graph_k * gradient_accumulator`. For your information, another way of increasing the size of your batch is to increase the radius of the sampled subgraphs by playing with the corresponding radius parameter. OK, back to your actual issue now. A consequence of gradient accumulation is that, if you want to train for as many iterations (i.e. model weight updates) as before, you will also need to adjust the number of epochs you train for. Said otherwise, if you x2 your gradient accumulation, you should also x2 your `max_epochs` to keep the same number of optimizer steps. In doing so, I also remember coming across the same kind of issue you describe with the scheduler.
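To make the bookkeeping concrete, here is a tiny standalone sketch of that arithmetic (plain Python; the numbers are only illustrative, not taken from any actual config):

```python
# Illustrative numbers only, not from an actual SPT config.
sample_graph_k = 4         # subgraphs loaded on the GPU per forward pass
gradient_accumulator = 10  # batches accumulated before each weight update

# Effective batch size: subgraphs seen per optimizer step.
effective_batch_size = sample_graph_k * gradient_accumulator  # 40

# To keep the same number of optimizer steps when you scale accumulation,
# scale the number of epochs by the same factor.
base_max_epochs = 400      # hypothetical baseline
accumulation_factor = 2    # e.g. doubling gradient_accumulator
adjusted_max_epochs = base_max_epochs * accumulation_factor  # 800
```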
Please let me know if this helps! If not, I will try to reproduce this behavior. I know this part of the code is a bit shady 😅
Thank you for the detailed reply! We're still trying to figure it out. After trying a few different combinations of values for `max_epochs` and `gradient_accumulator`, the learning rate still doesn't reach zero at the end of training. At first I was looking at graphs in wandb measuring the learning rate against the default x-axis value, and the curve comes back up instead of staying at zero, which suggests to me that the scheduler's period isn't tracking the new `max_epochs`.
I have two questions that have come up while doing this:
My ultimate goal in this is to have the learning rate drop down to 0 at the end of training, no matter what we set the max number of epochs to. Thanks for all of your help!
Hi @JJrodny, I do not have access to any SPT-ready machine at the moment; I will try to look into this by Friday.
Hi @JJrodny, I have not forgotten about your issue, but I still haven't found time to look into it yet, sorry!
Hi @JJrodny, I have not been able to reproduce your error. Are you training from scratch, or are you starting training from a pre-trained checkpoint file? If the latter, then this is likely the source of the problem.
Yep! We're starting with the DALES pretrained checkpoint and fine-tuning on top.
OK, so that's where the problem comes from. I do not support fine-tuning yet, I will need to work on this! In the meantime, if your dataset is large enough, I would advise just training from scratch. If not, then you will need to tweak the optimizer and scheduler to be well behaved. I know this is a feature you need; I will try to make time to work on it soon, sorry for the delay 😖
I was right since the start: I don't know exactly how you are loading your checkpoint, but you are probably doing this:
To boil it down, you have to load the checkpoint AND override it with your own parameters. What solved the problem for me was to call `load_from_checkpoint` with all the args provided in the model config, so that it overrides the optimizer and scheduler parameters stored in the checkpoint.
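Roughly, something like this (just a sketch; the exact LightningModule class and config keys depend on your setup — here `MyModule`, `cfg.model.optimizer` and `cfg.model.scheduler` are placeholders):

```python
import hydra
from omegaconf import DictConfig

# `MyModule` is a placeholder for your SPT LightningModule class; the config
# keys `model.optimizer` / `model.scheduler` are assumptions about your setup.
from src.models import MyModule  # hypothetical import


def load_finetune_model(cfg: DictConfig, ckpt_path: str):
    # Build fresh optimizer/scheduler factories from YOUR config (Hydra
    # returns functools.partial objects when the node sets `_partial_: true`).
    optimizer = hydra.utils.instantiate(cfg.model.optimizer)
    scheduler = hydra.utils.instantiate(cfg.model.scheduler)

    # Extra kwargs to `load_from_checkpoint` override the hyperparameters
    # saved inside the checkpoint, so the pretrained run's optimizer and
    # scheduler settings no longer drive your fine-tuning learning rate.
    return MyModule.load_from_checkpoint(
        ckpt_path,
        optimizer=optimizer,
        scheduler=scheduler,
    )
```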
Thank you @drprojects, @gardiens (and @gvoysey)! That was exactly the problem. We're training from a pretrained model loaded with `load_from_checkpoint`. It took me quite some fiddling to find the right optimizer and scheduler to pass in, but after learning some Hydra it turned out I just needed to instantiate the new scheduler and optimizer from the config with Hydra and pass them in.
@drprojects I'm so sorry to waste your time and make you boot up the code only for the problem not to be reproducible on your machine; I have been there and I completely understand how that feels! Thank you all for your help! Problem solved!
Happy that you found a workaround and thanks for the detailed feedback!
I'd like to start this ticket off with: This repo is amazing! We're just starting to train models with it and your work is SotA and the weights are so tiny! Thank you for all of your hard work!
The current issue @gvoysey and I are working on is training for longer and with larger batch sizes.
Please correct my understanding if any of this is wrong.
To increase the batch size within GPU RAM, I go into the specific `configs/experiment/semantic/*.yaml` file that I'm using and increase `sample_graph_k`. As I understand it, increasing this number increases the NAG batch size loaded onto the GPU.
Additionally, I can theoretically increase the batch size further, without overloading the GPU, by increasing `gradient_accumulator`. By combining both of these, I now have an effective batch size of 40: every batch of 4 runs on the GPU together, but we don't update the weights until we have accumulated gradients from 10 of these batches, so it lets us train on larger batches (40) than our GPU can fit (4).
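To double-check my mental model of what gradient accumulation does, here is the generic pattern I have in mind (plain PyTorch with toy stand-ins, not SPT's actual training loop):

```python
import torch
from torch import nn

# Toy stand-ins (not SPT components) just to make the pattern runnable.
model = nn.Linear(8, 2)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(40)]

accumulation_steps = 10  # plays the role of gradient_accumulator
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

optimizer.zero_grad()
for i, (batch, target) in enumerate(loader):   # each "batch" holds 4 samples
    loss = criterion(model(batch), target)
    (loss / accumulation_steps).backward()     # gradients accumulate in .grad

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                       # one weight update per 4 x 10 = 40 samples
        optimizer.zero_grad()
```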
Now, I also want to train for longer.
In the comments for `max_epochs` there's a line:
# to keep same nb of steps: 25/9x more tiles, 2-step gradient accumulation -> epochs * 2 * 9 / 25
which implies that if I train with a larger batch I need to reduce `max_epochs`, and if I train with a smaller batch I should increase `max_epochs`. But if I want to train for more epochs while also increasing gradient accumulation or batch size, I can just increase `max_epochs`, right? (My reading of that arithmetic is sketched below.)
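Working through the example in that comment to check my reading of the scaling rule (plain arithmetic; the 540-epoch baseline is purely hypothetical):

```python
# Rule of thumb from the config comment: optimizer steps per epoch scale with
# (number of tiles) / (gradient accumulation), so to keep the total number of
# steps constant:
#   new_epochs = old_epochs * (new_accum / old_accum) * (old_tiles / new_tiles)

old_epochs = 540          # hypothetical baseline value
tile_ratio = 25 / 9       # "25/9x more tiles"
accum_factor = 2          # "2-step gradient accumulation"

new_epochs = old_epochs * accum_factor / tile_ratio   # = epochs * 2 * 9 / 25
print(round(new_epochs))  # ~389 epochs to keep the same number of steps
```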
The problem we're running into is that while increasing `max_epochs` does allow us to train for longer, the learning rate (`src.optim.CosineAnnealingLRWithWarmup`) doesn't go down to zero at `max_epochs`. Instead it goes down and back up in a sinusoidal way: it first reaches zero at the `max_epochs` value I should have set according to the formula in the comment above (the one that accounts for the modified batch size), then climbs again. This implies we should only train a model for a set number of iterations. Is that true? Can we change that?
In other words, how do we increase `max_epochs` and keep our batch size large, while having the learning rate appropriately decrease to zero at the `max_epochs` value?

Here's an example of the wandb output of the learning rate in text format (for `max_epochs` set to 2000 and an effective batch size of 40, as outlined above):
And attached is a graph of the learning rate:
Any help or advice would be appreciated! Thank you so much Damien!
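For reference, a minimal standalone script reproduces the same sinusoidal shape when a plain cosine-annealing schedule is stepped past its configured period. It uses torch's stock `CosineAnnealingLR` as a stand-in for `src.optim.CosineAnnealingLRWithWarmup`, and the numbers are only illustrative:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Dummy parameter/optimizer just to drive the scheduler.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=0.01)

# Period deliberately shorter than the number of epochs we actually run,
# mimicking a scheduler whose T_max was not updated to the new max_epochs.
scheduler = CosineAnnealingLR(optimizer, T_max=500, eta_min=0.0)

for epoch in range(2000):
    optimizer.step()
    scheduler.step()
    if epoch % 250 == 0:
        print(epoch, scheduler.get_last_lr()[0])

# The printed LR decays to ~0 around epoch 500, then climbs back up and
# oscillates: the same shape as in the attached wandb plot.
```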