Is it possible to do full fine-tuning of Llama3 70B with torchtune on a machine with 8x A100 80GB? #987
Comments
Hi @jaywongs, you are more than well equipped to run any of our recipes with that configuration :) Our documentation mentions some common hardware that users may run recipes with; anything beefier than what's mentioned should certainly be fine.
Thanks for your quick reply!
@jaywongs yep, you should be able to without any problems. Just give it a try (after downloading the model). Or if you want you can go all the way to
@jaywongs If you're talking about 70B, then this should also be possible. I use the following config (with the PagedAdamW optimizer from bitsandbytes) with the Llama2 70B model for full fine-tuning, which takes ~51GB of memory on 8x 80GB A100s, so it should be able to handle the larger vocab size of Llama3 70B as well. You'll need to make some updates, including pointing to the right checkpoint files and dir. Let me know if this works for you. Link: https://gist.github.com/kartikayk/a9bea0cca013a0f75e8859ad3250013c
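For intuition on why this fits, here is a back-of-envelope estimate of the sharded per-GPU footprint for params and grads under FSDP. This is an illustrative sketch, not measured data: it assumes bf16 params and grads (2 bytes each), and excludes activations and the AdamW state (which PagedAdamW pages off the GPU under memory pressure), which is why the observed ~51GB is higher.

```python
# Rough per-GPU memory for sharded params + grads when fully
# fine-tuning a 70B model with FSDP across 8 GPUs.
# Assumptions (illustrative): bf16 params/grads = 2 bytes each;
# optimizer state and activations are NOT counted here.
def per_gpu_gb(n_params: float = 70e9, n_gpus: int = 8,
               bytes_per_param: int = 2, bytes_per_grad: int = 2) -> float:
    gb = 1024 ** 3
    sharded_params = n_params * bytes_per_param / n_gpus / gb
    sharded_grads = n_params * bytes_per_grad / n_gpus / gb
    return sharded_params + sharded_grads

print(round(per_gpu_gb(), 1))  # ~32.6 GB before activations and buffers
```

The gap between this ~32.6GB floor and the observed ~51GB is activations, buffers, and allocator overhead, which is also the part that activation checkpointing and CPU offload can reduce.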
@kartikayk Is there a config linked? Don't think we can use PagedAdamW with FSDP, unfortunately.
@rohan-varma yeah, I copied the config in the gist above. I've been training 70B full with this setting for Llama2 (screenshot attached).
I don't think this is true? BnB had a recent release which supports FSDP, though 8-bit is WIP. I think this should extend to Llama3 as well, since it takes ~52GB of memory.
Thank you so much, I'll give it a try!
Oh ok, I think perhaps the initial version of the comment didn't have the gist linked. Curious if you've tried optimizer state save/load with FSDP + bnb optimizers? The last time @ebsmothers and I checked, this was failing, which is why the bnb optimizers weren't shipped in any of the Llama3 recipes, even though they helped memory. This might be a situation where it fails for the 8-bit optimizers but passes for the paged ones, because the latter don't introduce any additional state.
Yeah @kartikayk can correct me if I'm wrong, but I believe he was actually able to save the checkpoint properly with PagedAdamW + FSDP |
That sounds good. If we do ship such a recipe, we should be careful to document that the 8-bit low-precision optimizers aren't supported, because it'll be pretty easy for users to swap out PagedAdamW for PagedAdamW8bit and not realize checkpointing will break. This is of course still possible in today's setup, so perhaps we should just hard-error for now when FSDP + a low-precision optimizer is detected.
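The proposed hard error could be sketched as a small validation step in the recipe. This is an illustrative sketch, not torchtune's actual API: the function name and the substring check on the optimizer class name are assumptions.

```python
# Hypothetical fail-fast guard: refuse to combine FSDP with an 8-bit
# low-precision optimizer, whose extra quantization state cannot
# currently be checkpointed. Names here are illustrative.
def validate_fsdp_optimizer(optimizer_name: str, using_fsdp: bool) -> None:
    if using_fsdp and "8bit" in optimizer_name.lower():
        raise ValueError(
            f"{optimizer_name} is not supported with FSDP: its extra "
            "quantization state breaks optimizer-state save/load."
        )

validate_fsdp_optimizer("PagedAdamW", using_fsdp=True)  # fine: no extra state
```

Swapping in `"PagedAdamW8bit"` would raise a `ValueError` up front instead of failing later at checkpoint time.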
@kartikayk @ebsmothers Have you been able to train Llama3 70B on 8x A100 without CPU offload? Even with the paged optimizer, I'm getting an OOM in the backward pass, so I think we need CPU offload for Llama3 70B at least.
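If the backward pass is what OOMs, the usual knobs are CPU offload and activation checkpointing. A minimal config sketch, assuming torchtune-style flags for its distributed full-finetune recipes (the exact flag names are assumptions and may differ between versions):

```yaml
# Illustrative overrides for an OOM in the backward pass; flag names
# assumed from torchtune's distributed recipes, verify against your version.
fsdp_cpu_offload: True
enable_activation_checkpointing: True
```

CPU offload trades step time for memory by keeping sharded params (and optionally grads) in host RAM, so it is worth trying activation checkpointing alone first.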
@rohan-varma I haven't run it myself so would defer to your experience here. |
I'm not sure whether it's possible; the documentation doesn't seem to mention it.