Is it possible to do full fine-tuning of Llama3 70B with torchtune on a machine with 8x A100 80GB? #987
Comments
Hi @jaywongs, you are more than well equipped to run any of our recipes with that configuration :) Our documentation mentions some common hardware that users may run recipes with; anything beefier than what's mentioned should certainly be fine.
Thanks for your quick reply!
@jaywongs yep, you should be able to without any problems. Just give it a try (after downloading the model). Or if you want you can go all the way to
@jaywongs If you're talking about 70B, then this should also be possible. I use the following config (with the PagedAdamW optimizer from bitsandbytes) with the Llama2 70B model for full fine-tuning, which takes ~51GB of memory on 8x 80GB A100s, so it should be able to handle the larger vocab size of Llama3 70B as well. You'll need to make some updates, including pointing to the right checkpoint files and dir. Let me know if this works for you. Link: https://gist.github.com/kartikayk/a9bea0cca013a0f75e8859ad3250013c
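For intuition on why this fits, here is a back-of-envelope estimate of the sharded per-GPU footprint for params and grads under FSDP. This is an illustrative sketch, not measured data: it assumes bf16 params and grads (2 bytes each), and excludes activations and the AdamW state (which PagedAdamW pages off the GPU under memory pressure), which is why the observed ~51GB is higher.

```python
# Rough per-GPU memory for sharded params + grads when fully
# fine-tuning a 70B model with FSDP across 8 GPUs.
# Assumptions (illustrative): bf16 params/grads = 2 bytes each;
# optimizer state and activations are NOT counted here.
def per_gpu_gb(n_params: float = 70e9, n_gpus: int = 8,
               bytes_per_param: int = 2, bytes_per_grad: int = 2) -> float:
    gb = 1024 ** 3
    sharded_params = n_params * bytes_per_param / n_gpus / gb
    sharded_grads = n_params * bytes_per_grad / n_gpus / gb
    return sharded_params + sharded_grads

print(round(per_gpu_gb(), 1))  # ~32.6 GB before activations and buffers
```

The gap between this ~32.6GB floor and the observed ~51GB is activations, buffers, and allocator overhead, which is also the part that activation checkpointing and CPU offload can reduce.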
@kartikayk Is there a config linked? Don't think we can use PagedAdamW with FSDP, unfortunately.
@rohan-varma yeah, I copied the config in the gist above. I've been training 70B full with this setting for Llama2 (screenshot attached).
I don't think this is true? BnB had a recent release which supports FSDP, though 8-bit is WIP. I think this should extend to Llama3 as well, since it takes ~52GB of memory.
Thank you so much, I'll give it a try!
Oh ok, I think perhaps the initial version of the comment didn't have the gist linked. Curious if you've tried optimizer state save/load with FSDP + bnb optimizers? The last time @ebsmothers and I checked, this was failing, which is why the bnb optimizers weren't shipped in any of the Llama3 recipes, even though they helped memory. This might be a situation where it fails for the 8-bit optimizers but passes for the paged ones, because the latter don't introduce any additional state.
Yeah @kartikayk can correct me if I'm wrong, but I believe he was actually able to save the checkpoint properly with PagedAdamW + FSDP |
That sounds good. If we do ship such a recipe, we should be careful to document that the 8-bit low-precision optimizers aren't supported, because it'll be pretty easy for users to swap out PagedAdamW for PagedAdamW8bit and not realize checkpointing will break. This is of course still possible in today's setup, so perhaps we should just hard-error for now when FSDP + a low-precision optimizer is detected.
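The proposed hard error could be sketched as a small validation step in the recipe. This is an illustrative sketch, not torchtune's actual API: the function name and the substring check on the optimizer class name are assumptions.

```python
# Hypothetical fail-fast guard: refuse to combine FSDP with an 8-bit
# low-precision optimizer, whose extra quantization state cannot
# currently be checkpointed. Names here are illustrative.
def validate_fsdp_optimizer(optimizer_name: str, using_fsdp: bool) -> None:
    if using_fsdp and "8bit" in optimizer_name.lower():
        raise ValueError(
            f"{optimizer_name} is not supported with FSDP: its extra "
            "quantization state breaks optimizer-state save/load."
        )

validate_fsdp_optimizer("PagedAdamW", using_fsdp=True)  # fine: no extra state
```

Swapping in `"PagedAdamW8bit"` would raise a `ValueError` up front instead of failing later at checkpoint time.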
@kartikayk @ebsmothers Have you been able to train Llama3 70B on 8x A100 without CPU offload? Even with the paged optimizer, I'm getting an OOM in the backward pass, so I think we need CPU offload for Llama3 70B at least.
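If the backward pass is what OOMs, the usual knobs are CPU offload and activation checkpointing. A minimal config sketch, assuming torchtune-style flags for its distributed full-finetune recipes (the exact flag names are assumptions and may differ between versions):

```yaml
# Illustrative overrides for an OOM in the backward pass; flag names
# assumed from torchtune's distributed recipes, verify against your version.
fsdp_cpu_offload: True
enable_activation_checkpointing: True
```

CPU offload trades step time for memory by keeping sharded params (and optionally grads) in host RAM, so it is worth trying activation checkpointing alone first.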
@rohan-varma I haven't run it myself so would defer to your experience here. |
I'm not sure whether it's possible; the documentation doesn't seem to mention it.