
What is the full batch size if mesh_dim is set to 1,1,-1, on TPUv3-8? #94

TPFRL opened this issue Sep 23, 2023 · 3 comments
TPFRL commented Sep 23, 2023

Hi, thanks for this amazing repo.
I was wondering how I should set the batch size to get the desired full batch size.

For example, if I set train_dataset.huggingface_dataset.batch_size to 1 on TPUv3-8,
what is the full batch size given mesh_dim 1,1,-1 / 1,-1,1 / -1,1,1 ?
Is it 8 in all cases, or 1?

Thanks!

young-geng (Owner) commented

Different mesh dims correspond to different sharding strategies. While they do not define a batch size, they do impose certain constraints on the possible batch sizes (see the sketch after this list):

  • 1,1,-1 corresponds to tensor parallelism only, so you can use any batch size you want
  • 1,-1,1 corresponds to full FSDP, which means your batch size needs to be a multiple of the number of devices (8 here)
  • -1,1,1 corresponds to full DP, which also means your batch size needs to be a multiple of the number of devices (8 here)
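For concreteness, here is a minimal sketch of how a mesh_dim string maps onto the 8 devices of a v3-8. It assumes the usual ('dp', 'fsdp', 'mp') axis-name convention and is not the repo's exact code; the -1 entry expands to absorb all remaining devices, the same way -1 works in a reshape:

```python
import numpy as np
import jax
from jax.sharding import Mesh

def make_mesh(mesh_dim_str):
    # Parse a string like "1,1,-1" into per-axis sizes.
    dims = [int(d) for d in mesh_dim_str.split(',')]
    n_devices = jax.device_count()  # 8 on a TPU v3-8
    # Expand the single -1 entry to fill the remaining devices.
    known = int(np.prod([d for d in dims if d != -1]))
    dims = [n_devices // known if d == -1 else d for d in dims]
    devices = np.array(jax.devices()).reshape(dims)
    return Mesh(devices, axis_names=('dp', 'fsdp', 'mp'))

mesh = make_mesh('1,1,-1')   # dp=1, fsdp=1, mp=8 -> tensor parallelism only
```

Since the batch dimension is sharded across the dp and fsdp axes, the global batch size must be a multiple of dp * fsdp: 1 for 1,1,-1, and 8 for the other two configurations on a v3-8.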

jcole75 commented Oct 4, 2023

> Hi, thanks for this amazing repo. I was wondering how I should set the batch size to get the desired full batch size.
>
> For example, if I set train_dataset.huggingface_dataset.batch_size to 1 on TPUv3-8, what is the full batch size given mesh_dim 1,1,-1 / 1,-1,1 / -1,1,1 ? Is it 8 in all cases, or 1?
>
> Thanks!

Did you get this to run on a v3? I seem to always get HLO out-of-memory errors.

young-geng (Owner) commented

A single v3-8 only has 128GB of memory in total, which might not be sufficient for training a 7B model.
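For intuition, here is a rough back-of-the-envelope estimate (an assumption-laden sketch, not a profiler measurement) of the training-state memory for a 7B model, assuming bf16 weights and gradients plus fp32 Adam moments:

```python
# Rough memory estimate for training a 7B-parameter model with Adam.
# Assumptions (not measured): bf16 weights and gradients, fp32 Adam
# first/second moments, and no activation memory counted at all.
params = 7e9

weights_gb = params * 2 / 1e9      # bf16 weights:          ~14 GB
grads_gb   = params * 2 / 1e9      # bf16 gradients:        ~14 GB
adam_gb    = params * 4 * 2 / 1e9  # fp32 m and v moments:  ~56 GB

total_gb = weights_gb + grads_gb + adam_gb
print(f"~{total_gb:.0f} GB of training state")  # ~84 GB of the 128 GB HBM
```

That leaves well under half of the 128GB for activations, XLA buffers, and padding, which would explain OOM errors even at small batch sizes.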
