
functional approach with distributed training #5

Open

kevinlin311tw opened this issue Jan 18, 2022 · 3 comments

kevinlin311tw commented Jan 18, 2022

Thank you for the great work!

Could you please provide an example of the functional approach with distributed multi-GPU training?

luyug (Owner) commented Jan 20, 2022

Hi @kevinlin311tw, sure, I can add an example in a day or two.

As a side note, the functional approach itself is actually agnostic to parallelism: you only need to wrap your encoder model (e.g., in DistributedDataParallel) and do the cross-process communication in the loss function, as in the sketch below. Maybe this comment will be helpful if you want to give it a try yourself.
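A minimal sketch of that pattern, assuming the `cached` and `cat_input_tensor` decorators from `grad_cache.functional`; the differentiable `torch.distributed.nn.all_gather` stands in for the cross-process step, and `encoder` / `local_rank` are placeholders for your own setup:

```python
import torch
import torch.nn.functional as F
from torch.distributed.nn import all_gather  # differentiable all-gather
from torch.nn.parallel import DistributedDataParallel as DDP

from grad_cache.functional import cached, cat_input_tensor

# `encoder` and `local_rank` are assumed to come from your usual
# distributed setup; only the loss function needs to know about DDP.
encoder = DDP(encoder.cuda(local_rank), device_ids=[local_rank])

@cached
def call_model(model, inputs):
    # Encode one sub-batch; @cached returns the detached representations
    # plus a closure that later replays this forward pass with gradients.
    return model(**inputs).pooler_output

@cat_input_tensor
def contrastive_loss(x, y, temperature=0.05):
    # Gather representations from all ranks so every process computes the
    # loss over the full global batch; gradients still flow back to the
    # local slice because torch.distributed.nn.all_gather is differentiable.
    x = torch.cat(all_gather(x), dim=0)
    y = torch.cat(all_gather(y), dim=0)
    scores = torch.matmul(x, y.transpose(0, 1)) / temperature
    target = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(scores, target)
```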

luyug (Owner) commented Jan 21, 2022

I've added an example in the readme, along with a new all-gather decorator that may be helpful.
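For readers without the README at hand, the gist of such a decorator can be sketched in a few lines (a hypothetical reimplementation with an illustrative name; the repo's actual decorator and its signature are in the README):

```python
import functools

import torch
from torch.distributed.nn import all_gather  # differentiable collective

def gather_input_tensor(fn):
    # Hypothetical sketch of an all-gather decorator: concatenate each
    # positional tensor argument across ranks before calling the wrapped
    # loss, so the loss sees the global batch while the differentiable
    # all_gather routes gradients back to each rank's local slice.
    @functools.wraps(fn)
    def wrapped(*tensors, **kwargs):
        gathered = [torch.cat(all_gather(t), dim=0) for t in tensors]
        return fn(*gathered, **kwargs)
    return wrapped
```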

Feel free to ping me if you have any questions or find any problems with the code.

memray commented Sep 1, 2023

@luyug I wonder if you have a working example of the functional approach with DDP?
I ran into the error below:

    surrogate.backward()
  File "/export/share/ruimeng/env/anaconda/envs/llm/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/export/share/ruimeng/env/anaconda/envs/llm/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 195 with name q_encoder.encoder.encoder.layer.11.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.

According to this post, DDP doesn't seem to support multiple forward/backward passes per iteration. Can you confirm this and/or suggest a solution?

Thank you,
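The usual workaround for this error with gradient caching under DDP (a sketch under assumptions, not a fix confirmed in this thread) is to run every replay backward except the last one inside the model's no_sync() context, so DDP's reducer hooks fire exactly once per parameter per iteration:

```python
from contextlib import nullcontext

# Assumes the functional-style caches from the README example above:
# `cache_x` / `cache_y` hold the cached representations, `closures_x` /
# `closures_y` the replay closures, and `ddp_model` is the DDP-wrapped
# encoder (names are illustrative).
loss = contrastive_loss(cache_x, cache_y)
loss.backward()  # touches only cached tensors, so no DDP hooks fire here

replays = list(zip(closures_x, cache_x)) + list(zip(closures_y, cache_y))
for i, (f, r) in enumerate(replays):
    # Suppress DDP gradient synchronization on every replay backward
    # except the last; the final synced backward all-reduces the full
    # accumulated gradients, as in standard gradient accumulation.
    with ddp_model.no_sync() if i < len(replays) - 1 else nullcontext():
        f(r)
```

If I recall correctly, the non-functional GradCache class manages this synchronization internally; with the functional approach it is left to the user.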
