
functional approach with distributed training #5

Open

kevinlin311tw opened this issue Jan 18, 2022 · 3 comments

kevinlin311tw commented Jan 18, 2022

Thank you for the great work!

Could you please provide an example of the functional approach with distributed multi-GPU training?

luyug (Owner) commented Jan 20, 2022

Hi @kevinlin311tw, sure, I can add an example in a day or two.

As a side note, the functional approach itself is actually agnostic to parallelism: you only need to wrap your encoder model (e.g., in DistributedDataParallel) and do the cross-process communication in the loss function, as in the sketch below. Maybe this comment will be helpful if you want to give it a try yourself.
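A minimal sketch of that pattern, assuming the `cached` and `cat_input_tensor` decorators from `grad_cache.functional`; the differentiable `torch.distributed.nn.all_gather` stands in for the cross-process step, and `encoder` / `local_rank` are placeholders for your own setup:

```python
import torch
import torch.nn.functional as F
from torch.distributed.nn import all_gather  # differentiable all-gather
from torch.nn.parallel import DistributedDataParallel as DDP

from grad_cache.functional import cached, cat_input_tensor

# `encoder` and `local_rank` are assumed to come from your usual
# distributed setup; only the loss function needs to know about DDP.
encoder = DDP(encoder.cuda(local_rank), device_ids=[local_rank])

@cached
def call_model(model, inputs):
    # Encode one sub-batch; @cached returns the detached representations
    # plus a closure that later replays this forward pass with gradients.
    return model(**inputs).pooler_output

@cat_input_tensor
def contrastive_loss(x, y, temperature=0.05):
    # Gather representations from all ranks so every process computes the
    # loss over the full global batch; gradients still flow back to the
    # local slice because torch.distributed.nn.all_gather is differentiable.
    x = torch.cat(all_gather(x), dim=0)
    y = torch.cat(all_gather(y), dim=0)
    scores = torch.matmul(x, y.transpose(0, 1)) / temperature
    target = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(scores, target)
```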

luyug (Owner) commented Jan 21, 2022

I've added an example in the readme, along with a new all-gather decorator that may be helpful.
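For readers without the README at hand, the gist of such a decorator can be sketched in a few lines (a hypothetical reimplementation with an illustrative name; the repo's actual decorator and its signature are in the README):

```python
import functools

import torch
from torch.distributed.nn import all_gather  # differentiable collective

def gather_input_tensor(fn):
    # Hypothetical sketch of an all-gather decorator: concatenate each
    # positional tensor argument across ranks before calling the wrapped
    # loss, so the loss sees the global batch while the differentiable
    # all_gather routes gradients back to each rank's local slice.
    @functools.wraps(fn)
    def wrapped(*tensors, **kwargs):
        gathered = [torch.cat(all_gather(t), dim=0) for t in tensors]
        return fn(*gathered, **kwargs)
    return wrapped
```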

Feel free to ping me if you have any questions or find any problems with the code.

memray commented Sep 1, 2023

@luyug I wonder if you have a working example of the functional approach with DDP?
I ran into the error below:

    surrogate.backward()
  File "/export/share/ruimeng/env/anaconda/envs/llm/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/export/share/ruimeng/env/anaconda/envs/llm/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 195 with name q_encoder.encoder.encoder.layer.11.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.

According to this post, DDP doesn't seem to support multiple forward/backward passes per iteration. Can you confirm this and/or suggest a solution?

Thank you,
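The usual workaround for this error with gradient caching under DDP (a sketch under assumptions, not a fix confirmed in this thread) is to run every replay backward except the last one inside the model's no_sync() context, so DDP's reducer hooks fire exactly once per parameter per iteration:

```python
from contextlib import nullcontext

# Assumes the functional-style caches from the README example above:
# `cache_x` / `cache_y` hold the cached representations, `closures_x` /
# `closures_y` the replay closures, and `ddp_model` is the DDP-wrapped
# encoder (names are illustrative).
loss = contrastive_loss(cache_x, cache_y)
loss.backward()  # touches only cached tensors, so no DDP hooks fire here

replays = list(zip(closures_x, cache_x)) + list(zip(closures_y, cache_y))
for i, (f, r) in enumerate(replays):
    # Suppress DDP gradient synchronization on every replay backward
    # except the last; the final synced backward all-reduces the full
    # accumulated gradients, as in standard gradient accumulation.
    with ddp_model.no_sync() if i < len(replays) - 1 else nullcontext():
        f(r)
```

If I recall correctly, the non-functional GradCache class manages this synchronization internally; with the functional approach it is left to the user.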
