
Requesting example to use PyTorch FSDP #19

Open
abdulmuneer opened this issue Apr 22, 2024 · 1 comment

Comments

@abdulmuneer

Hi,
Does Determined support distributed training with PyTorch FSDP? I can see examples for DeepSpeed, but I specifically need to use the native FSDP feature of PyTorch 2.2 (something like https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?highlight=pre%20training).

@ioga
Contributor

ioga commented Apr 22, 2024

Hello, we haven't added one here yet, but there's an unofficial example here: https://github.com/garrett361/determined/tree/scratchwork/scratchwork/fsdp_min

For context, PyTorchTrial does not support FSDP, and there are no plans to add it. For FSDP, you should use the Core API instead, and it will be effectively the same as torch DDP: the standard torch distributed launcher works the same, and metrics logging and hyperparameter search work the same. If you checkpoint the full model from rank 0, that works the same as well. If you want sharded checkpointing, use the shard=True option of the checkpointing API.
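
To make that concrete, here is a minimal, untested sketch of what an FSDP training script on the Core API might look like. The `build_model()` and `get_batches()` helpers are hypothetical placeholders; the Determined calls (`det.core.init`, `DistributedContext.from_torch_distributed`, `report_training_metrics`, `checkpoint.store_path`) and the FSDP state-dict handling follow the public APIs as of PyTorch 2.2, and may need adjusting for your setup:

```python
# Minimal sketch (untested): FSDP training under Determined's Core API.
# build_model() and get_batches() are hypothetical placeholders.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

import determined as det


def train(core_context: det.core.Context) -> None:
    rank = dist.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    # Wrap the model in FSDP exactly as in a plain torch script.
    model = FSDP(build_model().to(device))  # build_model(): hypothetical helper
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    steps_completed = 0
    for inputs, targets in get_batches(device):  # get_batches(): hypothetical helper
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        steps_completed += 1

        # Metrics logging works the same as with DDP: report from rank 0.
        if rank == 0:
            core_context.train.report_training_metrics(
                steps_completed=steps_completed, metrics={"loss": loss.item()}
            )

    # Full (unsharded) checkpoint, gathered to and written from rank 0 only.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if rank == 0:
        with core_context.checkpoint.store_path(
            {"steps_completed": steps_completed}
        ) as (path, _storage_id):
            torch.save(state, path / "model.pt")
    # For sharded checkpointing, each rank saves its own shard and passes
    # shard=True to store_path, e.g.:
    #   with core_context.checkpoint.store_path(metadata, shard=True) as (path, _):
    #       torch.save(local_shard_state, path / f"rank_{rank}.pt")


if __name__ == "__main__":
    dist.init_process_group("nccl")
    distributed = det.core.DistributedContext.from_torch_distributed()
    with det.core.init(distributed=distributed) as core_context:
        train(core_context)
    dist.destroy_process_group()
```

Launch it the same way you would a DDP script, e.g. `torchrun --nproc_per_node=4 train.py` on each node.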
