[REQUEST] Launcher mode with SSH bypass #5510
Labels: enhancement (New feature or request)
@dogacancolak-kensho, thanks for offering a PR for this useful enhancement. Please submit the PR at your convenience. Thanks!

Do I need to be given permissions? I'm trying to push my local branch.
Is your feature request related to a problem? Please describe.
#2679
As previously mentioned in that issue, the existing launch mechanism requires password-less SSH. We preferred to avoid this at Kensho Technologies, as our current multi-node training framework uses a launch mechanism similar to torchrun. Instead of a launcher node SSH-ing the command to the workers, torchrun works by providing a master address/port and a node rank to each worker. By bypassing SSH and using DeepSpeed directly, torchrun-style, we can seamlessly integrate DeepSpeed into our existing setup instead of maintaining two different launch topologies.

Describe the solution you'd like
In a private fork of DeepSpeed, we were able to get training working without using SSH. To do this, we added a flag to the launcher/runner called --no_ssh, which also requires a --node_rank flag to be provided. Then, in the runner, the command is run as if multi_node_exec were disabled. We have verified that this method works.

Describe alternatives you've considered
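To make the idea above concrete, here is a minimal sketch of how a no-SSH runner branch might assemble the command to run locally on each worker. The flag names (--no_ssh, --node_rank) come from this issue; the helper name build_local_cmd, the module path in the command, and the overall structure are illustrative assumptions, not DeepSpeed's actual implementation.

```python
# Hedged sketch: a runner branch that skips SSH dispatch and launches
# locally, torchrun-style. Only --no_ssh/--node_rank are taken from the
# issue; everything else here is a hypothetical illustration.
import shlex


def build_local_cmd(script, master_addr, master_port, node_rank, no_ssh):
    """Return the command this node would run on itself.

    With no_ssh=True, instead of SSH-ing commands from a launcher node,
    each worker starts the same command, differing only in its node rank
    and sharing the master address/port for rendezvous.
    """
    if not no_ssh:
        raise NotImplementedError("SSH dispatch path not sketched here")
    return [
        "python", "-m", "deepspeed.launcher.launch",  # assumed entry point
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


# Every worker builds the same command; only node_rank differs per host.
cmd = build_local_cmd("train.py", "10.1.2.3", 29500, node_rank=1, no_ssh=True)
print(shlex.join(cmd))
```

The key design point, as described in the issue, is that no node needs SSH access to any other: rendezvous happens through the master address/port, exactly as with torchrun.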
As mentioned, we considered setting up two topologies based on the framework used. For example, GPT-NeoX uses the DeepSpeed launcher, so we would need the SSH setup; MosaicML's llm-foundry, however, works by independently running the command on each worker (similar to torchrun). We didn't want to maintain two architectures depending on which framework was being used for training.
Additional context
If deemed useful by the project maintainers, we can make a PR with S&P Global/Kensho Technologies as the contributing entity.