[REQUEST] Launcher mode with SSH bypass #5510
Labels: enhancement (New feature or request)
@dogacancolak-kensho, thanks for offering a PR for this useful enhancement. Please submit the PR at your convenience. Thanks!

Do I need to be given permissions? I'm trying to push my local branch.
Is your feature request related to a problem? Please describe.
#2679
As previously mentioned in that issue, the existing launch mechanism requires password-less SSH. We preferred to avoid this at Kensho Technologies, as our current multi-node training framework uses a launch mechanism similar to torchrun. Instead of a launcher node SSH-ing the command to the workers, torchrun works by providing a master address/port and a node rank to each worker. By bypassing SSH and using DeepSpeed directly, torchrun-style, we can seamlessly integrate DeepSpeed into our existing setup instead of maintaining two different launch topologies.

Describe the solution you'd like
In a private fork of DeepSpeed, we were able to get training working without using SSH. To do this, we added a flag to the launcher/runner called --no_ssh, which also requires a --node_rank flag to be provided. Then, in the runner, the command is run as if multi_node_exec were disabled. We have verified that this method works.

Describe alternatives you've considered
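To make the idea above concrete, here is a minimal sketch of how a no-SSH runner branch might assemble the command to run locally on each worker. The flag names (--no_ssh, --node_rank) come from this issue; the helper name build_local_cmd, the module path in the command, and the overall structure are illustrative assumptions, not DeepSpeed's actual implementation.

```python
# Hedged sketch: a runner branch that skips SSH dispatch and launches
# locally, torchrun-style. Only --no_ssh/--node_rank are taken from the
# issue; everything else here is a hypothetical illustration.
import shlex


def build_local_cmd(script, master_addr, master_port, node_rank, no_ssh):
    """Return the command this node would run on itself.

    With no_ssh=True, instead of SSH-ing commands from a launcher node,
    each worker starts the same command, differing only in its node rank
    and sharing the master address/port for rendezvous.
    """
    if not no_ssh:
        raise NotImplementedError("SSH dispatch path not sketched here")
    return [
        "python", "-m", "deepspeed.launcher.launch",  # assumed entry point
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


# Every worker builds the same command; only node_rank differs per host.
cmd = build_local_cmd("train.py", "10.1.2.3", 29500, node_rank=1, no_ssh=True)
print(shlex.join(cmd))
```

The key design point, as described in the issue, is that no node needs SSH access to any other: rendezvous happens through the master address/port, exactly as with torchrun.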
As mentioned, we considered setting up two topologies based on the framework used. For example, GPT-NeoX uses the DeepSpeed launcher, so we would need the SSH setup; MosaicML's llm-foundry, however, works by independently running the command on each worker (similar to torchrun). We didn't want to maintain two architectures depending on which framework was being used for training.
Additional context
If deemed useful by the project maintainers, we can make a PR with S&P Global/Kensho Technologies as the contributing entity.