Mpi mode sets all nodes to the same SM_CURRENT_HOST #158

verdimrc opened this issue Oct 31, 2022 · 0 comments

Describe the bug
In MPI mode, all nodes report the same SM_CURRENT_HOST (the master's).

To reproduce
Run a PyTorch estimator in MPI mode with more than one node. The training entrypoint can simply dump all of its environment variables to stdout (which should end up in the CloudWatch logs). From there, we can see that SM_CURRENT_HOST on every node is set to the same value (i.e., the master's), whereas PMIX_HOSTNAME is set correctly on each node.
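As a concrete illustration, an entrypoint along these lines is enough to surface the problem (a hypothetical minimal script, not part of the toolkit):

```python
# Minimal (hypothetical) training entrypoint: dump the SageMaker- and
# PMIx-related environment variables so each node's values can be
# compared in the CloudWatch logs.
import os

def dump_env():
    for name, value in sorted(os.environ.items()):
        if name.startswith(("SM_", "PMIX_")):
            print(f"{name}={value}")

if __name__ == "__main__":
    dump_env()
```

With the bug present, every node prints the master's SM_CURRENT_HOST, while PMIX_HOSTNAME still differs per node.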

Expected behavior
The master node should not propagate its SM_CURRENT_HOST to the other nodes.

Screenshots or logs

[Screenshot: Screenshot 2022-08-10 at 20 28 00]

System information
PyTorch DLC 1.11.0-gpu-py38

Additional context

This patch corrected the SM_CURRENT_HOST issue on my training jobs.

# https://github.com/aws/sagemaker-training-toolkit/blob/3188a9df7803798defb043a332d789f7474219d0/src/sagemaker_training/mpi.py#L353
for name in self._env_vars:
    if name.startswith("SM_"):    # New addition
        continue                  # New addition
    command.extend(["-x", name])
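The effect of the patch can be shown in isolation. Skipping SM_-prefixed names when building the list of `-x` export flags means mpirun no longer pushes the master's SM_CURRENT_HOST onto the workers (a standalone sketch of the filtering logic, not the toolkit's actual code; the function name is made up for illustration):

```python
# Sketch of the filtering the patch applies when assembling the mpirun
# command line: each environment variable name normally becomes a
# "-x NAME" pair, but SageMaker per-node variables (SM_*) are skipped
# so every node keeps its own locally set value.
def build_export_flags(env_var_names):
    command = []
    for name in env_var_names:
        if name.startswith("SM_"):  # do not export per-node SM_* variables
            continue
        command.extend(["-x", name])
    return command

# SM_CURRENT_HOST is dropped; the other names are still exported.
flags = build_export_flags(["SM_CURRENT_HOST", "PMIX_HOSTNAME", "NCCL_DEBUG"])
```

With Open MPI, `-x NAME` exports the launching process's value of NAME to all ranks, which is exactly why the master's SM_CURRENT_HOST was overriding the workers' values before the patch.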