
[Feature Request] multiple GPUs on a single machine #2160

Open

sgfCrazy opened this issue May 13, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@sgfCrazy

Motivation

I want to train using multiple GPUs on a single machine, but I can't find any relevant tutorial documentation.

Could you provide an example of training with multiple GPUs on a single machine, for instance, updating the network on cuda:0 while gathering data on cuda:1?

Thanks!
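In plain PyTorch, the pattern I have in mind looks roughly like this. This is a sketch of the general idea, not the torchrl collector API; the device names are placeholders and it falls back to CPU when fewer than two GPUs are available:

```python
import torch
import torch.nn as nn

# Pick a learner device and a separate collection device when possible.
if torch.cuda.device_count() >= 2:
    train_device, collect_device = torch.device("cuda:0"), torch.device("cuda:1")
else:
    train_device = collect_device = torch.device("cpu")

policy = nn.Linear(4, 2).to(train_device)            # updated on cuda:0
rollout_policy = nn.Linear(4, 2).to(collect_device)  # gathers data on cuda:1
rollout_policy.load_state_dict(policy.state_dict())  # copy_ handles cross-device loads

opt = torch.optim.SGD(policy.parameters(), lr=0.1)

# One "collect, then update" step:
obs = torch.randn(8, 4, device=collect_device)       # stand-in for an env batch
with torch.no_grad():
    actions = rollout_policy(obs)                    # rollout on the collection device

obs_t, act_t = obs.to(train_device), actions.to(train_device)  # ship batch to learner
loss = ((policy(obs_t) - act_t) ** 2).mean()         # placeholder loss
opt.zero_grad()
loss.backward()
opt.step()
rollout_policy.load_state_dict(policy.state_dict())  # sync updated weights back
```

An example like this, but using the torchrl collector classes with their `device`/`storing_device` arguments, is what I'm hoping the docs could show.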


Checklist

  • I have checked that there is no similar issue in the repo (required)
@sgfCrazy sgfCrazy added the enhancement New feature or request label May 13, 2024
@sgfCrazy
Author

When I run this script (https://github.com/pytorch/rl/blob/v0.3.1/examples/distributed/collectors/single_machine/generic.py), it reports the following error:
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.num_envs to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.num_envs for environment variables or env.get_wrapper_attr('num_envs') that will search the reminding wrappers.
logger.warn(
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.reward_space to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.reward_space for environment variables or env.get_wrapper_attr('reward_space') that will search the reminding wrappers.
logger.warn(
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.num_envs to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.num_envs for environment variables or env.get_wrapper_attr('num_envs') that will search the reminding wrappers.
logger.warn(
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.reward_space to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.reward_space for environment variables or env.get_wrapper_attr('reward_space') that will search the reminding wrappers.
logger.warn(
[{'device': 'cuda:1', 'storing_device': 'cuda:1'}]
cuda:0
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.num_envs to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.num_envs for environment variables or env.get_wrapper_attr('num_envs') that will search the reminding wrappers.
logger.warn(
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.reward_space to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.reward_space for environment variables or env.get_wrapper_attr('reward_space') that will search the reminding wrappers.
logger.warn(
0%| | 0/3000000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/cephfs/PERSONAL/usr/chenjiaxin/sgf/code/gfkd/DHPT/tests/test_multi_gpu_one_machine.py", line 162, in
for i, data in enumerate(collector):
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/torchrl/collectors/distributed/generic.py", line 783, in iterator
yield from self._iterator_dist()
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/torchrl/collectors/distributed/generic.py", line 799, in _iterator_dist
self._tensordict_out[i].irecv(src=rank, return_premature=True)
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/tensordict/base.py", line 3613, in irecv
return self._irecv(
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/tensordict/base.py", line 3654, in _irecv
_future_list.append(dist.irecv(value, src=src, tag=_tag, group=group))
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1628, in irecv
return pg.recv([tensor], src, tag)
RuntimeError: No backend type associated with device type cpu
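For reference, this error usually means the process group was initialized with a backend (e.g. NCCL) that cannot transport CPU tensors, so the `irecv` of a CPU tensor has no backend to use. A minimal single-process sketch showing that CPU tensors work once a CPU-capable backend such as gloo is initialized (the address/port values are placeholders):

```python
import os
import torch
import torch.distributed as dist

# Single-process world purely for illustration. On recent PyTorch a combined
# spec such as "cpu:gloo,cuda:nccl" can cover both CPU and CUDA tensors.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.zeros(3)   # a CPU tensor; fine under gloo
dist.all_reduce(t)   # trivial with world_size=1, but exercises the backend
dist.destroy_process_group()
```

I am not certain which backend the distributed collector example selects internally, so this only illustrates the backend/device mismatch named in the traceback.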

@sgfCrazy
Author

system info:

import torchrl, tensordict, torch, numpy, sys
print(torch.__version__, tensordict.__version__, torchrl.__version__, numpy.__version__, sys.version, sys.platform)

2.2.1+cu121 0.3.1 0.3.1 1.26.4 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0] linux
