
[Feature Request] multiple GPUs on a single machine #2160

Open

sgfCrazy opened this issue May 13, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@sgfCrazy

Motivation

I want to train using multiple GPUs on a single machine, but I can't find any relevant tutorial documentation.

Could you provide an example of training with multiple GPUs on a single machine, for instance, updating the network on cuda:0 while gathering data on cuda:1?

Thanks!
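In plain PyTorch, the pattern I have in mind looks roughly like this. This is a sketch of the general idea, not the torchrl collector API; the device names are placeholders and it falls back to CPU when fewer than two GPUs are available:

```python
import torch
import torch.nn as nn

# Pick a learner device and a separate collection device when possible.
if torch.cuda.device_count() >= 2:
    train_device, collect_device = torch.device("cuda:0"), torch.device("cuda:1")
else:
    train_device = collect_device = torch.device("cpu")

policy = nn.Linear(4, 2).to(train_device)            # updated on cuda:0
rollout_policy = nn.Linear(4, 2).to(collect_device)  # gathers data on cuda:1
rollout_policy.load_state_dict(policy.state_dict())  # copy_ handles cross-device loads

opt = torch.optim.SGD(policy.parameters(), lr=0.1)

# One "collect, then update" step:
obs = torch.randn(8, 4, device=collect_device)       # stand-in for an env batch
with torch.no_grad():
    actions = rollout_policy(obs)                    # rollout on the collection device

obs_t, act_t = obs.to(train_device), actions.to(train_device)  # ship batch to learner
loss = ((policy(obs_t) - act_t) ** 2).mean()         # placeholder loss
opt.zero_grad()
loss.backward()
opt.step()
rollout_policy.load_state_dict(policy.state_dict())  # sync updated weights back
```

An example like this, but using the torchrl collector classes with their `device`/`storing_device` arguments, is what I'm hoping the docs could show.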


Checklist

  • I have checked that there is no similar issue in the repo (required)
@sgfCrazy sgfCrazy added the enhancement New feature or request label May 13, 2024
@sgfCrazy
Author

When I run this script (https://github.com/pytorch/rl/blob/v0.3.1/examples/distributed/collectors/single_machine/generic.py), it reports the following error:
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.num_envs to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.num_envs for environment variables or env.get_wrapper_attr('num_envs') that will search the reminding wrappers.
logger.warn(
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.reward_space to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.reward_space for environment variables or env.get_wrapper_attr('reward_space') that will search the reminding wrappers.
logger.warn(
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.num_envs to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.num_envs for environment variables or env.get_wrapper_attr('num_envs') that will search the reminding wrappers.
logger.warn(
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.reward_space to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.reward_space for environment variables or env.get_wrapper_attr('reward_space') that will search the reminding wrappers.
logger.warn(
[{'device': 'cuda:1', 'storing_device': 'cuda:1'}]
cuda:0
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.num_envs to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.num_envs for environment variables or env.get_wrapper_attr('num_envs') that will search the reminding wrappers.
logger.warn(
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.reward_space to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.reward_space for environment variables or env.get_wrapper_attr('reward_space') that will search the reminding wrappers.
logger.warn(
0%| | 0/3000000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/cephfs/PERSONAL/usr/chenjiaxin/sgf/code/gfkd/DHPT/tests/test_multi_gpu_one_machine.py", line 162, in
for i, data in enumerate(collector):
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/torchrl/collectors/distributed/generic.py", line 783, in iterator
yield from self._iterator_dist()
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/torchrl/collectors/distributed/generic.py", line 799, in _iterator_dist
self._tensordict_out[i].irecv(src=rank, return_premature=True)
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/tensordict/base.py", line 3613, in irecv
return self._irecv(
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/tensordict/base.py", line 3654, in _irecv
_future_list.append(dist.irecv(value, src=src, tag=_tag, group=group))
File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1628, in irecv
return pg.recv([tensor], src, tag)
RuntimeError: No backend type associated with device type cpu
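For reference, this error usually means the process group was initialized with a backend (e.g. NCCL) that cannot transport CPU tensors, so the `irecv` of a CPU tensor has no backend to use. A minimal single-process sketch showing that CPU tensors work once a CPU-capable backend such as gloo is initialized (the address/port values are placeholders):

```python
import os
import torch
import torch.distributed as dist

# Single-process world purely for illustration. On recent PyTorch a combined
# spec such as "cpu:gloo,cuda:nccl" can cover both CPU and CUDA tensors.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.zeros(3)   # a CPU tensor; fine under gloo
dist.all_reduce(t)   # trivial with world_size=1, but exercises the backend
dist.destroy_process_group()
```

I am not certain which backend the distributed collector example selects internally, so this only illustrates the backend/device mismatch named in the traceback.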

@sgfCrazy
Author

system info:

import torchrl, tensordict, torch, numpy, sys
print(torch.__version__, tensordict.__version__, torchrl.__version__, numpy.__version__, sys.version, sys.platform)

2.2.1+cu121 0.3.1 0.3.1 1.26.4 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0] linux
