
DDP training on multiple GPUs: Expected all tensors to be on the same device, but found at least two devices #117

yxk9810 opened this issue Apr 29, 2024 · 2 comments

yxk9810 commented Apr 29, 2024

File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
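This error means the `input_ids` batch is sitting on a different GPU than the embedding weight when `torch.embedding` runs. A minimal sketch of the usual remedy (an illustration, not Tevatron's actual code; `move_to_model_device` is a hypothetical helper): move every tensor in the batch onto whatever device the model's parameters live on before the forward pass.

```python
import torch
import torch.nn as nn

def move_to_model_device(batch, model):
    """Move every tensor in a batch dict onto the model's device so the
    embedding lookup sees weights and indices on the same device."""
    device = next(model.parameters()).device
    return {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch.items()}

# Demo on CPU; on a multi-GPU box the mismatch would be cuda:0 vs cuda:1.
model = nn.Embedding(30522, 16)  # BERT-sized vocab, toy embedding dim
batch = {"input_ids": torch.tensor([[101, 2023, 102]])}
batch = move_to_model_device(batch, model)
out = model(batch["input_ids"])  # no device-mismatch RuntimeError now
```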

pytorch==1.12.0
transformers==4.20.0

Running on 2× T4 GPUs on Kaggle: https://www.kaggle.com/code/jackiewu/notebook738aa0f5d2

yxk9810 commented Apr 29, 2024

Running command:
! CUDA_VISIBLE_DEVICES=0,1 python -m tevatron.driver.train \
  --output_dir model_msmarco \
  --model_name_or_path bert-base-uncased \
  --save_steps 1000 \
  --train_dir /kaggle/working/train_tevatron_100.json \
  --fp16 \
  --per_device_train_batch_size 2 \
  --train_n_passages 8 \
  --learning_rate 5e-6 \
  --q_max_len 64 \
  --p_max_len 460 \
  --num_train_epochs 3 \
  --logging_steps 500 \
  --overwrite_output_dir
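Note that this launches a single `python` process with two visible GPUs, which is where the cuda:0/cuda:1 mismatch can arise. Under DDP, a launcher (e.g. `torchrun`) starts one process per GPU and each process pins itself to its own device, so weights and batches never straddle two GPUs. A sketch of that per-process pattern (an assumption about the intended setup, not the Tevatron driver's actual code):

```python
import os
import torch

# DDP launchers such as torchrun set LOCAL_RANK per process; each process
# then uses only its own GPU, so all tensors share one device.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
else:
    device = torch.device("cpu")  # fallback so the sketch runs anywhere
print(device)
```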

MXueguang (Contributor) commented:

Will adding `--negatives_x_device` help?
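For context, `--negatives_x_device` shares in-batch negatives across GPUs: conceptually, each rank all-gathers passage embeddings from every other rank and trains against the combined pool. A rough sketch of that idea using a single-process "gloo" group so it runs on CPU (the real Tevatron implementation also handles gradients, which a plain `all_gather` does not):

```python
import os
import torch
import torch.distributed as dist

# Single-process process group purely for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

local_emb = torch.randn(4, 8)  # this rank's passage embeddings
# Gather every rank's embeddings to build the cross-device negative pool.
gathered = [torch.zeros_like(local_emb) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_emb)
all_emb = torch.cat(gathered, dim=0)

dist.destroy_process_group()
```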
