
DDP training on multiple GPUs: Expected all tensors to be on the same device, but found at least two devices #117

yxk9810 opened this issue Apr 29, 2024 · 2 comments

yxk9810 commented Apr 29, 2024

File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
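This error means the `input_ids` batch is sitting on a different GPU than the embedding weight when `torch.embedding` runs. A minimal sketch of the usual remedy (an illustration, not Tevatron's actual code; `move_to_model_device` is a hypothetical helper): move every tensor in the batch onto whatever device the model's parameters live on before the forward pass.

```python
import torch
import torch.nn as nn

def move_to_model_device(batch, model):
    """Move every tensor in a batch dict onto the model's device so the
    embedding lookup sees weights and indices on the same device."""
    device = next(model.parameters()).device
    return {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch.items()}

# Demo on CPU; on a multi-GPU box the mismatch would be cuda:0 vs cuda:1.
model = nn.Embedding(30522, 16)  # BERT-sized vocab, toy embedding dim
batch = {"input_ids": torch.tensor([[101, 2023, 102]])}
batch = move_to_model_device(batch, model)
out = model(batch["input_ids"])  # no device-mismatch RuntimeError now
```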

pytorch==1.12.0
transformers==4.20.0

Running on 2× T4 GPUs on Kaggle: https://www.kaggle.com/code/jackiewu/notebook738aa0f5d2

yxk9810 commented Apr 29, 2024

Running command:
! CUDA_VISIBLE_DEVICES=0,1 python -m tevatron.driver.train \
  --output_dir model_msmarco \
  --model_name_or_path bert-base-uncased \
  --save_steps 1000 \
  --train_dir /kaggle/working/train_tevatron_100.json \
  --fp16 \
  --per_device_train_batch_size 2 \
  --train_n_passages 8 \
  --learning_rate 5e-6 \
  --q_max_len 64 \
  --p_max_len 460 \
  --num_train_epochs 3 \
  --logging_steps 500 \
  --overwrite_output_dir
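Note that this launches a single `python` process with two visible GPUs, which is where the cuda:0/cuda:1 mismatch can arise. Under DDP, a launcher (e.g. `torchrun`) starts one process per GPU and each process pins itself to its own device, so weights and batches never straddle two GPUs. A sketch of that per-process pattern (an assumption about the intended setup, not the Tevatron driver's actual code):

```python
import os
import torch

# DDP launchers such as torchrun set LOCAL_RANK per process; each process
# then uses only its own GPU, so all tensors share one device.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
else:
    device = torch.device("cpu")  # fallback so the sketch runs anywhere
print(device)
```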

MXueguang (Contributor) commented:

Will adding `--negatives_x_device` help?
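For context, `--negatives_x_device` shares in-batch negatives across GPUs: conceptually, each rank all-gathers passage embeddings from every other rank and trains against the combined pool. A rough sketch of that idea using a single-process "gloo" group so it runs on CPU (the real Tevatron implementation also handles gradients, which a plain `all_gather` does not):

```python
import os
import torch
import torch.distributed as dist

# Single-process process group purely for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

local_emb = torch.randn(4, 8)  # this rank's passage embeddings
# Gather every rank's embeddings to build the cross-device negative pool.
gathered = [torch.zeros_like(local_emb) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_emb)
all_emb = torch.cat(gathered, dim=0)

dist.destroy_process_group()
```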
