
Stuck after distributed to GPU #37

Open
wmhcqw opened this issue Jul 7, 2021 · 2 comments

wmhcqw commented Jul 7, 2021

First, thank you for your awesome work.

I've run into a problem when running your code.

For multiple queries (number of unique qids > 1), the code gets stuck at this point:

...
[INFO] 2021-07-07 09:25:46 - loaded dataset with 2 queries
[INFO] 2021-07-07 09:25:46 - longest query had 171 documents
[INFO] 2021-07-07 09:25:46 - val DS shape: [2, 171, 175]
[INFO] 2021-07-07 09:25:46 - Will pad to the longest slate: 171
[INFO] 2021-07-07 09:25:46 - total batch size is 128
[INFO] 2021-07-07 09:25:46 - Model training will execute on cuda
[INFO] 2021-07-07 09:25:46 - Model training will be distributed to 8 GPUs.
[INFO] 2021-07-07 09:25:48 - Model has 36868 trainable parameters
[INFO] 2021-07-07 09:25:48 - Current learning rate: 0.001

I waited for about three hours with no progress (both on my own dataset and on your sample generated dummy dataset), and I can't even stop the process with Ctrl+C. When I check the GPUs with nvidia-smi, their usage looks normal.

Then I tried a single-query dataset (with all qid:0), and the code runs fine.

So, what might cause this problem?

System: Ubuntu 20.04
GPUs: 8× A100, 40 GB
Config: local_config.json, with only the data root path changed


wmhcqw commented Jul 7, 2021

UPDATE: This is an issue with multiple GPUs.

SOLUTION: I've tried multiple queries on a single GPU (using CUDA_VISIBLE_DEVICES=0), and the code runs fine.

You can close this issue if the problem has been located.
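For reference, a minimal sketch of the workaround above: CUDA_VISIBLE_DEVICES hides all but one GPU from the process, so frameworks like PyTorch see a single device and the multi-GPU DataParallel path is never taken. The training command shown in the comment is a hypothetical entry point, not necessarily the project's actual script name.

```shell
# Expose only GPU 0 to this shell and its children; with a single visible
# device, the multi-GPU code path (DataParallel) is effectively bypassed.
export CUDA_VISIBLE_DEVICES=0

# Any command launched from here inherits the restriction, e.g. (hypothetical):
#   python main.py --config local_config.json
echo "visible devices: $CUDA_VISIBLE_DEVICES"
```

Setting the variable inline for a single run (`CUDA_VISIBLE_DEVICES=0 python ...`) works the same way without affecting the rest of the shell session.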

sadaharu-inugami (Contributor) commented

Thank you for submitting the issue. We'll keep it open for now, as we're going to investigate the DataParallel issues in the coming week. We have DistributedDataParallel on the roadmap (it is much more efficient than standard DataParallel), but that's still a somewhat distant prospect.
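For context, moving to DistributedDataParallel typically means launching one worker process per GPU with PyTorch's distributed launcher, instead of a single process that fans batches out to replicas. A sketch of such a launch, assuming 8 GPUs; the script name and its `--config` flag are placeholders, not this project's actual interface:

```shell
# One process per GPU via PyTorch's distributed launcher; each process
# would wrap its model replica in DistributedDataParallel.
# "main.py --config local_config.json" is a hypothetical entry point.
python -m torch.distributed.launch --nproc_per_node=8 \
    main.py --config local_config.json
```

This avoids the single-process GIL and scatter/gather overhead that make DataParallel slower, and sidesteps some of its known hang-prone code paths.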
