Model performance degrades when moved to Multi-GPU #29
Comments
Those changes should be sufficient to enable multi-GPU training in my experience. Is there any other difference (e.g. batch size) between the two runs?

Nope, I did not change any of the variables in

This is probably a silly question, but did you try this multiple times and receive the same results?

Yes, I ran the code with the same configuration multiple times. There is no difference across different runs.

Sorry, I am not sure why this is happening. I recommend that you try the Simple Transformers library, as it supports multi-GPU training by default, and I have used multi-GPU training with that library without any performance degradation.
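For reference, a minimal sketch of that route (the model type/name, GPU count, and data files below are illustrative placeholders, not values from this issue):

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Simple Transformers expects DataFrames with "text" and "labels" columns.
train_df = pd.read_csv("train.csv")  # placeholder path
eval_df = pd.read_csv("dev.csv")     # placeholder path

model = ClassificationModel(
    "bert",
    "bert-base-cased",
    args={"n_gpu": 2},  # multi-GPU is handled internally; no manual DataParallel needed
)
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)  # for binary classification this includes mcc, tp, tn, fp, fn
```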
Hi,

When I run your code on multi-GPU, performance degrades severely compared to the single-GPU version. To make the code multi-GPU compatible, I've added only 2 lines of code:

model = torch.nn.DataParallel(model)

between your model = model_class.from_pretrained(args['model_name']) and model.to(device) calls, and

loss = loss.mean()

after the loss = outputs[0] line in the train function.
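In context, the two changes sit roughly like this (a sketch: model_class, args, device, train_dataloader, and optimizer come from the original training script, and the dataloader is assumed to yield dicts of tensors):

```python
import torch.nn as nn

model = model_class.from_pretrained(args['model_name'])
model = nn.DataParallel(model)  # change 1: replicate the model across all visible GPUs
model.to(device)

# simplified stand-in for the repo's train function
for batch in train_dataloader:
    inputs = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**inputs)
    loss = outputs[0]
    loss = loss.mean()  # change 2: DataParallel returns one loss per GPU, so average them
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```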
Do you have any idea how I can get the same (or similar) performance in the multi-GPU setting? These are the results I got with these two settings:
With Multi-GPU training:
evaluate_loss = 0.3928874781464829
fn = 116
fp = 81
mcc = 0.5114751200090137
tn = 1291
tp = 136
With Single-GPU training:
evaluate_loss = 0.39542119007776766
fn = 82
fp = 126
mcc = 0.5465463104769824
tn = 1246
tp = 170
Although the average loss values are similar, there are big differences in the other metrics.