
Model performance degrades when moved to Multi-GPU #29

Open
ereday opened this issue Nov 8, 2019 · 5 comments
ereday commented Nov 8, 2019

Hi,

When I run your code on multiple GPUs, performance degrades severely compared to the single-GPU version. To make the code multi-GPU compatible, I only added two lines of code (see the sketch after this list):

  • model = torch.nn.DataParallel(model) between your model = model_class.from_pretrained(args['model_name']) and model.to(device) calls

  • loss = loss.mean() after the loss = outputs[0] line in the train function

Do you have any idea how I can get the same (or similar) performance in the multi-GPU setting?
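For clarity, here is a minimal sketch of those two additions. The surrounding names (model_class, args, device, inputs) are taken from the original training script as I understand it and may differ slightly in your copy:

```python
import torch.nn as nn

# Model setup (model_class, args, and device come from the original script):
model = model_class.from_pretrained(args['model_name'])
model = nn.DataParallel(model)  # added: replicate the model across the available GPUs
model.to(device)

# Inside the train function, where `inputs` is the current batch:
outputs = model(**inputs)
loss = outputs[0]
loss = loss.mean()  # added: DataParallel returns one loss per GPU, so average them
```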

These are the results I got with these two settings:

| Metric        | Multi-GPU          | Single-GPU          |
|---------------|--------------------|---------------------|
| evaluate_loss | 0.3928874781464829 | 0.39542119007776766 |
| fn            | 116                | 82                  |
| fp            | 81                 | 126                 |
| mcc           | 0.5114751200090137 | 0.5465463104769824  |
| tn            | 1291               | 1246                |
| tp            | 136                | 170                 |

Although the average loss values are similar, there are large differences in the other metrics.

@ThilinaRajapakse
Owner

Those changes should be sufficient to enable multi-GPU training, in my experience. Is there any other difference (e.g. batch size) between the two runs?

ereday commented Nov 8, 2019

Nope, I did not change any of the variables in the args dictionary.

@ThilinaRajapakse
Owner

This is probably a silly question, but did you try this multiple times and receive the same results?

ereday commented Nov 8, 2019

Yes, I ran the code with the same configuration multiple times. There is no difference across runs.

@ThilinaRajapakse
Owner

Sorry, I am not sure why this is happening. I recommend trying the Simple Transformers library, as it supports multi-GPU training by default and I have used multi-GPU training with it without any performance degradation.
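For reference, a minimal sketch of a binary classification run with Simple Transformers. The data is placeholder, and the exact args keys (such as n_gpu) may vary across library versions, so please double-check against the documentation:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder training data: a DataFrame with "text" and "labels" columns.
train_df = pd.DataFrame(
    [["this is a positive example", 1], ["this is a negative example", 0]],
    columns=["text", "labels"],
)

# n_gpu tells the library how many GPUs to use internally (assumed arg name).
model = ClassificationModel(
    "roberta",
    "roberta-base",
    args={"n_gpu": 2, "num_train_epochs": 1, "overwrite_output_dir": True},
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(train_df)
print(result)  # for binary classification this typically includes mcc, tp, tn, fp, fn
```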
