how to use all_gather in training loop? #2504

Open
kkarrancsu opened this issue Mar 7, 2022 · 11 comments
@kkarrancsu commented Mar 7, 2022

I have defined my train_step in exactly the same way as in the cifar10 example. Is it possible to gather all of the predictions before computing the loss? I haven't seen examples of this pattern in the ignite examples (maybe I'm missing it?), but for my application it is better to compute the loss after aggregating the forward passes and targets run on multiple GPUs. This only matters when using DistributedDataParallel, since DataParallel automatically aggregates the outputs.

I see the idist.all_gather() function, but I'm unclear on how to use it in a training loop.

@sdesrozis (Contributor)
@kkarrancsu Thanks for your question.

In general, idist.all_gather() can be used as long as the call is made collectively by all the processes. Therefore, you can use this method to gather the predictions in your training loop.

I can provide an example asap and maybe update the doc accordingly.

However, I'm not completely sure I understand your question. If you want to compute predictions in DDP, gather them on one process, and then backpropagate from that process, it won't work. You can check the internal design: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
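
For illustration, here is a rough, untested sketch of a train_step that makes such a collective call (model, optimizer and criterion are assumed to be set up elsewhere; the gathered predictions are detached, so they are only useful for metrics or logging, not for backpropagation):

import ignite.distributed as idist

def train_step(engine, batch):
    model.train()
    x, y = batch
    y_pred = model(x)                # local chunk: [batch_size/ngpu, ...]
    loss = criterion(y_pred, y)      # loss on the local chunk only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Collective call: every process must reach this line.
    # Concatenates per-process predictions along dim 0 -> [batch_size, ...]
    all_preds = idist.all_gather(y_pred.detach())
    return loss.item(), all_preds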

@kkarrancsu (Author) commented Mar 7, 2022

@sdesrozis Thanks for your quick reply! Sorry if my initial question was unclear. As an example:

import torch.nn as nn

m = model()            # model is some nn.Module constructor
m_dp = nn.DataParallel(m)
m_ddp = nn.parallel.DistributedDataParallel(m)

x = ...                # input batch, shape [batch_size, ...]
y_dp = m_dp(x)         # output shape [batch_size, ...]
y_ddp = m_ddp(x)       # per-process output, shape [batch_size/ngpu, ...]

I'd like to gather all of the y_ddp outputs from all GPUs before computing the loss. I hope that makes the question clearer?

@sdesrozis (Contributor)
Thanks for the clarification. Would you like to use the loss as a metric, or do you want to call loss.backward()?

@kkarrancsu (Author)
I'd like to call loss.backward()

@sdesrozis (Contributor) commented Mar 8, 2022

Ok, so I think it won't work even if you gather the predictions. The gather operation is not an autodiff function, so it will cut the computation graph. The forward pass also creates internal states (activations) that won't be gathered.

Although I'm pretty sure this has been discussed on the PyTorch forum. Maybe I'm wrong though, and I would be interested in any discussion of this topic.

EDIT: see https://amsword.medium.com/gradient-backpropagation-with-torch-distributed-all-gather-9f3941a381f8
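
The trick described there, roughly, is to re-insert the local tensor (which still carries its autograd history) into the gathered list, so gradients flow back through this rank's own chunk while the chunks from other ranks are treated as constants. An untested sketch of the idea:

import torch
import torch.distributed as dist

def all_gather_with_grad(t):
    # torch.distributed.all_gather returns tensors detached from the autograd
    # graph, so we put the local tensor (still attached) back at this rank's
    # position before concatenating.
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(t) for _ in range(world_size)]
    dist.all_gather(gathered, t)
    gathered[dist.get_rank()] = t
    return torch.cat(gathered, dim=0)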

@kkarrancsu (Author)
@sdesrozis Thanks - I will investigate based on your link and report back.

@sdesrozis (Contributor)

> @sdesrozis Thanks - I will investigate based on your link and report back.

Good! Although I'm a bit doubtful about the link… I'd be interested in your feedback.

@vfdev-5 (Collaborator) commented Mar 8, 2022

@kkarrancsu can you provide a bit more detail on what exactly you would like to do?
In DDP, the data is distributed across N processes and the model is replicated. In the forward pass, each process obtains predictions y_preds = m_ddp(x) on its own data chunk; using a loss function and loss.backward(), each process computes gradients, which are then all-reduced (averaged) across processes and applied to the model internally by the PyTorch DDP wrapper.

As for distributed autograd, you can also check: https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework
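
Schematically, a standard per-process DDP step looks like this (rough sketch; model, criterion, optimizer, local_rank and the local batch x, y are assumed to be set up on each rank):

from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model, device_ids=[local_rank])

y_preds = ddp_model(x)           # forward on this rank's data chunk only
loss = criterion(y_preds, y)     # local loss
optimizer.zero_grad()
loss.backward()                  # DDP all-reduces (averages) gradients here
optimizer.step()                 # every rank applies the same update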

@kkarrancsu (Author)
Hi @vfdev-5, sure.

We are using the Supervised Contrastive (SupCon) loss to train an embedding. In Eq. 2 of the paper, the loss depends on the number of samples (positive and negative) used to compute it.

My colleague suggested that it is better to compute the loss over all examples (the entire global batch), rather than over batch/ngpu samples (which is what happens when using DDP and computing the loss locally on each GPU). This is because the denominator in SupConLoss sums over the negative samples, so aggregating all of the negatives across GPUs first gives a more accurate loss.
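
To make the intent concrete, roughly what I have in mind (untested sketch; all_gather_with_grad is the gradient-preserving gather sketched above, labels is this rank's label chunk, and sup_con_loss stands in for our loss implementation):

import ignite.distributed as idist

feats = m_ddp(x)                         # local embeddings: [batch/ngpu, dim]
all_feats = all_gather_with_grad(feats)  # global batch: [batch, dim]
all_labels = idist.all_gather(labels)    # labels carry no gradient
loss = sup_con_loss(all_feats, all_labels)
loss.backward()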

@sdesrozis (Contributor)
Ok, I understand. You should have a look at a distributed implementation of SimCLR. See for instance:

https://github.com/Spijkervet/SimCLR/blob/cd85c4366d2e6ac1b0a16798b76ac0a2c8a94e58/simclr/modules/nt_xent.py#L7

This might give you some inspiration.

@lxysl commented Sep 2, 2023

> Ok, I understand. You should have a look at a distributed implementation of SimCLR. See for instance:
>
> https://github.com/Spijkervet/SimCLR/blob/cd85c4366d2e6ac1b0a16798b76ac0a2c8a94e58/simclr/modules/nt_xent.py#L7
>
> This might give you some inspiration.

This code is not quite correct. Please check this issue: Spijkervet/SimCLR#30 and my PR: Spijkervet/SimCLR#46.
