
[Kept for Feedback] Multi-GPU & New models #1

Open
TiankaiHang opened this issue Mar 24, 2021 · 21 comments
Labels
fixed (fixed but issue remains open), question (Further information is requested)

Comments

@TiankaiHang

Thanks for your nice work and congratulations on your good results!

I have several questions.

  • Will your model be extended to support distributed data parallel (DDP) training in the future?
  • Why don't you try DeepLabV3+? Would it lead to a better result?

Best.

@voldemortX
Owner

voldemortX commented Mar 24, 2021

@TiankaiHang Hi! Thanks for your interest in our work!

> Will your model be extended to support distributed data parallel (DDP) training in the future?

I'd love to do that, but I have never had access to a machine with more than 1 GPU in my entire life... So if you or anyone else could make a pull request to support this, that would be really nice.

> Why don't you try DeepLabV3+? Would it lead to a better result?

I believe a better model would lead to better results. But training V3/V3+ requires at least double the compute budget, which is why I did not do them. Also, since the V2 results are still important for comparison against prior arts, choosing V3/V3+ back then would have meant at least 3x the compute budget. I just do not have the cards.

Some additional info: On ResNet backbones, my experience tells me that V3+ could be worse than V3. For background: pytorch/vision#2689

@voldemortX voldemortX added the question, good first issue, and help wanted labels Mar 24, 2021
@TiankaiHang
Author

Thanks for your kind reply~ :-)

Best.


@voldemortX
Owner

voldemortX commented Mar 25, 2021

You're welcome.
I'll pin this issue as a call for help on multi-GPU support and new models.

@voldemortX voldemortX pinned this issue Mar 25, 2021
@voldemortX voldemortX changed the title from "Several Questions ...." to "[Help wanted] Multi-GPU & New models" Mar 25, 2021
@jinhuan-hit

jinhuan-hit commented Mar 25, 2021

I will update it for multi-GPU after I reproduce the results. Maybe next week; I don't have enough time right now.

@voldemortX
Owner

> I will update it for multi-GPU after I reproduce the results. Maybe next week; I don't have enough time right now.

That's great to hear! Go for it!

@lorenmt

lorenmt commented May 18, 2021

I would suggest checking out https://github.com/huggingface/accelerate, which should make it relatively easy to deploy any model in a distributed setting.

@voldemortX
Owner

@lorenmt Good point! Thanks a lot!
@jinhuan-hit If you're still working on this, Accelerate seems a good place to start. And it's perfectly OK if you don't want to send a PR just now. I'll update for multi-GPU myself when I get more bandwidth and cards for testing; it should be soon (when I get my internship).

@jinhuan-hit

> @lorenmt Good point! Thanks a lot!
> @jinhuan-hit If you're still working on this, Accelerate seems a good place to start. And it's perfectly OK if you don't want to send a PR just now. I'll update for multi-GPU myself when I get more bandwidth and cards for testing; it should be soon (when I get my internship).

Yeah, thank you for sharing. I am still working on this project. I'm sorry to say that I haven't updated it for multi-GPU until now. Something changed: I reproduced this project for another job, so the multi-GPU code doesn't match anymore. I'm trying to add multi-GPU support to this project following Accelerate today. Unfortunately, I have not solved the bug below.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

@voldemortX
Owner

voldemortX commented Jun 1, 2021

@jinhuan-hit Thanks a lot! I still don't have the hardware to debug multi-GPU for now, but hopefully I will this month or the next.
The problem seems related to the network design. I don't remember having additional (unused) parameters, though; I'll check that later tonight and get back to you.

@jinhuan-hit

> @jinhuan-hit Thanks a lot! I still don't have the hardware to debug multi-GPU for now, but hopefully I will this month or the next.
> The problem seems related to the network design. I don't remember having additional (unused) parameters, though; I'll check that later tonight and get back to you.

I have already checked the network but found nothing. Looking forward to hearing your good results!

@voldemortX
Owner

> I have already checked the network but found nothing.

Yes, I think you're right. I also did not find redundant layers.

I'll also try to investigate this when I get the cards.

@TiankaiHang
Author

> Yeah, thank you for sharing. I am still working on this project. I'm sorry to say that I haven't updated it for multi-GPU until now. Something changed: I reproduced this project for another job, so the multi-GPU code doesn't match anymore. I'm trying to add multi-GPU support to this project following Accelerate today. Unfortunately, I have not solved the bug below.
>
> RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. [...]

Have you tried setting find_unused_parameters=True in your code?
Maybe you will get more detailed error information.
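
For reference, a minimal sketch of what that flag looks like when wrapping a model with plain PyTorch DistributedDataParallel; the model and local_rank below are placeholders rather than this repo's actual code:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes the process group is already initialized by the launcher, e.g.:
# torch.distributed.init_process_group(backend='nccl')
local_rank = 0  # placeholder; normally read from the launcher's environment
model = torch.nn.Linear(10, 2).cuda(local_rank)  # placeholder model
model = DDP(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    # tolerate parameters that receive no gradient in a given backward pass
    find_unused_parameters=True,
)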

@jinhuan-hit

jinhuan-hit commented Jun 2, 2021

> Have you tried setting find_unused_parameters=True in your code?
> Maybe you will get more detailed error information.

Yeah, you are right! I wrapped the network with DDP using find_unused_parameters=True myself, but it didn't work. However, when I added find_unused_parameters=True to the accelerator's prepare function in the Accelerate package, the job runs well. Unfortunately, I have not verified the result yet. The package versions I used: torch==1.4.0, torchvision==0.5.0, accelerate==0.1.0.

def prepare_model(self, model):
    if self.device_placement:
        model = model.to(self.device)
    if self.distributed_type == DistributedType.MULTI_GPU:
        model = torch.nn.parallel.DistributedDataParallel(
            model,
            device_ids=[self.local_process_index],
            output_device=self.local_process_index,
            find_unused_parameters=True,  # added: tolerate parameters unused in the forward pass
        )
    if self.native_amp:
        model.forward = torch.cuda.amp.autocast()(model.forward)
    return model
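
As a side note, rather than editing Accelerate's own prepare_model, newer Accelerate releases (later than the 0.1.0 used here, if I remember the API correctly) let you pass the same flag through a kwargs handler; a sketch under that assumption:

from accelerate import Accelerator, DistributedDataParallelKwargs

# forward DDP constructor arguments through the Accelerator instead of patching its source
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])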

Also, I changed main.py following https://github.com/huggingface/accelerate:

1. Change
    device = torch.device('cpu')
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
   to
    # modify to accelerator
    accelerator = Accelerator()
    device = accelerator.device
2. Add
    # modify to accelerator
    net, optimizer = accelerator.prepare(net, optimizer)
3. Change
    scaled_loss.backward()
    loss.backward()
   to
    accelerator.backward(scaled_loss)
    accelerator.backward(loss)

Then it should work. Best wishes.
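
Putting the three changes together, the core of the training loop would look roughly like the sketch below; net, optimizer, loader, and criterion are placeholder stand-ins for this repo's actual objects, and plain fp32 is assumed:

import torch
from accelerate import Accelerator

# placeholder model, data, and loss; the real script builds its own network and loaders
net = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(3)]
criterion = torch.nn.CrossEntropyLoss()

accelerator = Accelerator()                            # change 1: device handling via Accelerate
device = accelerator.device

net, optimizer = accelerator.prepare(net, optimizer)   # change 2: wrap model and optimizer

for images, labels in loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(net(images), labels)
    accelerator.backward(loss)                         # change 3: replaces loss.backward()
    optimizer.step()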

@voldemortX
Owner

@jinhuan-hit If the results are similar to the single-card results under mixed precision, maybe you'd like to send a pull request for this?

@jinhuan-hit

> @jinhuan-hit If the results are similar to the single-card results under mixed precision, maybe you'd like to send a pull request for this?

Yeah, I'm checking the results now. If OK, I'd like to send a PR.

@voldemortX
Owner

Thanks a lot! If using Accelerate requires a PyTorch version update that involves code changes, please make the version update and the multi-GPU support two separate PRs, if possible (one PR is also fine).

@jinhuan-hit

> Thanks a lot! If using Accelerate requires a PyTorch version update that involves code changes, please make the version update and the multi-GPU support two separate PRs, if possible (one PR is also fine).

I use PyTorch 1.4.0 because of Accelerate. I'm now training with fp32 and it works well without any code modification.

@jinhuan-hit

I have checked the result and it looks normal!

@voldemortX
Owner

> I have checked the result and it looks normal!

Great! I'll formulate a draft PR for comments.

@voldemortX
Owner

Thanks for everyone's help! DDP is now supported. Please report bugs if you've found any.

@voldemortX voldemortX added the fixed label and removed the help wanted and good first issue labels Aug 23, 2021
@voldemortX voldemortX unpinned this issue Aug 23, 2021
@voldemortX voldemortX changed the title from "[Help wanted] Multi-GPU & New models" to "[Kept for Feedback] Multi-GPU & New models" Aug 23, 2021