gradient compression updates #395

Open
wants to merge 308 commits into
base: master

jasperzhong
Contributor

This PR includes:

  1. Support PyTorch and Apex.
  2. Reduce compression overhead for fp16 gradients by converting them to fp32 before compression and back to fp16 when they are returned to the application.
  3. Improve the random-k algorithm by using the same seed on all workers and servers. Our results show that the improved algorithm can train ResNet-50 on ImageNet without accuracy loss.
  4. Balance server workload by using the original size as the workload estimate on servers.
  5. Add vanilla error-feedback, which does not use the learning rate for correction.
  6. Add sparse error-feedback for the random-k algorithm.
  7. Update docs.
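
For readers unfamiliar with items 3 and 5, here is a minimal, illustrative sketch in plain Python. It is not the code in this PR. The function names, the `VanillaErrorFeedback` class, and the way `seed` and `step` are mixed into one RNG seed are all assumptions made for the example. It shows why a shared seed lets the sender transmit only the selected values, and how a residual can be carried over without any learning-rate scaling.

```python
import random


def randomk_indices(size, k, seed, step):
    """Pick k distinct indices. Any party that shares (seed, step) gets the same ones.

    Mixing seed and step this way is an assumption of this sketch.
    """
    rng = random.Random(seed * 1_000_003 + step)
    return sorted(rng.sample(range(size), k))


def compress(grad, k, seed, step):
    """Keep only the values at the randomly selected indices."""
    return [grad[i] for i in randomk_indices(len(grad), k, seed, step)]


def decompress(values, size, k, seed, step):
    """Rebuild a dense vector; the indices are regenerated, not transmitted."""
    dense = [0.0] * size
    for i, v in zip(randomk_indices(size, k, seed, step), values):
        dense[i] = v
    return dense


class VanillaErrorFeedback:
    """Adds the previous compression error to the next gradient (no learning rate)."""

    def __init__(self, size):
        self.residual = [0.0] * size

    def step(self, grad, k, seed, step):
        corrected = [g + r for g, r in zip(grad, self.residual)]
        values = compress(corrected, k, seed, step)
        sent = decompress(values, len(grad), k, seed, step)
        # Whatever was dropped by compression is remembered for the next step.
        self.residual = [c - s for c, s in zip(corrected, sent)]
        return values


if __name__ == "__main__":
    grad = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0]
    ef = VanillaErrorFeedback(len(grad))
    payload = ef.step(grad, k=2, seed=42, step=0)
    print(payload, ef.residual)
```

In this sketch the receiver calls `decompress` with the same `seed` and `step` as the sender, so the payload is just `k` values rather than `k` index/value pairs.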

Bug fixes:

  1. Fix the MXNet extension linking against PyTorch's libraries (setup.py).

This PR does not cover passing the learning rate to remote servers, nor does it address the hang issue in MXNet. The PR is ready to merge.

cc: @eric-haibin-lin @szhengac

Collaborator

@eric-haibin-lin eric-haibin-lin left a comment

This PR is too large. Is it possible to break it down into smaller independent ones?

auto out = reinterpret_cast<unsigned short*>(dst);
len = len / (size_t)2;

#if __AVX__ && __F16C__
Collaborator

Why are these being removed?

Contributor Author

We use the version in half.h instead.

docs/best-practice.md (outdated, resolved)

2 participants