gradient compression updates #395

jasperzhong · 2021-05-10T08:59:01Z

This PR includes:

support PyTorch and Apex.
reduce compression overhead for fp16 gradients by converting them to fp32 before compressing and converting the result back to fp16 when it is returned to the application.
improve the random-k algorithm by using the same seed for all workers and servers. Our results show that the improved algorithm can successfully train ResNet-50 on ImageNet without accuracy loss.
achieve workload balance by using each tensor's original size as the workload estimate on servers.
add vanilla error-feedback, which does not use learning rate for correction.
add sparse error-feedback for the random-k algorithm.
update docs.
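The shared-seed random-k item in the list above can be sketched as follows. This is a minimal illustration, not the BytePS implementation: the function names, the `seed + step` derivation, and sampling without replacement are all assumptions made for the example. In this sketch every worker and server derives the same indices from the same seed, so the server can sum the selected values position by position.

```python
import random


def select_indices(size, k, seed, step):
    """Pick k distinct positions; identical on every node given the same seed and step."""
    # Hypothetical way to vary the selection per iteration while staying in sync.
    rng = random.Random(seed + step)
    return rng.sample(range(size), k)


def randomk_compress(grad, k, seed, step):
    """Keep only the values at the shared random positions."""
    return [grad[i] for i in select_indices(len(grad), k, seed, step)]


def randomk_decompress(values, size, k, seed, step):
    """Scatter the k values back into a dense, zero-filled vector."""
    dense = [0.0] * size
    for i, v in zip(select_indices(size, k, seed, step), values):
        dense[i] = v
    return dense


def server_sum(compressed_grads):
    """Sum compressed gradients element-wise; valid because indices match across workers."""
    return [sum(vals) for vals in zip(*compressed_grads)]
```

If two workers call `randomk_compress` with the same `seed` and `step`, `server_sum` adds values that belong to the same gradient positions, and `randomk_decompress` restores them to those positions.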

Bug fixes:

fix MXNet's extension linking against PyTorch's libraries (setup.py).

The PR does not cover passing the learning rate to remote servers, nor does it address the hang issue in MXNet. The PR is ready for merge.
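The two error-feedback items in the list above can be sketched like this. The code follows the textbook form of error feedback (add the residual to the gradient, compress, keep what was not transmitted) with no learning-rate term. Reading "sparse error-feedback" as a residual update that only touches the transmitted indices is an assumption, and the `topk` compressor and all function names are hypothetical choices made so the example is deterministic.

```python
def topk(vec, k):
    """Hypothetical compressor: indices and values of the k largest-magnitude entries."""
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return idx, [vec[i] for i in idx]


def ef_step_dense(grad, residual, k):
    """Vanilla error feedback: residual = corrected gradient minus its decompressed form."""
    corrected = [g + e for g, e in zip(grad, residual)]
    idx, vals = topk(corrected, k)
    decompressed = [0.0] * len(grad)
    for i, v in zip(idx, vals):
        decompressed[i] = v
    new_residual = [c - d for c, d in zip(corrected, decompressed)]
    return idx, vals, new_residual


def ef_step_sparse(grad, residual, k):
    """Same result, but only the transmitted positions of the residual are modified."""
    corrected = [g + e for g, e in zip(grad, residual)]
    idx, vals = topk(corrected, k)
    new_residual = list(corrected)
    for i in idx:
        new_residual[i] = 0.0  # transmitted entries leave no error behind
    return idx, vals, new_residual
```

For a sparsifying compressor the two functions return the same residual. The sparse form avoids building a dense decompressed vector.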

cc: @eric-haibin-lin @szhengac

eric-haibin-lin

This PR is too large. Is it possible to break it down into smaller independent ones?

eric-haibin-lin · 2021-05-22T00:21:38Z

byteps/common/cpu_reducer.cc

- auto out = reinterpret_cast<unsigned short*>(dst);
- len = len / (size_t)2;
-
-#if __AVX__ && __F16C__


Why are these lines being removed?

jasperzhong · 2021-05-22 (edited)

We use the version under half.h instead.
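The fp16 handling described in the PR summary (widen to fp32, compress, narrow back to fp16) can be illustrated with a small sketch. It uses Python's `struct` module rather than the `half.h` routines mentioned in this thread, and `compress_fn` is a hypothetical stand-in for any compressor that works on fp32 values.

```python
import struct


def fp16_bytes_to_fp32(buf):
    """Decode little-endian fp16 bytes into a list of fp32-representable floats."""
    n = len(buf) // 2
    halves = struct.unpack(f"<{n}e", buf)
    # Round-trip through the 'f' format so every value is exactly an fp32 value.
    return list(struct.unpack(f"<{n}f", struct.pack(f"<{n}f", *halves)))


def fp32_to_fp16_bytes(values):
    """Encode floats as little-endian fp16 bytes."""
    return struct.pack(f"<{len(values)}e", *values)


def compress_fp16(buf, compress_fn):
    """Widen fp16 input to fp32, run the compressor, and narrow the result to fp16."""
    widened = fp16_bytes_to_fp32(buf)
    result = compress_fn(widened)
    return fp32_to_fp16_bytes(result)
```

Every fp16 value is exactly representable in fp32, so the widening step loses no information. Only the final narrowing can round.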

docs/best-practice.md

jasperzhong added 30 commits August 30, 2020 17:13

missing semicolon...

62966d7

fix typos

e6a521a

missing semicolon...

9260d32

add default

bb93d64

fix bugs

6346c5c

fix missing s

6491eef

use nag for dithering

0bfd387

update train script

8e0c85b

update ef for dithering

152ce49

fix typo

6a2c22a

missing scope

7fc3d54

fix compile error

91d6c2e

debug

4be835d

debug

14139a8

update

69b4117

add omp for randomk

a02607d

update

5251789

fix typo

f5f5d58

missing copy

1d3d8e3

update

60b31be

update

caadd7f

randomk with replacement

c3d8341

fix missing header

7c65805

fix typo

5001140

use unordered_map

94b7ba3

update

c3e8e26

fix small bug

2a39ff4

worker decompress use buf as input

14a9d10

use __restrict__

9a34a97

fix missing =default

affde37

jasperzhong added 28 commits January 5, 2021 14:28

add

37dc7c0

update

4240ebd

update

c0ac6aa

disable rdma

64775f2

try to balance workload

9c8d8da

update

a7d8e91

update

684e61a

update

d28bf62

update

a0bcb30

restore

1096f84

add MSHADOW_USE_F16C=1

8104719

test

352cf47

test

6add67d

update

30005b5

add check

7a95a07

remove

914b5cf

update

4c849a9

update

f131f14

update

bdb9d41

update

aa809a7

debug

b741a9d

test topk

a21d46f

remove unnecessary tests

20e27cc

Merge branch 'master' into apex

9c95ba6

remove

178232b

update

7fe8f43

update docs

65db55b

update docs

0821569

eric-haibin-lin requested changes May 22, 2021

revert

78eb37e
