
DDP at eval stage + DDP metrics #2518

Draft
wants to merge 37 commits into base: develop

Conversation

@Adel-Moumen (Collaborator) commented Apr 22, 2024

What does this PR do?

Goal: Add support for distributed metrics inside SpeechBrain.

Why: We currently run training on a distributed system, which significantly reduces SpeechBrain's training time. However, because distributed metrics are not supported, the evaluation stages (validation & testing) run on a single GPU, even when many more are available. This makes SpeechBrain uncompetitive with other toolkits, because evaluation inference is notably slow. Using all available GPUs during testing could reduce the testing time roughly by a factor of N (minus a small constant due to communication overhead), where N is the number of GPUs. For example, training a Conformer transducer model typically requires 4 GPUs, while only 1 GPU is used during the evaluation stages, which can stretch them to hours. Leveraging all 4 GPUs can offer a considerable speedup. This pull request (PR) is also part of our effort to make SpeechBrain an attractive toolkit for "large-scale" training, which will need to run evaluation on large amounts of data.

How: To achieve this, I introduced various functions to gather tensors/objects across processes. Please refer to the file distributed_metrics.py for a better understanding of the available functions.
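For illustration, here is a minimal sketch of what such a gather helper can look like on top of torch.distributed (the function name and signature are hypothetical, not necessarily what distributed_metrics.py exposes):

```python
import torch.distributed as dist


def gather_for_metrics(predictions, targets):
    """Gather per-rank lists of predictions/targets onto every rank.

    No-op when torch.distributed is not initialized (single-process run).
    """
    if not (dist.is_available() and dist.is_initialized()):
        return predictions, targets

    world_size = dist.get_world_size()

    # all_gather_object handles arbitrary picklable objects (e.g. lists of
    # decoded token sequences), at the cost of a CPU round-trip.
    gathered_preds = [None] * world_size
    gathered_tgts = [None] * world_size
    dist.all_gather_object(gathered_preds, predictions)
    dist.all_gather_object(gathered_tgts, targets)

    # Flatten the per-rank lists back into single global lists.
    predictions = [p for per_rank in gathered_preds for p in per_rank]
    targets = [t for per_rank in gathered_tgts for t in per_rank]
    return predictions, targets
```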

Issues: Since this PR can significantly enhance the inference speed of SpeechBrain, it's crucial to introduce it with caution. I've written extensive unit tests covering 100% of the metrics in SpeechBrain, all of which are functional in Distributed Data Parallel (DDP) mode. Additionally, I manually executed some Automatic Speech Recognition (ASR) recipes and verified the accuracy of the results. Based on my tests, we observe consistent results, indicating that this PR should not compromise our outcomes. However, I've only manually covered ASR tests, and I believe it's prudent to dedicate some time to investigating Text-to-Speech (TTS) models as well.

Results

Note: we are seeing some very small variations sometimes, e.g. 1.91 CER vs 1.92. I believe this is due to how CUDA works: since we are now using multiple GPUs, they might not use the same CUDA kernels for some operations.

DDP No Sync Metrics

w2v2 + CTC + librispeech + 2 GPUs (no sync) + test BS = 1

100%|██████████| 2620/2620 [02:33<00:00, 17.03it/s]
speechbrain.utils.train_logger - Epoch loaded: 1 - test loss: 3.16e-02, test CER: 5.10e-01, test WER: 1.92
100%|██████████| 2939/2939 [02:39<00:00, 18.42it/s]
speechbrain.utils.train_logger - Epoch loaded: 1 - test loss: 7.46e-02, test CER: 1.28, test WER: 3.97

Whisper Transformer + librispeech + 2 GPUs (no sync) + test BS = 1

100%|██████████| 2703/2703 [07:45<00:00,  5.80it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.53, test CER: 1.91, test WER: 5.07
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
100%|██████████| 2620/2620 [07:33<00:00,  5.78it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.55, test CER: 1.86, test WER: 4.99
100%|██████████| 2939/2939 [07:10<00:00,  6.82it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.91, test CER: 5.78, test WER: 12.20

DDP Sync Metrics

w2v2 + CTC + librispeech + 2 GPUs + test BS = 1

100%|██████████| 1310/1310 [01:48<00:00, 12.02it/s]
speechbrain.utils.train_logger - Epoch loaded: 1 - test loss: 3.16e-02, test CER: 5.10e-01, test WER: 1.92
100%|██████████| 1470/1470 [01:45<00:00, 13.95it/s]
speechbrain.utils.train_logger - Epoch loaded: 1 - test loss: 7.46e-02, test CER: 1.28, test WER: 3.97

Whisper Transformer + librispeech + 2 GPUs (sync) + test BS = 1

100%|██████████| 1352/1352 [04:48<00:00,  4.68it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.53, test CER: 1.92, test WER: 5.08
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
100%|██████████| 1310/1310 [04:17<00:00,  5.09it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.55, test CER: 1.86, test WER: 4.99
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
100%|██████████| 1470/1470 [04:22<00:00,  5.60it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.91, test CER: 5.79, test WER: 12.20

To Do

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@pplantinga (Collaborator) left a comment

Supporting eval/validation with DDP is an important step, thanks for tackling it @Adel-Moumen. Tests for DDP and making metrics work across processes are crucial. However, I think there may be some unnecessary complexity here. For example, I'm not convinced the DistributedState singleton adds anything on top of what torch.distributed already does.

Some options I can think of to address complexity:

  • add accelerate as a dependency and use their version of these functions, so we don't have to maintain it.
  • add torch-metrics as a dependency, https://github.com/Lightning-AI/torchmetrics, they have most of the metrics with distributed support already, again we don't have to maintain it.
  • Force a relatively simple and consistent format for the metrics -- then we don't need much complexity to sync other than torch.distributed

I know we're trying to avoid adding dependencies but these ones are relatively small and well worth it.
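For context, the torchmetrics route would look roughly like this (a sketch, not code from this PR; torchmetrics performs the cross-process reduction internally when compute() is called):

```python
import torchmetrics

# Each rank keeps its own metric instance and updates it with its data shard.
wer = torchmetrics.text.WordErrorRate()
wer.update(["the cat sat"], ["the cat sat on the mat"])

# compute() all-gathers the internal states (edit counts / totals) across
# ranks before reducing, so every process ends up with the same global WER.
global_wer = wer.compute()
print(global_wer)
```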

@Adel-Moumen (Collaborator, Author)

However I think there may be some unnecessary complexity here. For example, I'm not convinced the DistributedState singleton adds anything on top of what torch.distributed already does.

Hi @pplantinga, indeed, it does not at the current stage. As I said in my replies, this PoC aimed at showing one way of handling DDP information in the future. Depending on the backend, you may need to follow different steps to achieve the same thing, and having a convenient dataclass would have provided a single interface that hides all the backend-specific logic under the hood.

With respect to your comments, I decided to remove this class as it can be fully replaced by torch.distributed functions.
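For reference, the state such a class would have carried is already exposed by torch.distributed (a sketch of the equivalent queries, not code from this PR):

```python
import torch.distributed as dist

# Equivalent queries without a dedicated singleton:
if dist.is_available() and dist.is_initialized():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    backend = dist.get_backend()  # e.g. "nccl" or "gloo"
else:
    rank, world_size, backend = 0, 1, None
```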

Some options I can think of to address complexity

Regarding the complexity, I have a hard time seeing where it comes from. I am slightly biased since I am the one who spent the most time understanding how to synchronize metrics. The only complexity comes from the different backends we can encounter in SpeechBrain (e.g. gloo vs nccl), which come with different utility functions, but I don't think you can simplify that unless we drop gloo support :) (which I strongly oppose).
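To make the backend point concrete, here is a rough illustration (an assumption-level sketch, not the PR's actual helpers): nccl can only communicate CUDA tensors, while gloo gathers CPU tensors, so a generic gather has to move data to the right device first.

```python
import torch
import torch.distributed as dist


def all_gather_tensor(tensor):
    """All-gather a tensor, moving it to the device the backend supports.

    Assumes the tensor has the same shape (and at least one dimension)
    on every rank.
    """
    if dist.get_backend() == "nccl":
        device = torch.device("cuda", torch.cuda.current_device())
    else:  # gloo gathers CPU tensors
        device = torch.device("cpu")
    tensor = tensor.to(device)

    gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return torch.cat(gathered, dim=0)
```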

add accelerate as a dependency and use their version of these functions, so we don't have to maintain it.
add torch-metrics as a dependency, Lightning-AI/torchmetrics, they have most of the metrics with distributed support already, again we don't have to maintain it.
Force a relatively simple and consistent format for the metrics -- then we don't need much complexity to sync other than torch.distributed

Again, I am still biased since I did the implementation, but I really don't think we need a third-party package to handle the metrics. As I demonstrated, synchronizing can be summarized in < 250 lines of code (excluding docstrings/tests). If you install accelerate just for the metrics, you pull in thousands of lines of code that are mostly useless to us because they support TPUs, Megatron, FSDP, etc. I like the idea of having a SpeechBrain Trainer that does not rely on many toolkits and that can be easily hacked.

@Adel-Moumen (Collaborator, Author)

Please jump into the discussion (one more time please, it's an important topic :p) @TParcollet @asumagic :) (note: I made some changes based on your comments [thanks again, @TParcollet @asumagic @pplantinga!]; feel free to get back to me if you think I should now move on to more tests / reporting more results, etc.) :)

@TParcollet (Collaborator)

@pplantinga I do not like the idea of adding accelerate and torch-metrics as dependencies. I do not trust the stability of either dependency, and I hate depending on other things when it's for something easy, like here. However, I agree that the Singleton class seems to be too much IMHO. I also agree on the homogeneity of the metrics, but maybe this could be another PR?

@lucadellalib (Collaborator)

The small variations in CER and WER might be related to the fact that when the dataset is not divisible by the number of processes, a few samples are duplicated: pytorch/pytorch#25162

@Adel-Moumen (Collaborator, Author)

The small variations in CER and WER might be related to the fact that when the dataset is not divisible by the number of processes, a few samples are duplicated : pytorch/pytorch#25162

Indeed! Good point.
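To make the duplication concrete, a standalone sketch (not code from this PR): with the default DistributedSampler, a dataset whose size is not divisible by the world size gets padded with repeated indices, so a few samples are counted twice in the gathered metrics.

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

# 5 samples split across 2 ranks: the sampler pads to 6 indices.
dataset = TensorDataset(torch.arange(5))
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    print(rank, list(sampler))
# rank 0 -> [0, 2, 4]
# rank 1 -> [1, 3, 0]   (index 0 is duplicated to even out the shards)
```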
