
DDP at eval stage + DDP metrics #2518

Draft
wants to merge 37 commits into base: develop

Conversation

@Adel-Moumen (Collaborator) commented Apr 22, 2024

What does this PR do?

Goal: Add support for distributed metrics inside SpeechBrain.

Why: We currently run training on a distributed system, which significantly reduces SpeechBrain's training time. However, because distributed metrics are not supported, the evaluation stages (validation & testing) run on a single GPU, even when many more are available. This makes SpeechBrain uncompetitive with other toolkits, because evaluation inference is notably slow. Using all available GPUs during testing could reduce the testing time roughly by a factor of N (minus a small constant due to communication overhead), where N is the number of GPUs. For example, training a Conformer transducer model typically requires 4 GPUs, while only 1 GPU is used during the evaluation stages, which can stretch them to hours. Leveraging all 4 GPUs can offer a considerable speedup. This pull request (PR) is also part of our effort to make SpeechBrain an attractive toolkit for "large-scale" training, which will need to run evaluation on large amounts of data.

How: To achieve this, I introduced various functions to gather tensors/objects across processes. Please refer to the file distributed_metrics.py for a better understanding of the available functions.
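For illustration, here is a minimal sketch of what such a gather helper can look like on top of torch.distributed (the function name and signature are hypothetical, not necessarily what distributed_metrics.py exposes):

```python
import torch.distributed as dist


def gather_for_metrics(predictions, targets):
    """Gather per-rank lists of predictions/targets onto every rank.

    No-op when torch.distributed is not initialized (single-process run).
    """
    if not (dist.is_available() and dist.is_initialized()):
        return predictions, targets

    world_size = dist.get_world_size()

    # all_gather_object handles arbitrary picklable objects (e.g. lists of
    # decoded token sequences), at the cost of a CPU round-trip.
    gathered_preds = [None] * world_size
    gathered_tgts = [None] * world_size
    dist.all_gather_object(gathered_preds, predictions)
    dist.all_gather_object(gathered_tgts, targets)

    # Flatten the per-rank lists back into single global lists.
    predictions = [p for per_rank in gathered_preds for p in per_rank]
    targets = [t for per_rank in gathered_tgts for t in per_rank]
    return predictions, targets
```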

Issues: Since this PR can significantly enhance the inference speed of SpeechBrain, it's crucial to introduce it with caution. I've written extensive unit tests covering 100% of the metrics in SpeechBrain, all of which are functional in Distributed Data Parallel (DDP) mode. Additionally, I manually executed some Automatic Speech Recognition (ASR) recipes and verified the accuracy of the results. Based on my tests, we observe consistent results, indicating that this PR should not compromise our outcomes. However, I've only manually covered ASR tests, and I believe it's prudent to dedicate some time to investigating Text-to-Speech (TTS) models as well.

Results

Note: we are seeing some very small variations sometimes, e.g. 1.91 CER vs 1.92. I believe this is due to how CUDA works: since we are now using multiple GPUs, they might not use the same CUDA kernels for some operations.

DDP No Sync Metrics

w2v2 + CTC + librispeech + 2 GPUs (no sync) + test BS = 1

100%|██████████| 2620/2620 [02:33<00:00, 17.03it/s]
speechbrain.utils.train_logger - Epoch loaded: 1 - test loss: 3.16e-02, test CER: 5.10e-01, test WER: 1.92
100%|██████████| 2939/2939 [02:39<00:00, 18.42it/s]
speechbrain.utils.train_logger - Epoch loaded: 1 - test loss: 7.46e-02, test CER: 1.28, test WER: 3.97

Whisper Transformer + librispeech + 2 GPUs (no sync) + test BS = 1

100%|██████████| 2703/2703 [07:45<00:00,  5.80it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.53, test CER: 1.91, test WER: 5.07
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
100%|██████████| 2620/2620 [07:33<00:00,  5.78it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.55, test CER: 1.86, test WER: 4.99
100%|██████████| 2939/2939 [07:10<00:00,  6.82it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.91, test CER: 5.78, test WER: 12.20

DDP Sync Metrics

w2v2 + CTC + librispeech + 2 GPUs + test BS = 1

100%|██████████| 1310/1310 [01:48<00:00, 12.02it/s]
speechbrain.utils.train_logger - Epoch loaded: 1 - test loss: 3.16e-02, test CER: 5.10e-01, test WER: 1.92
100%|██████████| 1470/1470 [01:45<00:00, 13.95it/s]
speechbrain.utils.train_logger - Epoch loaded: 1 - test loss: 7.46e-02, test CER: 1.28, test WER: 3.97

Whisper Transformer + librispeech + 2 GPUs (sync) + test BS = 1

100%|██████████| 1352/1352 [04:48<00:00,  4.68it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.53, test CER: 1.92, test WER: 5.08
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
100%|██████████| 1310/1310 [04:17<00:00,  5.09it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.55, test CER: 1.86, test WER: 4.99
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
100%|██████████| 1470/1470 [04:22<00:00,  5.60it/s]
speechbrain.utils.train_logger - Epoch loaded: 0 - test loss: 1.91, test CER: 5.79, test WER: 12.20

To Do

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@pplantinga (Collaborator) left a comment

Supporting eval/validation with DDP is an important step, thanks for tackling it @Adel-Moumen. Tests for DDP and making metrics work across processes are crucial. However, I think there may be some unnecessary complexity here. For example, I'm not convinced the DistributedState singleton adds anything on top of what torch.distributed already does.

Some options I can think of to address complexity:

  • add accelerate as a dependency and use their version of these functions, so we don't have to maintain it.
  • add torch-metrics as a dependency, https://github.com/Lightning-AI/torchmetrics, they have most of the metrics with distributed support already, again we don't have to maintain it.
  • Force a relatively simple and consistent format for the metrics -- then we don't need much complexity to sync other than torch.distributed

I know we're trying to avoid adding dependencies but these ones are relatively small and well worth it.
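For context, the torchmetrics route would look roughly like this (a sketch, not code from this PR; torchmetrics performs the cross-process reduction internally when compute() is called):

```python
import torchmetrics

# Each rank keeps its own metric instance and updates it with its data shard.
wer = torchmetrics.text.WordErrorRate()
wer.update(["the cat sat"], ["the cat sat on the mat"])

# compute() all-gathers the internal states (edit counts / totals) across
# ranks before reducing, so every process ends up with the same global WER.
global_wer = wer.compute()
print(global_wer)
```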

@Adel-Moumen (Collaborator, Author)

However I think there may be some unnecessary complexity here. For example, I'm not convinced the DistributedState singleton adds anything on top of what torch.distributed already does.

Hi @pplantinga, indeed, it does not at the current stage. As I said in my replies, this PoC aimed at showing one way of handling DDP information in the future. Depending on the backend, you may need to follow different steps to achieve the same thing, and having a convenient dataclass would have provided a single interface that hides all the backend-specific logic under the hood.

With respect to your comments, I decided to remove this class as it can be fully replaced by torch.distributed functions.
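For reference, the state such a class would have carried is already exposed by torch.distributed (a sketch of the equivalent queries, not code from this PR):

```python
import torch.distributed as dist

# Equivalent queries without a dedicated singleton:
if dist.is_available() and dist.is_initialized():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    backend = dist.get_backend()  # e.g. "nccl" or "gloo"
else:
    rank, world_size, backend = 0, 1, None
```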

Some options I can think of to address complexity

Regarding the complexity, I have a hard time seeing where it comes from. I am slightly biased since I am the one who spent the most time understanding how to synchronize metrics. The only complexity comes from the different backends we can encounter in SpeechBrain (e.g. gloo vs nccl), which come with different utility functions, but I don't think you can simplify that unless we drop gloo support :) (which I strongly oppose).
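To make the backend point concrete, here is a rough illustration (an assumption-level sketch, not the PR's actual helpers): nccl can only communicate CUDA tensors, while gloo gathers CPU tensors, so a generic gather has to move data to the right device first.

```python
import torch
import torch.distributed as dist


def all_gather_tensor(tensor):
    """All-gather a tensor, moving it to the device the backend supports.

    Assumes the tensor has the same shape (and at least one dimension)
    on every rank.
    """
    if dist.get_backend() == "nccl":
        device = torch.device("cuda", torch.cuda.current_device())
    else:  # gloo gathers CPU tensors
        device = torch.device("cpu")
    tensor = tensor.to(device)

    gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return torch.cat(gathered, dim=0)
```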

add accelerate as a dependency and use their version of these functions, so we don't have to maintain it.
add torch-metrics as a dependency, Lightning-AI/torchmetrics, they have most of the metrics with distributed support already, again we don't have to maintain it.
Force a relatively simple and consistent format for the metrics -- then we don't need much complexity to sync other than torch.distributed

Again, I am still biased since I did the implementation, but I really don't think we need a third-party package to handle the metrics. As I demonstrated, synchronizing can be summarized in < 250 lines of code (excluding docstrings/tests). If you install accelerate just for the metrics, you pull in thousands of lines of code that are mostly useless to us because they support TPUs, Megatron, FSDP, etc. I like the idea of having a SpeechBrain Trainer that does not rely on many toolkits and that can be easily hacked.

@Adel-Moumen (Collaborator, Author)

Please jump into the discussion (one more time please, it's an important topic :p) @TParcollet @asumagic :) (note: I made some changes based on your comments [thanks again, @TParcollet @asumagic @pplantinga!]; feel free to get back to me if you think I should now move on to more tests / reporting more results, etc.) :)

@TParcollet (Collaborator)

@pplantinga I do not like the idea of adding accelerate and torch-metrics as dependencies. I do not trust the stability of either dependency, and I hate depending on other things when it's for something easy, like here. However, I agree that the Singleton class seems to be too much IMHO. I also agree on the homogeneity of the metrics, but maybe this could be another PR?

@lucadellalib (Collaborator)

The small variations in CER and WER might be related to the fact that when the dataset is not divisible by the number of processes, a few samples are duplicated: pytorch/pytorch#25162

@Adel-Moumen (Collaborator, Author)

The small variations in CER and WER might be related to the fact that when the dataset is not divisible by the number of processes, a few samples are duplicated : pytorch/pytorch#25162

Indeed! Good point.
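To make the duplication concrete, a standalone sketch (not code from this PR): with the default DistributedSampler, a dataset whose size is not divisible by the world size gets padded with repeated indices, so a few samples are counted twice in the gathered metrics.

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

# 5 samples split across 2 ranks: the sampler pads to 6 indices.
dataset = TensorDataset(torch.arange(5))
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    print(rank, list(sampler))
# rank 0 -> [0, 2, 4]
# rank 1 -> [1, 3, 0]   (index 0 is duplicated to even out the shards)
```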
