[Help]: MultiGPU TTA training #159
Comments
@HeCheng0625 any update on this?
Hi, TTA currently only supports single-GPU training. You can refer to the other tasks to implement multi-GPU training based on accelerate. You are welcome to submit a PR.
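For reference, a minimal sketch of the accelerate-based multi-GPU training pattern being suggested; this follows the general Hugging Face accelerate recipe, not Amphion's actual TTA trainer, and the model, optimizer, and dataloader below are placeholders:

```python
# A minimal sketch of single-node multi-GPU training with Hugging Face accelerate.
# The model, optimizer, and dataloader passed in are placeholders, not Amphion's TTA code.
from accelerate import Accelerator

def train(model, optimizer, dataloader, num_epochs):
    accelerator = Accelerator()  # picks up the settings written by `accelerate config`
    # prepare() shards the dataloader across processes, wraps the model for DDP,
    # and moves everything onto the device assigned to the current process.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            optimizer.zero_grad()
            loss = model(batch)          # placeholder: assume the model returns a scalar loss
            accelerator.backward(loss)   # replaces loss.backward() under accelerate
            optimizer.step()
```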
Problem Overview
I'd like to train a TTA model (following your examples) in a multi-GPU environment (4× A100), but I have been unsuccessful so far.
Steps Taken
- AudioCaps dataset, `autoencoderkl` and `audioldm` folders.
- `sh egs/tta/autoencoderkl/run_train.sh`, no further modification -> it works on the first GPU, as expected.
- Set `"ddp": true` -> fails, it asks for all the distribution parameters (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT).
- accelerate: ran `accelerate config` to set up single-node multi-GPU training. `accelerate test` works fine on the 4 GPUs.
- `accelerate launch "${work_dir}"/bins/tta/train_tta.py` -> I see 4 processes on the first GPU, then it goes OOM (see the debug sketch after this list).
Expected Outcome
A single training job distributed across the 4 GPUs.
Environment Information
See Steps Taken above.