
[Help]: MultiGPU TTA training #159

Open
fpicetti opened this issue Mar 15, 2024 · 2 comments

@fpicetti

Problem Overview

I'd like to train a TTA model (following your examples) in a multi-GPU environment (4× A100), but I have been unsuccessful so far.

Steps Taken

  1. Prepared the AudioCaps dataset.
  2. Fixed typos in the base config files in both the autoencoderkl and audioldm folders.
  3. Updated the JSON and sh files according to my dataset.
  4. Launched the training script with `sh egs/tta/autoencoderkl/run_train.sh`, with no further modification -> it works on the first GPU, as expected.
  5. Modified run_train.sh#L19 to `export CUDA_VISIBLE_DEVICES="0,1,2,3"` -> it works on the first GPU only.
  6. Keeping the change from step 5, also changed exp_config.json#L38 to `"ddp": true` -> fails, asking for all the distribution parameters (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT); see the launch sketch after this list.
  7. Reverted the changes from steps 5 and 6 and tried to leverage accelerate: ran `accelerate config` to set up single-node multi-GPU training. `accelerate test` works fine on all 4 GPUs.
  8. Removed run_train.sh#L19 and changed run_train.sh#L22 to `accelerate launch "${work_dir}"/bins/tta/train_tta.py` -> I see 4 processes on the first GPU, and then it goes OOM.
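
For reference, the distribution parameters requested in step 6 are the standard torch.distributed environment variables, which a launcher normally sets per process. A sketch of the two launch variants I would expect to work (assuming train_tta.py can consume a distributed launch, which step 8 suggests it currently cannot):

```sh
# Option A: torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
# automatically for each of the 4 processes on this single node.
export CUDA_VISIBLE_DEVICES="0,1,2,3"
torchrun --standalone --nproc_per_node=4 "${work_dir}"/bins/tta/train_tta.py

# Option B: pin the accelerate topology on the command line instead of
# relying on the saved `accelerate config`.
accelerate launch --multi_gpu --num_processes=4 \
  "${work_dir}"/bins/tta/train_tta.py
```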

Expected Outcome

A single training job running across all 4 GPUs.

Environment Information

  • Operating System: Ubuntu 22.04 LTS
  • Python Version: Python 3.9.15 (conda env created following your instructions)
  • Driver & CUDA Version: CUDA 12.2, Driver 535.86.10
  • Error Messages and Logs: See Steps Taken above
@fpicetti
Author

@HeCheng0625 any update on this?

@HeCheng0625
Collaborator

Hi, TTA currently supports only single-GPU training. You can refer to the other tasks to implement multi-GPU training based on accelerate. PRs are welcome.
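
For anyone picking this up: the accelerate pattern used by the other tasks looks roughly like the sketch below. This is a minimal, self-contained stand-in, not Amphion's actual trainer code; the model, optimizer, and data are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Reads the topology set up by `accelerate config` (or the launch flags)
# and picks the right device for this process.
accelerator = Accelerator()

# Placeholder model and data standing in for the TTA model and dataset.
model = nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(256, 64), torch.randn(256, 64))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

# prepare() moves each object to this process's device and wraps the model
# in DDP, so every GPU gets its own shard of each batch.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward(); syncs gradients
    optimizer.step()
```

Run with, e.g., `accelerate launch --multi_gpu --num_processes=4 train.py`; `accelerator.prepare` is what keeps all four processes from piling onto GPU 0.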
