
[BUG]: GPU memory leak during GAN-based vocoders training #181

Closed
vanIvan opened this issue Apr 11, 2024 · 2 comments

vanIvan commented Apr 11, 2024

Describe the bug

I'm trying to train GAN-based vocoders such as HiFi-GAN and APNet on a medium-sized dataset (~half of LibriTTS clean-100 + clean-300). However, training fails with a CUDA out-of-memory error at the end of the first epoch. I was able to work around it by lowering the batch size, but at the beginning of training not even half of the available GPU memory is used; usage only grows somewhere closer to the end of the epoch.

How To Reproduce

  1. I'm using a recent master-branch commit: 5cb75d8d605ef12c90c64ba2e04919f4d5d834a1
  2. 1 x L4 GPU (20 GB) and the following base config
  3. My dataset gives 2375 steps per epoch with batch size = 32
  4. The base config for the vocoder looks like this:
{
  "base_config": "config/vocoder.json",
  "model_type": "GANVocoder",
  "dataset": [
    "ljspeech",
  ],
  "dataset_path": {
    "ljspeech": "/mnt/data/libritts_subset_vocoder/train/",
  },
  "log_dir": "/mnt/experiments/ckpts/vocoder",
  "preprocess": {
    "extract_mel": true,
    "extract_audio": true,
    "extract_pitch": false,
    "extract_uv": false,
    "pitch_extractor": "parselmouth",

    "use_mel": true,
    "use_frame_pitch": false,
    "use_uv": false,
    "use_audio": true,

    "processed_dir": "/mnt/data/processed_vocoder_data/",
      
    "n_mel": 80,
    "sample_rate": 16000
  },
  "model": {
    "discriminators": [
      "msd",
      "mpd",
      "msstftd",
    ],
    "mpd": {
      "mpd_reshapes": [
        2,
        3,
        5,
        7,
        11
      ],
      "use_spectral_norm": false,
      "discriminator_channel_mult_factor": 1
    },
    "mrd": {
      "resolutions": [[1024, 120, 600], [2048, 240, 1200], [512, 50, 240]],
      "use_spectral_norm": false,
      "discriminator_channel_mult_factor": 1,
      "mrd_override": false
    },
    "msstftd": {
        "filters": 32
    },
    "mssbcqtd": {
      hop_lengths: [512, 256, 256],
      filters: 32,
      max_filters: 1024,
      filters_scale: 1,
      dilations: [1, 2, 4],
      in_channels: 1,
      out_channels: 1,
      n_octaves: [9, 9, 9],
      bins_per_octaves: [24, 36, 48]
    },
  },
  "train": {
    "batch_size": 32,
    "max_epoch": 250,
    "save_checkpoint_stride": [10],
    "adamw": {
        "lr": 2.0e-4,
        "adam_b1": 0.8,
        "adam_b2": 0.99
    },
    "exponential_lr": {
        "lr_decay": 0.998
    }
  }
}
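
To see where in the epoch the usage actually grows, GPU memory can be logged around each training step. The sketch below is a minimal, hypothetical example that only uses the standard torch.cuda memory counters; the loop and the train_step name are placeholders, not Amphion's trainer API.

import torch

def log_gpu_memory(step, device=0):
    # Currently allocated vs. peak allocated memory on the device, in GiB.
    allocated = torch.cuda.memory_allocated(device) / 1024**3
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"step {step}: allocated={allocated:.2f} GiB, peak={peak:.2f} GiB")

# Hypothetical training loop; train_step() stands in for the real trainer code.
# for step, batch in enumerate(dataloader):
#     train_step(batch)
#     if step % 100 == 0:
#         log_gpu_memory(step)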

Expected behavior

GPU memory usage is expected to stay roughly constant during training.

Environment Information

  • Operating System: SMP Debian 5.10.209-2
  • Python Version: Python 3.9.15
  • Driver & CUDA Version: 535.86.10
vanIvan added the bug label (Something isn't working) on Apr 11, 2024
@VocodexElysium
Collaborator

Hi! Thanks for your report!
Our default training recipe is tailored for an RTX 4090 GPU with 24 GB of memory; if your GPU has less memory than that, you need to use a smaller batch size, such as 16.
The GPU memory increase near the end of some epochs is not a bug: to validate the training process, we run inference on some long audio samples during training for human listening, which takes more GPU memory in those specific epochs.
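
If the extra memory taken by those in-training listening samples is ever too much for a smaller GPU, a common framework-agnostic mitigation is to run that inference without autograd state and move the generated audio off the GPU immediately. A minimal sketch, where vocoder_model and mel are hypothetical placeholders rather than Amphion internals:

import torch

@torch.inference_mode()  # no autograd graph is kept for the listening samples
def synthesize_validation_audio(vocoder_model, mel):
    # vocoder_model and mel stand in for the generator and a mel-spectrogram batch.
    audio = vocoder_model(mel)
    return audio.cpu()  # move the samples off the GPU right away

# Optionally release cached blocks afterwards so reported usage drops again:
# torch.cuda.empty_cache()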


vanIvan commented May 20, 2024

Got it, thanks for commenting on this issue. I've reduced the batch size even more and now training no longer fails.

@vanIvan vanIvan closed this as completed May 20, 2024