
[BUG]: GPU memory leak during GAN-based vocoders training #181

Closed
vanIvan opened this issue Apr 11, 2024 · 2 comments

vanIvan commented Apr 11, 2024

Describe the bug

I'm trying to train GAN-based vocoders such as HiFi-GAN and APNet on a medium-sized dataset (~half of LibriTTS clean-100 + clean-300). However, training fails with a CUDA out-of-memory error at the end of the first epoch. I was able to work around it by lowering the batch size, but at the beginning of training not even half of the available GPU memory is used; usage only grows somewhere closer to the end of the epoch.

How To Reproduce

  1. I'm using a recent master-branch commit: 5cb75d8d605ef12c90c64ba2e04919f4d5d834a1
  2. 1 x L4 GPU (20 GB) and the following base config
  3. My dataset gives 2375 steps per epoch with batch size = 32
  4. The base config for the vocoder looks like this:
{
  "base_config": "config/vocoder.json",
  "model_type": "GANVocoder",
  "dataset": [
    "ljspeech",
  ],
  "dataset_path": {
    "ljspeech": "/mnt/data/libritts_subset_vocoder/train/",
  },
  "log_dir": "/mnt/experiments/ckpts/vocoder",
  "preprocess": {
    "extract_mel": true,
    "extract_audio": true,
    "extract_pitch": false,
    "extract_uv": false,
    "pitch_extractor": "parselmouth",

    "use_mel": true,
    "use_frame_pitch": false,
    "use_uv": false,
    "use_audio": true,

    "processed_dir": "/mnt/data/processed_vocoder_data/",
      
    "n_mel": 80,
    "sample_rate": 16000
  },
  "model": {
    "discriminators": [
      "msd",
      "mpd",
      "msstftd",
    ],
    "mpd": {
      "mpd_reshapes": [
        2,
        3,
        5,
        7,
        11
      ],
      "use_spectral_norm": false,
      "discriminator_channel_mult_factor": 1
    },
    "mrd": {
      "resolutions": [[1024, 120, 600], [2048, 240, 1200], [512, 50, 240]],
      "use_spectral_norm": false,
      "discriminator_channel_mult_factor": 1,
      "mrd_override": false
    },
    "msstftd": {
        "filters": 32
    },
    "mssbcqtd": {
      hop_lengths: [512, 256, 256],
      filters: 32,
      max_filters: 1024,
      filters_scale: 1,
      dilations: [1, 2, 4],
      in_channels: 1,
      out_channels: 1,
      n_octaves: [9, 9, 9],
      bins_per_octaves: [24, 36, 48]
    },
  },
  "train": {
    "batch_size": 32,
    "max_epoch": 250,
    "save_checkpoint_stride": [10],
    "adamw": {
        "lr": 2.0e-4,
        "adam_b1": 0.8,
        "adam_b2": 0.99
    },
    "exponential_lr": {
        "lr_decay": 0.998
    }
  }
}
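
To see where in the epoch the usage actually grows, GPU memory can be logged around each training step. The sketch below is a minimal, hypothetical example that only uses the standard torch.cuda memory counters; the loop and the train_step name are placeholders, not Amphion's trainer API.

import torch

def log_gpu_memory(step, device=0):
    # Currently allocated vs. peak allocated memory on the device, in GiB.
    allocated = torch.cuda.memory_allocated(device) / 1024**3
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"step {step}: allocated={allocated:.2f} GiB, peak={peak:.2f} GiB")

# Hypothetical training loop; train_step() stands in for the real trainer code.
# for step, batch in enumerate(dataloader):
#     train_step(batch)
#     if step % 100 == 0:
#         log_gpu_memory(step)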

Expected behavior

GPU memory usage is expected to stay roughly constant during training.

Environment Information

  • Operating System: SMP Debian 5.10.209-2
  • Python Version: Python 3.9.15
  • Driver & CUDA Version: 535.86.10
vanIvan added the bug label (Something isn't working) on Apr 11, 2024
@VocodexElysium
Collaborator

Hi! Thanks for your report!
Our default training recipe is tailored for an RTX 4090 GPU with 24 GB of memory; if your GPU has less memory than that, you need to use a smaller batch size, such as 16.
The GPU memory increase near the end of some epochs is not a bug: to validate the training process, we run inference on some long audio samples during training for human listening, which takes more GPU memory in those specific epochs.
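
If the extra memory taken by those in-training listening samples is ever too much for a smaller GPU, a common framework-agnostic mitigation is to run that inference without autograd state and move the generated audio off the GPU immediately. A minimal sketch, where vocoder_model and mel are hypothetical placeholders rather than Amphion internals:

import torch

@torch.inference_mode()  # no autograd graph is kept for the listening samples
def synthesize_validation_audio(vocoder_model, mel):
    # vocoder_model and mel stand in for the generator and a mel-spectrogram batch.
    audio = vocoder_model(mel)
    return audio.cpu()  # move the samples off the GPU right away

# Optionally release cached blocks afterwards so reported usage drops again:
# torch.cuda.empty_cache()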


vanIvan commented May 20, 2024

Got it, thanks for commenting on this issue. I've reduced the batch size even more and now training no longer fails.

@vanIvan vanIvan closed this as completed May 20, 2024