SLM Adversarial Training did not start when finetuning #227

Open · godspirit00 opened this issue Apr 10, 2024 · 10 comments
@godspirit00

I tried fine-tuning on a small dataset with 2 speakers. I set epochs=25, diff_epoch=8, joint_epoch=15.
Style Diffusion training started as expected, but SLM Adversarial Training never started at any point during fine-tuning.

[screenshot of the training log]

My config is

log_dir: "Models/xxx"
save_freq: 1
log_interval: 10
device: "cuda"
epochs: 25 # number of finetuning epoch (1 hour of data)
batch_size: 2
max_len: 400 # maximum number of frames
pretrained_model: "Models/LibriTTS/epochs_2nd_00020.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "Data/xxx__train.txt"
  val_data: "Data/xxx__val.txt"
  root_path: ""
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true

  dim_in: 64 
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size
  
  dropout: 0.2

  # config for decoder
  decoder: 
      type: 'hifigan' # either hifigan or istftnet
      resblock_kernel_sizes: [3,7,11]
      upsample_rates :  [10,5,3,2]
      upsample_initial_channel: 512
      resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
      upsample_kernel_sizes: [20,10,6,4]
      
  # speech language model config
  slm:
      model: 'microsoft/wavlm-base-plus'
      sr: 16000 # sampling rate of SLM
      hidden: 768 # hidden size of SLM
      nlayers: 13 # number of layers of SLM
      initial_channel: 64 # initial channels of SLM discriminator head
  
  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0
  
loss_params:
    lambda_mel: 5. # mel reconstruction loss
    lambda_gen: 1. # generator loss
    lambda_slm: 1. # slm feature matching loss
    
    lambda_mono: 1. # monotonic alignment loss (TMA)
    lambda_s2s: 1. # sequence-to-sequence loss (TMA)

    lambda_F0: 1. # F0 reconstruction loss
    lambda_norm: 1. # norm reconstruction loss
    lambda_dur: 1. # duration loss
    lambda_ce: 20. # duration predictor probability output CE loss
    lambda_sty: 1. # style reconstruction loss
    lambda_diff: 1. # score matching loss
    
    diff_epoch: 8 # style diffusion starting epoch
    joint_epoch: 15 # joint training starting epoch

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.0001 # learning rate for acoustic modules
  
slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 10 # update the discriminator every this many generator updates
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling
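
For reference, with these settings I would expect the training phases to switch on like this (a standalone sketch of the epoch gating implied by the config comments, not the actual training-loop code):

# Standalone sketch of the expected epoch gating (illustrative only).
epochs, diff_epoch, joint_epoch = 25, 8, 15

for epoch in range(epochs):
    style_diffusion = epoch >= diff_epoch   # expected to start at epoch 8 (it does)
    slm_adversarial = epoch >= joint_epoch  # expected to start at epoch 15 (it never does)
    print(f"epoch {epoch:2d}  diffusion={style_diffusion}  slm_adv={slm_adversarial}")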

What have I missed? Thanks!

@DogeLord081

Same issue

@meng2468

You seem to be missing the configuration options that control when the second stage of training starts.

See lines 6 and 7 in the LibriTTS config file:

epochs_1st: 50 # number of epochs for first stage training (pre-training)
epochs_2nd: 30 # number of epochs for second stage training (joint training)

You should be able to kick off second-stage training by loading your current model checkpoint and setting epochs_1st to 0.
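
Concretely, the addition would look something like this in the fine-tuning config (values here are examples only; adjust them to your run):

epochs_1st: 0   # skip first-stage (pre-training) epochs when starting from a checkpoint
epochs_2nd: 25  # number of epochs for second-stage (joint) training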

@DogeLord081

> You seem to be missing the configuration options that control when the second stage of training starts. [...]

Thanks, do I also need the first_stage_path:?

@DogeLord081

> You seem to be missing the configuration options that control when the second stage of training starts. [...]

This did not fix the issue, unfortunately.

@78Alpha

78Alpha commented May 4, 2024

I've managed to get to:

        for bib in range(len(output_lengths)):
            mel_length_pred = output_lengths[bib]
            mel_length_gt = int(mel_input_length[bib].item() / 2)
            if mel_length_gt <= mel_len or mel_length_pred <= mel_len:
                continue
            sp.append(s_preds[bib])

            random_start = np.random.randint(0, mel_length_pred - mel_len)
            en.append(asr_pred[bib, :, random_start:random_start+mel_len])
            p_en.append(p_pred[bib, :, random_start:random_start+mel_len])

            # get ground truth clips
            random_start = np.random.randint(0, mel_length_gt - mel_len)
            y = waves[bib][(random_start * 2) * 300:((random_start+mel_len) * 2) * 300]
            wav.append(torch.from_numpy(y).to(ref_text.device))
            
            if len(wav) >= self.batch_percentage * len(waves): # prevent OOM due to longer lengths
                break
        if len(sp) <= 1:
            return None

And that's where things go sour. With a batch size of 2 and batch_percentage: 0.5, the loop breaks after collecting a single clip, so len(sp) is always 1 and the function returns None, meaning SLMADV never starts. You need to change batch_percentage: 0.5 to at least batch_percentage: 1. However, if you were running at the edge of memory before (for example, 20 GB of 24 GB used), you will not be able to use this comfortably unless you have shared memory to spill into (e.g., 28.4 GB used on a 24 GB card). It will also take about 8 times longer unless you really crank down the slmadv min/max lengths, so instead of a max of 500 it would be a max of 220 or so.
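
To make that concrete, here is a standalone sketch of the arithmetic (assuming every clip in the batch passes the length checks; this is not the repo's code):

# Why len(sp) never exceeds 1 at batch_size=2 with batch_percentage=0.5.
batch_size = 2
batch_percentage = 0.5

collected = 0
for bib in range(batch_size):
    collected += 1                                  # one clip appended to sp/wav
    if collected >= batch_percentage * batch_size:  # 1 >= 1.0 -> break after the first clip
        break

print(collected)       # 1
print(collected <= 1)  # True -> slmadv returns None, so SLMADV never runs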

@godspirit00

@78Alpha
I tried batch size 2 with batch_percentage: 1 and max_len: 600; batch size 6 with batch_percentage: 0.8 and max_len: 200; and batch size 4 with batch_percentage: 0.5 and max_len: 400, but the results were the same: joint_epoch was reached, and occasionally slm_out at L502 was not None, but SLMADV training still didn't seem to start, as the related losses were still zero.

@78Alpha

78Alpha commented May 4, 2024

They're going to be zero for a while unless the conditions it's looking for are met. After about 1 epoch of training, my TensorBoard only showed 60 steps' worth of SLM training when I set batch_percentage to 1. I don't know exactly what it's looking for.

@GUUser91

@78Alpha
I'm not sure if I'm doing this correctly, but does this image from my tensorboard folder mean that I was able to start SLM Adversarial Training?
[TensorBoard screenshot]

@78Alpha

78Alpha commented May 13, 2024

Yeah, that's what it should look like. All graphs being filled in is the sign that all parts are working.

@GUUser91

GUUser91 commented May 16, 2024

I tinkered around with the config_ft.yml file. I set max_len to 120, batch_percentage to 1, slmadv_params min_len to 100, slmadv_params max_len to 120, and batch_size to 2. Now the DiscLM and GenLM loss stats are no longer at 0. I'm using an RX 7900 XTX. Note that I'm training the model with style diffusion in one fine-tuning session and adversarial training in another session.
[training loss screenshot]

Here's a pic from my TensorBoard:
[TensorBoard screenshot]

Edit: I discovered I can do style diffusion and SLM adversarial training together in one session. I set max_len to 252, epochs to 100, batch_size to 2, batch_percentage to 1, slmadv_params min_len to 180, slmadv_params max_len to 190, diff_epoch to 10, and joint_epoch to 50. I'm using the Vokan model as the base model.
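
For anyone trying to reproduce this, the combined-session settings above would look roughly like this in config_ft.yml (only the keys mentioned here; treat it as a sketch rather than a verified recipe):

epochs: 100
batch_size: 2
max_len: 252

loss_params:
  diff_epoch: 10   # style diffusion starts here
  joint_epoch: 50  # SLM adversarial (joint) training starts here

slmadv_params:
  min_len: 180
  max_len: 190
  batch_percentage: 1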

I also rented an H100 from RunPod and left slmadv_params at the default settings (min_len: 400 and max_len: 500), with batch size 2 and batch_percentage 1, and SLM adversarial training never started.
