SLM Adversarial Training did not start when finetuning #227

Open · godspirit00 opened this issue Apr 10, 2024 · 10 comments
@godspirit00

I tried fine-tuning on a small dataset with 2 speakers. I set epochs=25, diff_epoch=8, joint_epoch=15.
Style Diffusion training started as expected, but SLM Adversarial Training never started at any point during fine-tuning.

[screenshot of the training log]

My config is

log_dir: "Models/xxx"
save_freq: 1
log_interval: 10
device: "cuda"
epochs: 25 # number of finetuning epoch (1 hour of data)
batch_size: 2
max_len: 400 # maximum number of frames
pretrained_model: "Models/LibriTTS/epochs_2nd_00020.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "Data/xxx__train.txt"
  val_data: "Data/xxx__val.txt"
  root_path: ""
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true

  dim_in: 64 
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size
  
  dropout: 0.2

  # config for decoder
  decoder: 
      type: 'hifigan' # either hifigan or istftnet
      resblock_kernel_sizes: [3,7,11]
      upsample_rates :  [10,5,3,2]
      upsample_initial_channel: 512
      resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
      upsample_kernel_sizes: [20,10,6,4]
      
  # speech language model config
  slm:
      model: 'microsoft/wavlm-base-plus'
      sr: 16000 # sampling rate of SLM
      hidden: 768 # hidden size of SLM
      nlayers: 13 # number of layers of SLM
      initial_channel: 64 # initial channels of SLM discriminator head
  
  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0
  
loss_params:
    lambda_mel: 5. # mel reconstruction loss
    lambda_gen: 1. # generator loss
    lambda_slm: 1. # slm feature matching loss
    
    lambda_mono: 1. # monotonic alignment loss (TMA)
    lambda_s2s: 1. # sequence-to-sequence loss (TMA)

    lambda_F0: 1. # F0 reconstruction loss
    lambda_norm: 1. # norm reconstruction loss
    lambda_dur: 1. # duration loss
    lambda_ce: 20. # duration predictor probability output CE loss
    lambda_sty: 1. # style reconstruction loss
    lambda_diff: 1. # score matching loss
    
    diff_epoch: 8 # style diffusion starting epoch
    joint_epoch: 15 # joint training starting epoch

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.0001 # learning rate for acoustic modules
  
slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 10 # update the discriminator every this many generator updates
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling
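
For reference, with these settings I would expect the training phases to switch on like this (a standalone sketch of the epoch gating implied by the config comments, not the actual training-loop code):

# Standalone sketch of the expected epoch gating (illustrative only).
epochs, diff_epoch, joint_epoch = 25, 8, 15

for epoch in range(epochs):
    style_diffusion = epoch >= diff_epoch   # expected to start at epoch 8 (it does)
    slm_adversarial = epoch >= joint_epoch  # expected to start at epoch 15 (it never does)
    print(f"epoch {epoch:2d}  diffusion={style_diffusion}  slm_adv={slm_adversarial}")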

What have I missed? Thanks!

@DogeLord081

Same issue

@meng2468

You seem to be missing the configuration options that control when the second stage of training starts.

See lines 6 and 7 in the LibriTTS config file:

epochs_1st: 50 # number of epochs for first stage training (pre-training)
epochs_2nd: 30 # number of epochs for second stage training (joint training)

You should be able to kick off second-stage training by loading your current model checkpoint and setting epochs_1st to 0.
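
Concretely, the addition would look something like this in the fine-tuning config (values here are examples only; adjust them to your run):

epochs_1st: 0   # skip first-stage (pre-training) epochs when starting from a checkpoint
epochs_2nd: 25  # number of epochs for second-stage (joint) training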

@DogeLord081

> You seem to be missing the configuration options that control when the second stage of training starts. [...]

Thanks, do I also need the first_stage_path:?

@DogeLord081

> You seem to be missing the configuration options that control when the second stage of training starts. [...]

This did not fix the issue, unfortunately.

@78Alpha

78Alpha commented May 4, 2024

I've managed to get to:

        for bib in range(len(output_lengths)):
            mel_length_pred = output_lengths[bib]
            mel_length_gt = int(mel_input_length[bib].item() / 2)
            if mel_length_gt <= mel_len or mel_length_pred <= mel_len:
                continue
            sp.append(s_preds[bib])

            random_start = np.random.randint(0, mel_length_pred - mel_len)
            en.append(asr_pred[bib, :, random_start:random_start+mel_len])
            p_en.append(p_pred[bib, :, random_start:random_start+mel_len])

            # get ground truth clips
            random_start = np.random.randint(0, mel_length_gt - mel_len)
            y = waves[bib][(random_start * 2) * 300:((random_start+mel_len) * 2) * 300]
            wav.append(torch.from_numpy(y).to(ref_text.device))
            
            if len(wav) >= self.batch_percentage * len(waves): # prevent OOM due to longer lengths
                break
        if len(sp) <= 1:
            return None

And that's where things go sour. With a batch size of 2 and batch_percentage: 0.5, the loop breaks after collecting a single clip, so len(sp) is always 1 and the function returns None, meaning SLMADV never starts. You need to change batch_percentage: 0.5 to at least batch_percentage: 1. However, if you were running at the edge of memory before (for example, 20 GB of 24 GB used), you will not be able to use this comfortably unless you have shared memory to spill into (e.g., 28.4 GB used on a 24 GB card). It will also take about 8 times longer unless you really crank down the slmadv min/max lengths, so instead of a max of 500 it would be a max of 220 or so.
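
To make that concrete, here is a standalone sketch of the arithmetic (assuming every clip in the batch passes the length checks; this is not the repo's code):

# Why len(sp) never exceeds 1 at batch_size=2 with batch_percentage=0.5.
batch_size = 2
batch_percentage = 0.5

collected = 0
for bib in range(batch_size):
    collected += 1                                  # one clip appended to sp/wav
    if collected >= batch_percentage * batch_size:  # 1 >= 1.0 -> break after the first clip
        break

print(collected)       # 1
print(collected <= 1)  # True -> slmadv returns None, so SLMADV never runs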

@godspirit00

@78Alpha
I tried batch size 2 with batch_percentage: 1 and max_len: 600; batch size 6 with batch_percentage: 0.8 and max_len: 200; and batch size 4 with batch_percentage: 0.5 and max_len: 400, but the results were the same: joint_epoch was reached, and occasionally slm_out at L502 was not None, but SLMADV training still didn't seem to start, as the related losses were still zero.

@78Alpha

78Alpha commented May 4, 2024

They're going to be zero for a while unless the conditions it's looking for are met. After about 1 epoch of training, my TensorBoard only showed 60 steps' worth of SLM training when I set batch_percentage to 1. I don't know exactly what it's looking for.

@GUUser91

@78Alpha
I'm not sure if I'm doing this correctly, but does this image from my tensorboard folder mean that I was able to start SLM Adversarial Training?
[TensorBoard screenshot]

@78Alpha

78Alpha commented May 13, 2024

Yeah, that's what it should look like. All graphs being filled in is the sign that all parts are working.

@GUUser91

GUUser91 commented May 16, 2024

I tinkered around with the config_ft.yml file. I set max_len to 120, batch_percentage to 1, slmadv_params min_len to 100, slmadv_params max_len to 120, and batch_size to 2. Now the DiscLM and GenLM loss stats are no longer at 0. I'm using an RX 7900 XTX. Note that I'm training the model with style diffusion in one fine-tuning session and adversarial training in another session.
[training loss screenshot]

Here's a pic from my TensorBoard:
[TensorBoard screenshot]

Edit: I discovered I can do style diffusion and SLM adversarial training together in one session. I set max_len to 252, epochs to 100, batch_size to 2, batch_percentage to 1, slmadv_params min_len to 180, slmadv_params max_len to 190, diff_epoch to 10, and joint_epoch to 50. I'm using the Vokan model as the base model.
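
For anyone trying to reproduce this, the combined-session settings above would look roughly like this in config_ft.yml (only the keys mentioned here; treat it as a sketch rather than a verified recipe):

epochs: 100
batch_size: 2
max_len: 252

loss_params:
  diff_epoch: 10   # style diffusion starts here
  joint_epoch: 50  # SLM adversarial (joint) training starts here

slmadv_params:
  min_len: 180
  max_len: 190
  batch_percentage: 1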

I also rented an H100 from RunPod and left slmadv_params at the default settings (min_len: 400 and max_len: 500), with batch size 2 and batch_percentage 1, and SLM adversarial training never started.
