
Fine-tuning or training from scratch in a different language? #197

Open
paulovasconcellos-hotmart opened this issue Jan 30, 2024 · 21 comments
Labels
help wanted Extra attention is needed

Comments

@paulovasconcellos-hotmart

Hi everyone,
I'm considering putting some effort into training StyleTTS in Portuguese. I have a good-quality dataset for this task; however, I'm unsure whether it would be better just to fine-tune the model (which I know was trained on English) or, since Portuguese is an unseen language, to train the model from scratch.

Does anyone have some tips on what I should consider before making a decision?

@martinambrus

Definitely train a new PL-BERT for the new language. You can try the one trained in English, but even the author says it probably won't work.

@rlenain

rlenain commented Feb 28, 2024

Hi there -- I have trained a PL-BERT model on a 14-language dataset crowdsourced by the author of the paper. You can find this model open-sourced here: https://huggingface.co/papercup-ai/multilingual-pl-bert

Using this PL-BERT model, you can now train multilingual StyleTTS2 models. In my experiments, I have found that you don't need to train from scratch to get a multilingual StyleTTS2; you can just finetune. Follow the steps outlined in the link I shared above!
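The swap described above essentially amounts to pointing the training config's PL-BERT directory at the downloaded multilingual checkpoint. A hypothetical config excerpt (the `PLBERT_dir` field name is an assumption based on the repo's config files; verify against your own config before training):

```yaml
# Hypothetical StyleTTS2 fine-tuning config excerpt.
# Download https://huggingface.co/papercup-ai/multilingual-pl-bert
# into the directory below, replacing the English PL-BERT files.
PLBERT_dir: Utils/PLBERT/
```

The rest of the config (dataset paths, pretrained checkpoint) stays as in a normal fine-tuning run.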

Best of luck, and let me know what you make with this!

@paulovasconcellos-hotmart

Thank you very much for this @rlenain. I'll use this model to train StyleTTS on my data.

@Stardust-minus

> Hi there -- I have trained a PL-BERT model on a 14 language dataset which was crowdsourced by the author of the paper. You can find this model open-sourced here: https://huggingface.co/papercup-ai/multilingual-pl-bert
>
> Using this PL-BERT model, you can now train multilingual StyleTTS2 models. In my experiments, I have found that you don't need to train from scratch in order to train multilingual StyleTTS2, you can just finetune. Follow the steps outlined in the link I shared above!
>
> Best of luck, and let me know what you make with this!

Nice work! Did the Chinese data used for training include tones?

@rlenain

rlenain commented Feb 29, 2024

I'm not sure -- you can see a sample here (the data is from this dataset: https://huggingface.co/datasets/styletts2-community/multilingual-phonemes-10k-alpha/viewer/zh).
(screenshot of a data sample omitted)

@Frederieke93

Thank you very much @rlenain! This is a great addition! You mentioned that you can just finetune on a new language instead of training a new base model, and I'd like to try it. How large are the datasets you used for finetuning on a new language?

@rlenain

rlenain commented Mar 5, 2024

I tend to keep some English in the dataset (~5 hours) and have had success with as little as 20 hours of Spanish data split across 4 speakers.
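Mixing a small slice of English into the new-language training data can be sketched as a simple list-merging step. A minimal sketch, assuming the repo's `path|text|speaker_id` line format; hours are approximated here by line counts, and all names are illustrative:

```python
import random

def mix_train_lists(primary_lines, english_lines, english_keep=0.2, seed=0):
    """Combine a new-language train list with a slice of English data.

    `english_keep` is the fraction of English lines to retain, a stand-in
    for "~5 hours"; the real amount depends on your clip lengths.
    """
    rng = random.Random(seed)
    kept = rng.sample(english_lines, int(len(english_lines) * english_keep))
    mixed = primary_lines + kept
    rng.shuffle(mixed)  # interleave languages so batches stay mixed
    return mixed

# Hypothetical lists: 4 Spanish speakers plus 2 English speakers.
es = [f"es_{i}.wav|hola|{i % 4}" for i in range(100)]
en = [f"en_{i}.wav|hello|{4 + i % 2}" for i in range(50)]
print(len(mix_train_lists(es, en)))  # 110 lines (100 Spanish + 10 English)
```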

@casic

casic commented Mar 6, 2024 via email

@rlenain

rlenain commented Mar 6, 2024

@casic

casic commented Mar 6, 2024 via email

@yl4579 yl4579 added the help wanted Extra attention is needed label Mar 7, 2024
@ZYJGO

ZYJGO commented Mar 19, 2024

@rlenain
> i tend to keep some english in the dataset (~5 hours) and have had success with as little as 20 hours of Spanish data split across 4 speakers

Thanks for the great work! Do you have some samples to share? I'm very curious about the quality on a new language.

@rlenain

rlenain commented Mar 21, 2024

Unfortunately, because of the privacy policy covering the samples I trained on, I cannot share them here. What I can say is that the quality is very much on par with the English samples on the samples page.

@traderpedroso

traderpedroso commented Apr 3, 2024

> Unfortunately because of the privacy policy of the samples that I trained on, I cannot share these samples here. What I can say is that the quality is very much on-par with samples you can find on the samples page in English.

I would like to ask three questions. Do the speakers in the dataset need to be in a numeric format (e.g. speaker 0, 1, 2), and do they have to start from 0? Or can I give them all the same name, or even use a string name, to make the speakers easier to recognize? Also, after training, do I need to specify the speakers at inference time, and is the language selection automatic?
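On the speaker-format question: the repo's data loader appears to cast the speaker field of each `path|text|speaker` line to an integer, so string names would need remapping before training. A hedged sketch of such a remap (the line format and the integer-cast behavior are assumptions; check meldataset.py in your checkout):

```python
def remap_speakers(lines):
    """Map arbitrary speaker names in `path|text|speaker` lines
    to consecutive integer ids starting at 0."""
    name_to_id = {}
    out = []
    for line in lines:
        path, text, speaker = line.rstrip("\n").split("|")
        # Assign the next free integer id the first time a name is seen.
        sid = name_to_id.setdefault(speaker, len(name_to_id))
        out.append(f"{path}|{text}|{sid}")
    return out, name_to_id

lines, table = remap_speakers(["a.wav|ola|maria", "b.wav|oi|joao", "c.wav|sim|maria"])
print(lines)  # ['a.wav|ola|0', 'b.wav|oi|1', 'c.wav|sim|0']
```

Keeping the `name_to_id` table around also answers the inference side: it lets you look up which integer id corresponds to which speaker later.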

@sch0ngut

@rlenain

> i tend to keep some english in the dataset (~5 hours) and have had success with as little as 20 hours of Spanish data split across 4 speakers

@rlenain Do you mind sharing for how many epochs you fine-tuned?

@rlenain

rlenain commented Apr 30, 2024

@sch0ngut Generally for 50k-100k iterations, whatever that means in terms of epochs for the size of your dataset. But you should be following the validation curve.
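"Following the validation curve" can be made concrete as keeping the checkpoint with the lowest validation loss and stopping once it stops improving. A generic early-stopping sketch, not code from this repo:

```python
def best_checkpoint(val_losses, patience=3):
    """Return (index, loss) of the best validation evaluation, stopping
    once the loss has not improved for `patience` evaluations."""
    best_i, best = 0, float("inf")
    since_improve = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best_i, best = i, loss
            since_improve = 0
        else:
            since_improve += 1
            if since_improve >= patience:
                break  # validation curve has flattened or turned up
    return best_i, best

# Hypothetical per-evaluation validation losses:
print(best_checkpoint([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.75]))  # (2, 0.7)
```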

@21sK1p

21sK1p commented May 1, 2024

> Hi there -- I have trained a PL-BERT model on a 14 language dataset which was crowdsourced by the author of the paper. You can find this model open-sourced here: https://huggingface.co/papercup-ai/multilingual-pl-bert
>
> Using this PL-BERT model, you can now train multilingual StyleTTS2 models. In my experiments, I have found that you don't need to train from scratch in order to train multilingual StyleTTS2, you can just finetune. Follow the steps outlined in the link I shared above!
>
> Best of luck, and let me know what you make with this!

@rlenain What would I need to do to train it in Hindi?

@rlenain

rlenain commented May 1, 2024

You can probably just finetune StyleTTS2 without changing the PL-BERT model, and with the right kind and amount of data it would work.
If you want to train PL-BERT on Hindi, I believe there's data here: https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert

@JingchengYang4

@rlenain Regarding this multilingual PL-BERT: it appears the data used to train it was produced with a data-processing script that isn't available to the general public. How would we be able to tokenize the training data for StyleTTS in the same form as the BERT model?

@rlenain

rlenain commented May 2, 2024

The data here (https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert) has been tokenized using the tokenizer of the bert-base-multilingual-cased model: https://huggingface.co/google-bert/bert-base-multilingual-cased
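So to inspect or reproduce that tokenization, loading the same checkpoint's tokenizer via `transformers` should suffice. A small sketch (needs network access the first time to download the tokenizer files):

```python
from transformers import AutoTokenizer

# Same checkpoint named in the comment above.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Tokenize a short Portuguese phrase the way the dataset was tokenized.
ids = tok("olá mundo")["input_ids"]
print(tok.convert_ids_to_tokens(ids))  # subword tokens wrapped in [CLS] ... [SEP]
```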

@chocolatedesue

chocolatedesue commented May 8, 2024

Hello @rlenain,

I've successfully trained StyleTTS2 with the multilingual PL-BERT from this source during the first stage using the LJSpeech dataset provided in this repository.

However, I encountered an issue at the start of the second stage where NaN values appeared. Could you help me identify any potential mistakes?

Here's what I've done so far:

  1. Converted the source WAV files to a 24k WAV format.
  2. Replaced the files in Utils/PLBERT/ with the multilingual PL-BERT.
  3. Conducted training on eight 3090 cards for 12 hours without any other modifications.
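Step 1 above (converting source audio to 24 kHz) can be done with any resampler; a minimal sketch using polyphase resampling via scipy (librosa, torchaudio, or sox would work equally well):

```python
import numpy as np
from scipy.signal import resample_poly

def to_24k(wav, sr):
    """Resample a mono waveform to 24 kHz."""
    if sr == 24000:
        return wav
    g = np.gcd(sr, 24000)
    # Polyphase resampling with the reduced up/down ratio.
    return resample_poly(wav, 24000 // g, sr // g)

# 1 second of a 440 Hz tone at 48 kHz -> 24000 samples after resampling.
sr = 48000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
y = to_24k(x, sr)
print(len(y))  # 24000
```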

(first-stage loss graph screenshot omitted)

Update:

  1. While debugging, I found the first NaN comes from:
    F0_fake, N_fake = model.predictor.F0Ntrain(p_en, s_dur)
@chocolatedesue

chocolatedesue commented May 8, 2024

Solved it: it was just a bad config that caused the first-stage parameters to be loaded into the second-stage model.

I should set first_stage_path instead of pretrained_model.
