
Better LJSpeech or LibriTTS for finetuning a single speaker voice? Or training from scratch with not so much data? #226

Sweetapocalyps3 opened this issue Apr 2, 2024 · 3 comments


Sweetapocalyps3 commented Apr 2, 2024

Hi everyone,

I'm wondering whether LJSpeech or LibriTTS is the proper candidate for fine-tuning a single-speaker voice.
I've seen that there is a multispeaker boolean field in the configuration, which in my case should presumably be set to false, but I don't know if this implies I have to use LJSpeech, since LibriTTS is a multi-speaker dataset.

Or would it be even better to train the model from scratch? I'm considering it, but I suspect I have too few samples (126 files of clean audio, almost 19 minutes in total).

Thank you in advance.

@meng2468

LibriTTS is by far the better choice: the model has seen multiple speakers and can adapt far better to a small single-speaker dataset.

You can leave all of the settings in config_ft.yml the same (changing only the dataset paths, then batch size and window size depending on your hardware). Multi-speaker should be kept set to true; just make sure that in your dataset metafiles the speaker_id is set to the same ID for every file, as sketched below.
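For illustration only, here is a hypothetical excerpt of what such a metafile could look like, assuming a `path|text|speaker_id` layout (the exact columns and file name depend on the repo's data loader); the only point is that every line carries the same speaker ID:

```
# train_list.txt (hypothetical excerpt) -- all entries use speaker ID 0
wavs/clip_0001.wav|transcription of the first clip|0
wavs/clip_0002.wav|transcription of the second clip|0
wavs/clip_0003.wav|transcription of the third clip|0
```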

Training the model from scratch with 19 minutes of data will most likely yield bad results, although I haven't tried it myself.

Helpful details on fine-tuning: #81

@GUUser91

You can use Vokan.
https://huggingface.co/ShoukanLabs/Vokan

@traderpedroso

You can use Vokan.

https://huggingface.co/ShoukanLabs/Vokan

The expressions and emphasis in the voices sound really natural, but there are always noises at the beginning and especially at the end. I believe a pad of silence at the start and end was missing during the training.
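As a possible workaround (not something from the thread, just a sketch), one could pad a short stretch of silence onto each clip before training or after synthesis. A minimal example using `soundfile` and `numpy`, assuming mono WAV input; the 200 ms padding length is an arbitrary choice:

```python
import numpy as np
import soundfile as sf

def pad_silence(in_path: str, out_path: str, pad_ms: int = 200) -> None:
    """Write a copy of a mono audio file with pad_ms of silence at the start and end."""
    audio, sr = sf.read(in_path)
    pad = np.zeros(int(sr * pad_ms / 1000), dtype=audio.dtype)
    sf.write(out_path, np.concatenate([pad, audio, pad]), sr)

# Example usage (hypothetical paths):
# pad_silence("wavs/clip_0001.wav", "wavs_padded/clip_0001.wav")
```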
