Rough fine-tuning guidance #70

RonanKMcGovern · 2024-02-21T16:21:49Z

I know the repo ReadMe says "soon", but would it be possible to give some very rough advice on how to fine-tune to improve on the voice's match with a custom speaker?

I guess the demo is just extracting embeddings from bria.mp3, but I'd like to go one step further to get a better voice match. Thanks.

vatsalaggarwal · 2024-02-21T17:08:46Z

Hey! Yes, when we released the repo, we thought it'd be the next thing we release, but we've reprioritised, so are busy with other things, and we've not actively started working on releasing finetuning support.

In terms of the voice cloning, have you tried to embed the voice you're trying to clone? What was the issue?

In general, we've found that the following works best:

>= 30seconds
non-compressed (i.e. not been run through mp3 compression, etc), high bandwidth audio
high SNR / low reverb (no background noise, close to the mic, etc)
american speakers

In terms of what's required for finetuning, there are 4 models chained together: i) first stage (from text -> 2 hierarchies of encodec), ii) second stage (2 hierarchies of encodec -> rest 6 hierarchies of encodec), iii) mbd (8 hierarchies of encodec -> waveform), iv) deepfilternet (cleanup; waveform -> waveform). In our testing, we found that the second stage is fairly robust to speakers, and accents; we haven't extensively tested it for non-English languages. So depending on what you're trying to do, I'd recommend focusing on finetuning primarily the first stage model 1B param model.

For that, we need:

Dataset preparation: we need 1) text tokens, 2) audio tokens, and 3) speaker embeddings for every utterance in the dataset.
- For text tokens, one can use the tokenizer per https://github.com/metavoiceio/metavoice-src/blob/main/fam/llm/sample.py#L177
- For audio tokens, one needs to extract encodec tokens... here's a gist of some example code (it's copy pasted, and edited, can't guarantee it works as-if): https://gist.github.com/vatsalaggarwal/795d51cca0f8350ba67e9e64dcf34785
- For speaker embedding, you can also follow the code at https://github.com/metavoiceio/metavoice-src/blob/main/fam/llm/sample.py#L435
Training loop: I would wrap the above into a dataloader, with a standard pytorch / lightning training loop alongwith the associated model.py

I might be missing a few things here as I'm putting this down from memory, but happy to assist with things as they come up. Sorry about the delay on this from our end, but we equally welcome contributions, and would move to support that instead!

RonanKMcGovern · 2024-02-22T09:52:58Z

Many thanks @vatsalaggarwal .

Re embeddings, I used a 90 s recording of my voice. I used an mp3 file, so perhaps I could have done better there, but I'm Irish so maybe that was the issue in getting a good match. I'll try out an American person's embeddings instead and see.

Yeah, I think I've got it on training. I guess the first stage produces encodec tokens? What do I use to decode those?

BTW, what model are you using for diffusion in the third stage, I don't see that on the model card on HF, but maybe I glanced over it? Thanks

vatsalaggarwal · 2024-02-22T10:04:53Z

Yeah, and a 44.1/48khz wav or 256kbps mp3 works better… but, it’s unlikely the model would be able to produce an Irish accent regardless in one-shot, at least we didn’t focus on it for this release.

The first stage produces the first two codebooks of the encodec RVQ.

you can decode these using the second-stage + MBD + deepfilternet.

We are using multi band diffusion from audio craft.

More details are available in https://github.com/metavoiceio/metavoice-src?tab=readme-ov-file#architecture and #70 (comment)

deeprobo-dev · 2024-02-26T08:49:06Z

Hi if I want to fine tune for different language other than English then which stage is most suitable for fine tuning, can you please share some insights on that? Also, I working on robotics application and would love to see faster inference with as much lesser vram as possible as I have to run it in a nvidia agx or orion.

Thanks for the amazing project

hrachkovinovoto · 2024-02-27T03:22:07Z

Hi if I want to fine tune for different language other than English then which stage is most suitable for fine tuning, can you please share some insights on that? Also, I working on robotics application and would love to see faster inference with as much lesser vram as possible as I have to run it in a nvidia agx or orion.

Thanks for the amazing project

+1, It'd be great if it manages to run on 8 GB VRAM at the very least.

vatsalaggarwal · 2024-02-27T09:11:38Z

Hi if I want to fine tune for different language other than English then which stage is most suitable for fine tuning, can you please share some insights on that? Also, I working on robotics application and would love to see faster inference with as much lesser vram as possible as I have to run it in a nvidia agx or orion.

I think you'd need to finetune both the first stage for sure, and I'm not sure if second stage needs to be finetuned, that's something you'd have to check.

Thanks for the request regarding faster inference and on-device! How much VRAM and compute (TFLOPs) are available?

deeprobo-dev · 2024-03-01T13:36:55Z

I think you'd need to finetune both the first stage for sure, and I'm not sure if second stage needs to be finetuned, that's something you'd have to check.

Thanks for the request regarding faster inference and on-device! How much VRAM and compute (TFLOPs) are available?

Thanks for your insights regarding fine tuning. I will give it a try. For NVIDIA JETSON AGX XAVIER its of 32 TFLOPS and for NVIDIA JETSON AGX ORIN is of 275 TFLOPS.

maepopi · 2024-03-01T16:33:44Z

Hello! If I understood correctly from this thread, it will soon be possible to finetune a model both with a model checkpoint and a LoRA with 12Gb VRAM (and possibly 8gb?)? How large the audio dataset should be for each?

thank you, this is very exciting!

RonanKMcGovern · 2024-03-01T17:13:19Z

Hey! Yes, when we released the repo, we thought it'd be the next thing we release, but we've reprioritised, so are busy with other things, and we've not actively started working on releasing finetuning support.

Just to add to this. I wouldn't underestimate how valuable it would be to release fine-tuning in a simple way. I'm not aware of frameworks that accurately can fine-tune for a specific voice. I'd be keen to make a video on this on youtube.com/@TrelisResearch if this becomes possible - it allows for things like making your own audio books.

maepopi · 2024-03-01T18:42:21Z

Just to add to this. I wouldn't underestimate how valuable it would be to release fine-tuning in a simple way. I'm not aware of frameworks that accurately can fine-tune for a specific voice. I'd be keen to make a video on this on youtube.com/@TrelisResearch if this becomes possible - it allows for things like making your own audio books.

If I might, have you checked out this tool? it's based off Tortoise-TTS and it's really good. I've been playing around with it for months and come up with pretty good models. I don't think it supports LoRA's though, and I'm starting to think that maybe you need rather large datasets for finetuning, which is why I'm very interested in the present repo. In addition, Metavoice seems to provide with a slightly better base model than Tortoise (but this would need to be tested further, it's just an impression for now).

danablend · 2024-03-03T07:54:22Z

Yeah, and a 44.1/48khz wav or 256kbps mp3 works better… but, it’s unlikely the model would be able to produce an Irish accent regardless in one-shot, at least we didn’t focus on it for this release.

The first stage produces the first two codebooks of the encodec RVQ.

you can decode these using the second-stage + MBD + deepfilternet.

We are using multi band diffusion from audio craft.

More details are available in https://github.com/metavoiceio/metavoice-src?tab=readme-ov-file#architecture and #70 (comment)

Hey! Thanks very much for your insight.

I'm in the midst of attempting to implement fine tuning, and I've gotten a very simple script to train but I could only get it to work by sequentially iterating over each data entry in a batch.

Would you happen to have an idea of how it might look adding batch inference support, so it can train on batches at a time during training / fine tuning?

Again, very much appreciate your work!

vatsalaggarwal · 2024-03-03T09:12:29Z

@danablend can you push your code to PR? I can have a look

danablend · 2024-03-04T10:22:00Z

@danablend can you push your code to PR? I can have a look

Hey @vatsalaggarwal, I realized that I had made a mistake and needed to build more code to make the training work.

I've spent a few hours working on it, but I get OOM when attempting to train the model with the gradients enabled on an A10G (16 GB VRAM), so I don't know yet if it works to push to the codebase. How many GB VRAM did you find you needed to train the model?

I'll play around with it more and see if I can prepare something useful and clean for you as a base to work off if that would be helpful, and I can open that as a PR?

Probably going to play with PEFT / LoRA approach, given my current resource limitations (someone did it here with LLaMA and VALL-E: https://arxiv.org/pdf/2401.00246, I will try a similar approach)

vatsalaggarwal · 2024-03-04T13:38:18Z

Really hard to know where the issue is without seeing the code! I think it should be possible to finetune on a 16GB GPU but depends on your config (batch size, optimiser choice, etc)...

If it's in a state where I'll be able to run it, that's ideal, but also happy to look at it in it's current state (even if it's in a super dirty state) and give pointers to speed you up.

danablend · 2024-03-04T14:10:53Z

Really hard to know where the issue is without seeing the code! I think it should be possible to finetune on a 16GB GPU but depends on your config (batch size, optimiser choice, etc)...

If it's in a state where I'll be able to run it, that's ideal, but also happy to look it in it's current state and give pointers to speed you up even if it's in a super dirty state.

I'll open a PR shortly

danablend · 2024-03-04T15:23:34Z

Really hard to know where the issue is without seeing the code! I think it should be possible to finetune on a 16GB GPU but depends on your config (batch size, optimiser choice, etc)...

If it's in a state where I'll be able to run it, that's ideal, but also happy to look at it in it's current state (even if it's in a super dirty state) and give pointers to speed you up.

Just added a PR draft (#82).

G-force78 · 2024-03-14T12:19:12Z

Thanks for all the work on the training script its way beyond my ability, I am testing it now and am not sure what --val should be pathed to? Is it the dataset csv file? I have also pathed --train to that as well.
Also, is there a way to set learning rate and number of steps? (Found it, its in .fam/llm/config/finetune_params.py).
From experience using tortoise I found 2000-2500 steps was a range to aim for when training 20minutes of clean audio with no silences.

Edit: ok so that was correct you have to set the argument --train dataset.csv --val valdataset.csv
No how to save every so often eg 1/4 of way 1/2 way then 3/4 to completion.
so far so good...using t4 on google colab

Training: loss 5.9145, time 37337.25ms: : 401it [06:31, 11.51s/it]iter 400: loss 5.9145, time 37337.25ms
Training: loss 5.9145, time 37337.25ms: : 401it [06:31, 1.02it/s]

Where do I place the model? App.py is not working at the moment.

Traceback (most recent call last):
File "/content/metavoice-src/app1.py", line 12, in
from fam.llm.sample import (
ModuleNotFoundError: No module named 'fam.llm.sample'

vatsalaggarwal added the feature request New feature or request label Mar 12, 2024

Ar4ikov mentioned this issue May 21, 2024

Finetuining 1B first-stage on non-English datasets: thoughts #157

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rough fine-tuning guidance #70

Rough fine-tuning guidance #70

RonanKMcGovern commented Feb 21, 2024

vatsalaggarwal commented Feb 21, 2024 •

edited

RonanKMcGovern commented Feb 22, 2024

vatsalaggarwal commented Feb 22, 2024 •

edited

deeprobo-dev commented Feb 26, 2024 •

edited

hrachkovinovoto commented Feb 27, 2024

vatsalaggarwal commented Feb 27, 2024 •

edited

deeprobo-dev commented Mar 1, 2024 •

edited

maepopi commented Mar 1, 2024

RonanKMcGovern commented Mar 1, 2024

maepopi commented Mar 1, 2024 •

edited

danablend commented Mar 3, 2024

vatsalaggarwal commented Mar 3, 2024

danablend commented Mar 4, 2024 •

edited

vatsalaggarwal commented Mar 4, 2024 •

edited

danablend commented Mar 4, 2024

danablend commented Mar 4, 2024

G-force78 commented Mar 14, 2024 •

edited

Rough fine-tuning guidance #70

Rough fine-tuning guidance #70

Comments

RonanKMcGovern commented Feb 21, 2024

vatsalaggarwal commented Feb 21, 2024 • edited

RonanKMcGovern commented Feb 22, 2024

vatsalaggarwal commented Feb 22, 2024 • edited

deeprobo-dev commented Feb 26, 2024 • edited

hrachkovinovoto commented Feb 27, 2024

vatsalaggarwal commented Feb 27, 2024 • edited

deeprobo-dev commented Mar 1, 2024 • edited

maepopi commented Mar 1, 2024

RonanKMcGovern commented Mar 1, 2024

maepopi commented Mar 1, 2024 • edited

danablend commented Mar 3, 2024

vatsalaggarwal commented Mar 3, 2024

danablend commented Mar 4, 2024 • edited

vatsalaggarwal commented Mar 4, 2024 • edited

danablend commented Mar 4, 2024

danablend commented Mar 4, 2024

G-force78 commented Mar 14, 2024 • edited

vatsalaggarwal commented Feb 21, 2024 •

edited

vatsalaggarwal commented Feb 22, 2024 •

edited

deeprobo-dev commented Feb 26, 2024 •

edited

vatsalaggarwal commented Feb 27, 2024 •

edited

deeprobo-dev commented Mar 1, 2024 •

edited

maepopi commented Mar 1, 2024 •

edited

danablend commented Mar 4, 2024 •

edited

vatsalaggarwal commented Mar 4, 2024 •

edited

G-force78 commented Mar 14, 2024 •

edited