
Rough fine-tuning guidance #70

Open
RonanKMcGovern opened this issue Feb 21, 2024 · 17 comments
Labels
feature request (New feature or request)

Comments

@RonanKMcGovern

I know the repo ReadMe says "soon", but would it be possible to give some very rough advice on how to fine-tune to improve on the voice's match with a custom speaker?

I guess the demo is just extracting embeddings from bria.mp3, but I'd like to go one step further to get a better voice match. Thanks.

@vatsalaggarwal
Contributor

vatsalaggarwal commented Feb 21, 2024

Hey! Yes, when we released the repo, we thought it'd be the next thing we released, but we've reprioritised and are busy with other things, so we haven't actively started working on releasing finetuning support.

In terms of the voice cloning, have you tried to embed the voice you're trying to clone? What was the issue?

In general, we've found that the following works best for the reference audio (a rough preprocessing sketch follows the list):

  • >= 30 seconds of audio
  • uncompressed (i.e. not run through MP3 compression, etc.), high-bandwidth audio
  • high SNR / low reverb (no background noise, close to the mic, etc.)
  • American speakers
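For what it's worth, a minimal preprocessing sketch along those lines (this assumes torchaudio; the filenames and the 24 kHz target rate are placeholders, so check what the model actually expects):

```python
# Minimal sketch: check duration and resample a reference clip so it roughly
# matches the guidance above. Filenames and the 24 kHz target are assumptions.
import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load("my_speaker.wav")      # hypothetical reference recording
wav = wav.mean(dim=0, keepdim=True)              # downmix to mono

duration_s = wav.shape[-1] / sr
assert duration_s >= 30, f"clip is only {duration_s:.1f}s; aim for >= 30s"

target_sr = 24_000                               # assumed target sample rate
if sr != target_sr:
    wav = F.resample(wav, sr, target_sr)

torchaudio.save("my_speaker_clean.wav", wav, target_sr)
```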

In terms of what's required for finetuning, there are 4 models chained together: i) first stage (text -> first 2 hierarchies of EnCodec), ii) second stage (2 hierarchies of EnCodec -> remaining 6 hierarchies of EnCodec), iii) MBD (8 hierarchies of EnCodec -> waveform), iv) DeepFilterNet (cleanup; waveform -> waveform). In our testing, we found that the second stage is fairly robust to speakers and accents; we haven't extensively tested it for non-English languages. So depending on what you're trying to do, I'd recommend focusing primarily on finetuning the first-stage 1B-param model.
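To make that chain concrete, here is a rough pseudocode sketch of the flow (the function names are placeholders, not the repo's actual APIs):

```python
# Rough sketch of the four-model chain described above.
# These function names are placeholders, not the repo's actual APIs.
def tts(text: str, speaker_embedding):
    coarse = first_stage(text, speaker_embedding)  # text -> EnCodec codebooks 1-2 (the ~1B-param model)
    full = second_stage(coarse)                    # codebooks 1-2 -> codebooks 1-8
    wav = multi_band_diffusion(full)               # 8 codebooks -> waveform
    return deepfilternet(wav)                      # waveform -> cleaned-up waveform
```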

For that, we need:

I might be missing a few things here as I'm putting this down from memory, but happy to assist with things as they come up. Sorry about the delay on this from our end, but we equally welcome contributions, and would move to support that instead!

@RonanKMcGovern
Author

Many thanks @vatsalaggarwal .

Re embeddings, I used a 90 s recording of my voice. I used an mp3 file, so perhaps I could have done better there, but I'm Irish so maybe that was the issue in getting a good match. I'll try out an American person's embeddings instead and see.

Yeah, I think I've got it on training. I guess the first stage produces encodec tokens? What do I use to decode those?

BTW, what model are you using for diffusion in the third stage? I don't see it on the model card on HF, but maybe I glanced over it. Thanks

@vatsalaggarwal
Contributor

vatsalaggarwal commented Feb 22, 2024

Yeah, and a 44.1/48 kHz WAV or 256 kbps MP3 works better… but it's unlikely the model would be able to produce an Irish accent in one-shot regardless; at least, we didn't focus on that for this release.

The first stage produces the first two codebooks of the encodec RVQ.

You can decode these using the second stage + MBD + DeepFilterNet.

We are using multi-band diffusion (MBD) from AudioCraft.
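If it helps, decoding full EnCodec tokens to audio with AudioCraft's MBD looks roughly like this (a sketch based on AudioCraft's public API; the bandwidth setting and token shape are assumptions, and DeepFilterNet cleanup would still run on top):

```python
# Sketch: turn 8-codebook EnCodec tokens into a waveform with AudioCraft's
# MultiBandDiffusion. Bandwidth and token shape are assumptions.
import torch
from audiocraft.models import MultiBandDiffusion

mbd = MultiBandDiffusion.get_mbd_24khz(bw=6.0)     # 24 kHz EnCodec, 6 kbps (8 codebooks)

# tokens: [batch, n_codebooks, n_frames] integer EnCodec codes
tokens = torch.zeros(1, 8, 750, dtype=torch.long)  # placeholder codes (~10 s at 75 Hz)
wav = mbd.tokens_to_wav(tokens)                     # [batch, 1, n_samples] at 24 kHz
```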

More details are available in https://github.com/metavoiceio/metavoice-src?tab=readme-ov-file#architecture and #70 (comment)

@deeprobo-dev

deeprobo-dev commented Feb 26, 2024

Hi, if I want to fine-tune for a language other than English, which stage is most suitable for fine-tuning? Can you please share some insights on that? Also, I'm working on a robotics application and would love to see faster inference with as little VRAM as possible, as I have to run it on an NVIDIA AGX or Orin.

Thanks for the amazing project

@hrachkovinovoto

Hi, if I want to fine-tune for a language other than English, which stage is most suitable for fine-tuning? Can you please share some insights on that? Also, I'm working on a robotics application and would love to see faster inference with as little VRAM as possible, as I have to run it on an NVIDIA AGX or Orin.

Thanks for the amazing project

+1, It'd be great if it manages to run on 8 GB VRAM at the very least.

@vatsalaggarwal
Contributor

vatsalaggarwal commented Feb 27, 2024

Hi, if I want to fine-tune for a language other than English, which stage is most suitable for fine-tuning? Can you please share some insights on that? Also, I'm working on a robotics application and would love to see faster inference with as little VRAM as possible, as I have to run it on an NVIDIA AGX or Orin.

I think you'd need to finetune the first stage for sure; I'm not sure if the second stage needs to be finetuned as well, that's something you'd have to check.

Thanks for the request regarding faster inference and on-device! How much VRAM and compute (TFLOPs) are available?

@deeprobo-dev

deeprobo-dev commented Mar 1, 2024

I think you'd need to finetune the first stage for sure; I'm not sure if the second stage needs to be finetuned as well, that's something you'd have to check.

Thanks for the request regarding faster inference and on-device! How much VRAM and compute (TFLOPs) are available?

Thanks for your insights regarding fine-tuning, I will give it a try. The NVIDIA Jetson AGX Xavier is 32 TFLOPS and the NVIDIA Jetson AGX Orin is 275 TFLOPS.

@maepopi

maepopi commented Mar 1, 2024

Hello! If I understood this thread correctly, it will soon be possible to finetune a model both from a full checkpoint and with a LoRA on 12 GB of VRAM (and possibly 8 GB)? How large should the audio dataset be for each?

thank you, this is very exciting!

@RonanKMcGovern
Author

Hey! Yes, when we released the repo, we thought it'd be the next thing we released, but we've reprioritised and are busy with other things, so we haven't actively started working on releasing finetuning support.

Just to add to this: I wouldn't underestimate how valuable it would be to release fine-tuning in a simple way. I'm not aware of frameworks that can accurately fine-tune for a specific voice. I'd be keen to make a video on this on youtube.com/@TrelisResearch if it becomes possible - it allows for things like making your own audiobooks.

@maepopi

maepopi commented Mar 1, 2024

Just to add to this: I wouldn't underestimate how valuable it would be to release fine-tuning in a simple way. I'm not aware of frameworks that can accurately fine-tune for a specific voice. I'd be keen to make a video on this on youtube.com/@TrelisResearch if it becomes possible - it allows for things like making your own audiobooks.

If I might, have you checked out this tool? It's based on Tortoise-TTS and it's really good. I've been playing around with it for months and have come up with pretty good models. I don't think it supports LoRAs though, and I'm starting to think that you may need rather large datasets for finetuning, which is why I'm very interested in the present repo. In addition, MetaVoice seems to provide a slightly better base model than Tortoise (but this would need to be tested further; it's just an impression for now).

@danablend

Yeah, and a 44.1/48 kHz WAV or 256 kbps MP3 works better… but it's unlikely the model would be able to produce an Irish accent in one-shot regardless; at least, we didn't focus on that for this release.

The first stage produces the first two codebooks of the encodec RVQ.

You can decode these using the second stage + MBD + DeepFilterNet.

We are using multi-band diffusion (MBD) from AudioCraft.

More details are available in https://github.com/metavoiceio/metavoice-src?tab=readme-ov-file#architecture and #70 (comment)

Hey! Thanks very much for your insight.

I'm in the midst of attempting to implement fine-tuning, and I've gotten a very simple script to train, but I could only get it to work by iterating sequentially over each data entry in a batch.

Would you happen to have an idea of how adding batched support might look, so it can process a whole batch at a time during training / fine-tuning?
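For context, what I have in mind is essentially a standard pad-and-mask collate over the tokenised examples, roughly like this (a generic sketch, not code from this repo; the names are made up):

```python
# Generic pad-and-mask collate for variable-length token sequences.
# Not code from this repo; names are made up for illustration.
import torch

def collate(examples, pad_id: int = 0):
    # examples: list of 1-D LongTensors with different lengths
    max_len = max(x.size(0) for x in examples)
    tokens = torch.full((len(examples), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros(len(examples), max_len, dtype=torch.bool)
    for i, x in enumerate(examples):
        tokens[i, : x.size(0)] = x
        mask[i, : x.size(0)] = True
    # the mask is used to ignore padded positions in attention and in the loss
    return tokens, mask
```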

Again, very much appreciate your work!

@vatsalaggarwal
Contributor

@danablend can you push your code to a PR? I can have a look

@danablend

danablend commented Mar 4, 2024

@danablend can you push your code to a PR? I can have a look

Hey @vatsalaggarwal, I realized that I had made a mistake and needed to build more code to make the training work.

I've spent a few hours working on it, but I get an OOM when attempting to train the model with gradients enabled on an A10G (16 GB VRAM), so I don't yet know if it works well enough to push to the codebase. How much VRAM did you find you needed to train the model?

I'll play around with it more and see if I can prepare something useful and clean for you as a base to work off, if that would be helpful, and I can open that as a PR?

  • Probably going to try a PEFT / LoRA approach, given my current resource limitations (someone did it with LLaMA and VALL-E here: https://arxiv.org/pdf/2401.00246, I will try a similar approach); rough sketch below.
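Roughly what I mean by that (a sketch assuming the first-stage transformer can be wrapped with Hugging Face PEFT; the target_modules name is a guess and would need to match the model's actual layer names):

```python
# Sketch: attach LoRA adapters to an already-loaded first-stage transformer
# via Hugging Face PEFT. target_modules is a guess and must be adjusted to
# the model's real attention projection layer names.
from peft import LoraConfig, get_peft_model

def wrap_with_lora(first_stage_model):
    cfg = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["c_attn"],  # hypothetical layer name
    )
    model = get_peft_model(first_stage_model, cfg)
    model.print_trainable_parameters()  # should be a tiny fraction of the ~1B params
    return model
```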

@vatsalaggarwal
Contributor

vatsalaggarwal commented Mar 4, 2024

Really hard to know where the issue is without seeing the code! I think it should be possible to finetune on a 16 GB GPU, but it depends on your config (batch size, optimiser choice, etc.)...

If it's in a state where I'll be able to run it, that's ideal, but I'm also happy to look at it in its current state (even if it's in a super dirty state) and give pointers to speed you up.
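To give a flavour of the knobs I mean (a generic memory-saving setup, not this repo's actual training code; bitsandbytes here is just one option):

```python
# Generic memory-saving finetuning knobs for a ~16 GB GPU: tiny micro-batch,
# gradient accumulation, and an 8-bit optimiser. Not this repo's training code.
import bitsandbytes as bnb
import torch

def build_training_setup(model: torch.nn.Module):
    micro_batch_size = 1   # per-step batch kept tiny...
    grad_accum_steps = 16  # ...with accumulation for an effective batch of 16
    optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-5)  # 8-bit optimiser states
    return micro_batch_size, grad_accum_steps, optimizer
```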

@danablend

Really hard to know where the issue is without seeing the code! I think it should be possible to finetune on a 16 GB GPU, but it depends on your config (batch size, optimiser choice, etc.)...

If it's in a state where I'll be able to run it, that's ideal, but I'm also happy to look at it in its current state (even if it's in a super dirty state) and give pointers to speed you up.

I'll open a PR shortly

@danablend

Really hard to know where the issue is without seeing the code! I think it should be possible to finetune on a 16 GB GPU, but it depends on your config (batch size, optimiser choice, etc.)...

If it's in a state where I'll be able to run it, that's ideal, but I'm also happy to look at it in its current state (even if it's in a super dirty state) and give pointers to speed you up.

Just added a PR draft (#82).

@vatsalaggarwal added the feature request label Mar 12, 2024
@G-force78

G-force78 commented Mar 14, 2024

Thanks for all the work on the training script, it's way beyond my ability. I am testing it now and am not sure what --val should point to. Is it the dataset CSV file? I have also pointed --train to that.
Also, is there a way to set the learning rate and number of steps? (Found it, it's in fam/llm/config/finetune_params.py.)
From experience using Tortoise, I found 2000-2500 steps was a good range to aim for when training on 20 minutes of clean audio with no silences.

Edit: OK, so that was correct, you have to set the arguments --train dataset.csv --val valdataset.csv.
Now, how to save checkpoints every so often, e.g. 1/4, 1/2, and 3/4 of the way to completion?
So far so good... using a T4 on Google Colab.
[screenshot: memory usage during training]

Training: loss 5.9145, time 37337.25ms: : 401it [06:31, 11.51s/it]iter 400: loss 5.9145, time 37337.25ms
Training: loss 5.9145, time 37337.25ms: : 401it [06:31, 1.02it/s]

Where do I place the model? App.py is not working at the moment.

Traceback (most recent call last):
File "/content/metavoice-src/app1.py", line 12, in
from fam.llm.sample import (
ModuleNotFoundError: No module named 'fam.llm.sample'
