Poor results with: voice_conversion_models--multilingual--vctk--freevc24.zip CoquiTTS #89

Open
ballerburg9005 opened this issue Jan 28, 2024 · 0 comments

ballerburg9005 commented Jan 28, 2024

At first I was somewhat impressed, using a male voice as the source and a female voice as the target; the target was 30 seconds long and noise-cleaned by AI. It pretty much made the source wav sound like a prepubescent boy, similar to the target wav speaker, just not really as feminine as the target wav.

I am quite familiar with creating clean 30-second voice samples, and with them the voice transfer works very well on the commercial Coqui TTS website, which I assume uses a variation of FreeVC.

But after this I tried many different male celebrity voices and the like (similar or low-pitched voices), and it all sounded like the same voice from some dude (whose voice wasn't all that manly) who was in neither of the provided sample wavs. There really was not much, if any, style transfer going on as far as the basic, fundamental parameters are concerned, like the tone/undertone, pitch, rasp, etc. of the voice, i.e. what makes it most recognizable. In that respect it sounded 90% like this same dude all the time (presumably some voice used to train some TTS that the model takes as its basis), 9% like the source wav and 1% like the target wav (it had all the nuances from the source wav and would sometimes also transfer nuances from the target wav, but only nuances). So if you put in Duke Nukem + Duke Nukem, you always get "this dude", who now speaks with the boastful, caricatured intonations from the source wav (not the target wav), but whose basic voice otherwise sounds nothing like Duke Nukem. Sometimes I could recognize the original source wav speaker's voice more strongly than at other times, and sometimes the intonation was poor or there were artifacts.

I also noticed that a sample rate of 48000 Hz works somewhat cleaner than 44100 Hz, but nothing else I tried, such as mono 16k, changed the overall result.
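To make comparisons like 48000 vs. 44100 vs. mono 16k reproducible, it can help to confirm what each input file actually contains before converting. The following is a minimal sketch using only Python's standard `wave` module; the helper name `wav_info` is made up for illustration and is not part of Coqui TTS, and it assumes uncompressed PCM WAV input.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, sample_width_bytes, duration_s).

    Hypothetical helper, not part of Coqui TTS. Assumes a PCM WAV file,
    which is what the standard-library wave module can read.
    """
    with wave.open(path, "rb") as w:
        rate = w.getframerate()      # e.g. 16000, 44100 or 48000
        channels = w.getnchannels()  # 1 = mono, 2 = stereo
        width = w.getsampwidth()     # bytes per sample, 2 = 16-bit
        duration = w.getnframes() / rate
    return rate, channels, width, duration
```

Calling `wav_info("untitled.wav")` and `wav_info("input.wav")` before a run shows whether the source and target really differ in sample rate or channel count.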

tts --model_name "voice_conversion_models/multilingual/vctk/freevc24" --source_wav untitled.wav --target_wav=input.wav --out_path=out.wav
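For trying many source/target pairs, the same command can be scripted. This is only a sketch: it assumes the `tts` executable is on `PATH` and uses just the flags shown in the command above; `build_vc_command` and `run_pairs` are hypothetical helper names, not part of Coqui TTS.

```python
import subprocess

# Model name taken from the command above.
MODEL = "voice_conversion_models/multilingual/vctk/freevc24"


def build_vc_command(source_wav, target_wav, out_path, model_name=MODEL):
    """Build the argument list for one voice-conversion run (hypothetical helper)."""
    return [
        "tts",
        "--model_name", model_name,
        "--source_wav", source_wav,
        "--target_wav", target_wav,
        "--out_path", out_path,
    ]


def run_pairs(pairs):
    """Run the CLI once per (source_wav, target_wav, out_path) tuple."""
    for source_wav, target_wav, out_path in pairs:
        # check=True raises CalledProcessError if the tts command fails.
        subprocess.run(build_vc_command(source_wav, target_wav, out_path), check=True)
```

For example, `run_pairs([("untitled.wav", "input.wav", "out.wav")])` repeats the single run above, and more tuples can be added to compare targets side by side.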

Is this a bug, or is this expected behavior?

The Coqui version of freevc24 is also substantially larger (1.6GB) and from March 2023 or so.
