
the use of "reference_audio" when inference #9

Open
HandsLing opened this issue Oct 19, 2022 · 9 comments

@HandsLing

Hi, I want to know: what is "reference_audio" used for during inference?

@HandsLing (Author)

@tuanh123789 Hi, can you give me some help?

@freshwindy

Hi, I have the same question. Did you solve it?

@cantabile-kwok

Maybe it is because AdaSpeech only has a phoneme-level predictor but no utterance-level one, so at inference you still need to feed in a reference mel to obtain the utterance-level vector. I am not sure.
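
Roughly what I have in mind, as a sketch only (this is not the code in this repo; every name and size below is made up):

```python
import torch
import torch.nn as nn

# Illustrative utterance-level encoder: map a reference mel to a single
# vector by pooling over time. Shapes/names are assumptions, not this repo's.
class UtteranceEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, n_mels, time)
        h = self.convs(ref_mel)
        # Mean-pool over time: one vector per utterance, so the result
        # reflects global acoustic conditions rather than the exact frames.
        return h.mean(dim=-1)  # (batch, hidden)

ref_mel = torch.randn(1, 80, 320)      # any reference mel of the speaker
utt_vec = UtteranceEncoder()(ref_mel)  # (1, 256), conditioning the decoder
```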

@cantabile-kwok

The original AdaSpeech paper says: "In the inference process, the utterance-level acoustic conditions are extracted from another reference speech of the speaker, and the phoneme-level acoustic conditions are predicted from phoneme-level acoustic predictor."

@freshwindy

Thank you for your answer. I'd like to know whether I can just use any reference audio without worrying about its content, or whether the text of the reference audio must match the content of the synthesized speech.

@cantabile-kwok

I don't think we need a reference audio with exactly the same content (otherwise text-to-speech would be pointless). In my understanding, the reference audio only provides some information about the acoustic conditions (and maybe also speaker information), so providing an arbitrary utterance of the target speaker is already reasonable.
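
To illustrate the point with a toy sketch (not this repo's API; all names and shapes are invented): the reference mel is pooled into one global vector, so only the phoneme input decides what is said.

```python
import torch
from torch import nn

class TinyTTS(nn.Module):
    def __init__(self, n_phones=50, n_mels=80, hidden=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, hidden)
        self.ref_proj = nn.Linear(n_mels, hidden)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, phonemes, ref_mel):
        # Average the reference mel over time and project it: the reference's
        # wording is pooled away, only global acoustic conditions survive.
        utt_vec = self.ref_proj(ref_mel.mean(dim=-1))        # (B, hidden)
        h = self.phone_emb(phonemes) + utt_vec.unsqueeze(1)  # broadcast over phonemes
        return self.out(h)                                   # (B, L, n_mels)

model = TinyTTS()
phonemes = torch.randint(0, 50, (1, 12))  # the sentence you actually want to say
ref_mel = torch.randn(1, 80, 400)         # any utterance of the target speaker
mel_out = model(phonemes, ref_mel)        # linguistic content comes only from `phonemes`
```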

@freshwindy

I used an utterance-level encoder during training, but I removed the reference audio when synthesizing speech. Does the utterance-level encoder still have an effect on the final audio?

@freshwindy

Doesn't including an utterance-level encoder further enrich the modelling information?

@cantabile-kwok

@freshwindy When you removed the reference audio, do you mean you replaced the utterance-level vector with all zeros? There still needs to be a vector to fill that slot. I haven't done any corresponding experiments yet. As for the utterance-level encoder, I can't come up with a reason why it wouldn't enrich the modeling information; the enrichment may just not be obvious enough to perceive. I'm not sure :)
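
For example (purely illustrative, assuming a 256-dimensional utterance-level vector):

```python
import torch

# The decoder still expects a vector of this shape, so "removing the
# reference audio" in practice means substituting something for the
# encoder output, e.g. an all-zero vector.
hidden = 256
utt_vec_ref = torch.randn(1, hidden)   # stand-in for encoder(ref_mel)
utt_vec_none = torch.zeros(1, hidden)  # "no reference" fallback

# With zeros, whatever acoustic-condition information the utterance-level
# encoder learned during training simply never reaches the decoder.
```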
