
the use of "reference_audio" when inference #9

Open
HandsLing opened this issue Oct 19, 2022 · 9 comments

@HandsLing

Hi, I want to know: what is "reference_audio" used for during inference?

@HandsLing (Author)

@tuanh123789 Hi, can you give me some help?

@freshwindy

Hi, I have the same question. Did you solve it?

@cantabile-kwok

Maybe it is because AdaSpeech only has a phoneme-level predictor but no utterance-level one, so at inference you still need to feed in a reference mel to obtain the utterance-level vector. I am not sure.
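
Roughly what I have in mind, as a sketch only (this is not the code in this repo; every name and size below is made up):

```python
import torch
import torch.nn as nn

# Illustrative utterance-level encoder: map a reference mel to a single
# vector by pooling over time. Shapes/names are assumptions, not this repo's.
class UtteranceEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, n_mels, time)
        h = self.convs(ref_mel)
        # Mean-pool over time: one vector per utterance, so the result
        # reflects global acoustic conditions rather than the exact frames.
        return h.mean(dim=-1)  # (batch, hidden)

ref_mel = torch.randn(1, 80, 320)      # any reference mel of the speaker
utt_vec = UtteranceEncoder()(ref_mel)  # (1, 256), conditioning the decoder
```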

@cantabile-kwok

The original AdaSpeech paper says: "In the inference process, the utterance-level acoustic conditions are extracted from another reference speech of the speaker, and the phoneme-level acoustic conditions are predicted from phoneme-level acoustic predictor."

@freshwindy

Thank you for your answer. I'd like to know whether I can just use any reference audio without worrying about its content, or whether the text of the reference audio must match the content of the synthesized speech.

@cantabile-kwok

I don't think we need a reference audio with exactly the same content (otherwise text-to-speech would be pointless). In my understanding, the reference audio only provides some information about the acoustic conditions (and maybe also speaker information), so providing an arbitrary utterance of the target speaker is already reasonable.
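
To illustrate the point with a toy sketch (not this repo's API; all names and shapes are invented): the reference mel is pooled into one global vector, so only the phoneme input decides what is said.

```python
import torch
from torch import nn

class TinyTTS(nn.Module):
    def __init__(self, n_phones=50, n_mels=80, hidden=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, hidden)
        self.ref_proj = nn.Linear(n_mels, hidden)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, phonemes, ref_mel):
        # Average the reference mel over time and project it: the reference's
        # wording is pooled away, only global acoustic conditions survive.
        utt_vec = self.ref_proj(ref_mel.mean(dim=-1))        # (B, hidden)
        h = self.phone_emb(phonemes) + utt_vec.unsqueeze(1)  # broadcast over phonemes
        return self.out(h)                                   # (B, L, n_mels)

model = TinyTTS()
phonemes = torch.randint(0, 50, (1, 12))  # the sentence you actually want to say
ref_mel = torch.randn(1, 80, 400)         # any utterance of the target speaker
mel_out = model(phonemes, ref_mel)        # linguistic content comes only from `phonemes`
```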

@freshwindy

I used an utterance-level encoder during training, but I removed the reference audio when synthesizing speech. Does the utterance-level encoder still have an effect on the final audio?

@freshwindy

Doesn't including an utterance-level encoder further enrich the modelling information?

@cantabile-kwok

@freshwindy When you removed the reference audio, do you mean you replaced the utterance-level vector with all zeros? There still needs to be a vector to fill that slot. I haven't done any corresponding experiments yet. As for the utterance-level encoder, I can't come up with a reason why it wouldn't enrich the modeling information; the enrichment may just not be obvious enough to perceive. I'm not sure :)
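
For example (purely illustrative, assuming a 256-dimensional utterance-level vector):

```python
import torch

# The decoder still expects a vector of this shape, so "removing the
# reference audio" in practice means substituting something for the
# encoder output, e.g. an all-zero vector.
hidden = 256
utt_vec_ref = torch.randn(1, hidden)   # stand-in for encoder(ref_mel)
utt_vec_none = torch.zeros(1, hidden)  # "no reference" fallback

# With zeros, whatever acoustic-condition information the utterance-level
# encoder learned during training simply never reaches the decoder.
```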
