Feature Request Thread #467

Closed
dathudeptrai opened this issue Jan 20, 2021 · 36 comments
Labels
enhancement 🚀 New feature or request · Feature Request 🤗 Feature support · help wanted 🧐 Extra attention is needed · wontfix

Comments

@dathudeptrai (Collaborator) commented Jan 20, 2021

Don't hesitate to tell me what features you want in this repo :)))

@dathudeptrai added the enhancement, Feature Request, and help wanted labels on Jan 20, 2021
@dathudeptrai pinned this issue on Jan 20, 2021
@unparalleled-ysj commented:

@dathudeptrai What do you think of voice cloning?

@mikus commented Jan 21, 2021

I would like to see better componentization. There are similar blocks (groups of layers) implemented multiple times, such as positional encoding, speaker encoding, or the postnet. Others rely on configuration specific to one particular network, like the self-attention block used in FastSpeech. With a little rework to make those blocks more generic, it would be easier to create new network types. The same goes for losses: for example, the HiFi-GAN training contains a lot of code duplicated from MB-MelGAN. Moreover, most of the training and inference scripts look quite similar, and I believe they could also be refactored so that, once again, the final solution is composed from more generic components.
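
To make the idea concrete, here is a minimal sketch of the kind of shared block I mean: one sinusoidal positional-encoding layer that every model could import instead of re-implementing it. The class name and defaults below are illustrative, not taken from the current code:

```python
import numpy as np
import tensorflow as tf


class SinusoidalPositionalEncoding(tf.keras.layers.Layer):
    """A single shared sinusoidal positional-encoding layer that any model
    (Tacotron-2, FastSpeech, ...) could reuse instead of re-implementing."""

    def __init__(self, max_length=2048, hidden_size=384, **kwargs):
        super().__init__(**kwargs)
        position = np.arange(max_length)[:, np.newaxis]
        div_term = np.exp(np.arange(0, hidden_size, 2) * -(np.log(10000.0) / hidden_size))
        table = np.zeros((max_length, hidden_size), dtype=np.float32)
        table[:, 0::2] = np.sin(position * div_term)
        table[:, 1::2] = np.cos(position * div_term)
        self.table = tf.constant(table[np.newaxis, :, :])  # [1, max_length, hidden]

    def call(self, inputs):
        # inputs: [batch, time, hidden]; add the encoding for the first `time` steps.
        return inputs + self.table[:, : tf.shape(inputs)[1], :]
```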

And BTW, I really appreciate your work and think you did a great job! :)

@dathudeptrai (Collaborator, Author) commented:

> the HiFi-GAN training contains a lot of code duplicated from MB-MelGAN

Hmm, in this case users just need to read and understand HiFi-GAN without reading MB-MelGAN.

@ZDisket (Collaborator) commented Jan 22, 2021

@unparalleled-ysj What do you mean by voice cloning? You mean zero-shot?

@unparalleled-ysj commented:

> @unparalleled-ysj What do you mean by voice cloning? You mean zero-shot?

For example, given a short segment of the target speaker's voice, the model should synthesize speech in that speaker's timbre without being retrained - for instance, by using voiceprint technology to extract a speaker embedding and training a multi-speaker TTS model on it.

@ZDisket (Collaborator) commented Jan 22, 2021

@unparalleled-ysj That's what I was thinking about. Relatedly, @dathudeptrai, I saw https://github.com/dipjyoti92/SC-WaveRNN - could an SC-MB-MelGAN be possible?

@luan78zaoha (Contributor) commented:

@unparalleled-ysj @ZDisket That is also what I'm doing. I'm trying to train a multi-speaker FastSpeech2 model, replacing the current hard-coded speaker ID with a bottleneck feature extracted by a voiceprint model. The continuous, soft-coded bottleneck feature represents a speaker-related space: if an unknown voice is similar to a voice in the training space, voice cloning may be possible. But judging from the results of current open-source projects, it is a difficult problem and certainly not as simple as I have described. Do you have any good ideas?
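
For illustration, the change I have in mind is roughly the following. This is only a minimal sketch assuming a Keras-style encoder output and a fixed-size voiceprint vector; the layer and argument names are hypothetical, not the repo's actual API:

```python
import tensorflow as tf


class SpeakerConditioning(tf.keras.layers.Layer):
    """Condition the decoder on a continuous speaker (voiceprint) embedding
    by projecting it and adding it to the encoder output, instead of looking
    up a hard-coded speaker ID in an embedding table."""

    def __init__(self, hidden_size=384, **kwargs):
        super().__init__(**kwargs)
        self.projection = tf.keras.layers.Dense(hidden_size)

    def call(self, encoder_output, speaker_embedding):
        # encoder_output: [batch, time, hidden]; speaker_embedding: [batch, emb_dim]
        speaker_bias = self.projection(speaker_embedding)       # [batch, hidden]
        return encoder_output + speaker_bias[:, tf.newaxis, :]  # broadcast over time
```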

@mikus commented Feb 15, 2021

One possible option for better support of multiple speakers or styles would be to add a Variational Auto-Encoder (VAE) that automatically extracts this voice/style "fingerprint".
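
To sketch what I mean (the layer sizes and latent dimension below are illustrative assumptions, not values from any paper or from this repo):

```python
import tensorflow as tf


class StyleVAE(tf.keras.layers.Layer):
    """Reference encoder that compresses a mel spectrogram into a latent
    voice/style vector using the VAE reparameterization trick."""

    def __init__(self, latent_dim=16, **kwargs):
        super().__init__(**kwargs)
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Conv1D(128, 3, strides=2, activation="relu"),
            tf.keras.layers.Conv1D(128, 3, strides=2, activation="relu"),
            tf.keras.layers.GlobalAveragePooling1D(),
        ])
        self.to_mu = tf.keras.layers.Dense(latent_dim)
        self.to_logvar = tf.keras.layers.Dense(latent_dim)

    def call(self, mel):
        # mel: [batch, frames, n_mels]
        h = self.encoder(mel)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + tf.exp(0.5 * logvar) * tf.random.normal(tf.shape(mu))  # sampled style vector
        kl = -0.5 * tf.reduce_mean(1.0 + logvar - tf.square(mu) - tf.exp(logvar))
        return z, kl  # add kl (scaled by a small weight) to the training loss
```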

@abylouw (Contributor) commented Feb 24, 2021

LightSpeech https://arxiv.org/abs/2102.04040

@nmfisher commented Feb 28, 2021

@abylouw an early version of LightSpeech is here: https://github.com/nmfisher/TensorFlowTTS/tree/lightspeech

It's training pretty well on a Mandarin dataset so far (~30k steps), but I haven't validated it formally against LJSpeech (to be honest, I don't think I'll get the time, so I'd prefer someone else to help out).

This is just the final architecture mentioned in the paper (so I haven't implemented any NAS).

Also, the paper only gives the final per-layer SeparableConvolution kernel sizes, not the number of attention heads, so I've emailed one of the authors to ask if he can provide that too.

Some samples at 170k steps (decoded with the pre-trained MB-MelGAN):

https://github.com/nmfisher/lightspeech_samples/tree/main/v1_170k

The quality is noticeably worse than FastSpeech 2 at the same number of training steps, and it falls apart on longer sequences.

@dathudeptrai (Collaborator, Author) commented:

> @nmfisher: @abylouw an early version of LightSpeech is here: https://github.com/nmfisher/TensorFlowTTS/tree/lightspeech […]

Great! :D How about the number of parameters in LightSpeech?

@nmfisher commented Mar 1, 2021

My early version of LightSpeech is:
[screenshot: parameter-count summary of the LightSpeech model]

By comparison, FastSpeech 2 (v1) is:

[screenshot: parameter-count summary of the FastSpeech 2 (v1) model]

But given the paper claims 1.8M parameters for LightSpeech (vs 27M for FastSpeech 2), my implementation obviously still isn't 100% accurate. Feedback from the authors will help clarify the number of attention heads (and also the hidden size of each head).

Also, I think the paper didn't implement a PostNet, so removing that layer immediately eliminates ~4.3M parameters.

@luan78zaoha (Contributor) commented:

@dathudeptrai @nmfisher I also tried to reduce the model size of FastSpeech2 (not including the PostNet module), with the parameters ordered by impact: encoder dim > 1D CNN > attention = number of stacks. Reducing the encoder dim is the most effective way to shrink the model. For the fastspeech2.baker.v2.yaml config, the model size was reduced from 64M to 28M, and the proportion of the PostNet module in the total model size increased from 27% to 62%. Interestingly, on the Baker dataset the quality does not get worse after deleting PostNet at inference. Thus, the final model size is only 10M. Based on these experiments, the model size may have the potential to be reduced further.

@dathudeptrai (Collaborator, Author) commented:

> @luan78zaoha: I also tried to reduce the model size of FastSpeech2 (not including the PostNet module) […]

Yeah, PostNet is only there for faster convergence; we can drop it after training.
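
To make that concrete, synthesis can simply take the pre-PostNet mel output. A minimal sketch, assuming the inference signature of the FastSpeech2 models in this repo (adjust the unpacking if your checkpoint returns different outputs):

```python
import tensorflow as tf


def synthesize_without_postnet(fastspeech2, input_ids):
    """Run FastSpeech2 inference and return the pre-PostNet mel, so the
    PostNet weights can be dropped from the deployed model entirely."""
    batch = tf.shape(input_ids)[0]
    mel_before, mel_after, durations, f0, energy = fastspeech2.inference(
        input_ids=input_ids,
        speaker_ids=tf.zeros([batch], dtype=tf.int32),
        speed_ratios=tf.ones([batch], dtype=tf.float32),
        f0_ratios=tf.ones([batch], dtype=tf.float32),
        energy_ratios=tf.ones([batch], dtype=tf.float32),
    )
    return mel_before  # feed this to the vocoder instead of mel_after
```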

@dathudeptrai (Collaborator, Author) commented:

@nmfisher 6M params is small enough. Did you get a good result with LightSpeech? How fast is it?

@luan78zaoha (Contributor) commented:

> @luan78zaoha: I also tried to reduce the model size of FastSpeech2 (not including the PostNet module) […]
>
> @dathudeptrai: Yeah, PostNet is only there for faster convergence; we can drop it after training.

Sorry, I haven't studied LightSpeech in detail, and I have a question: what are the detailed differences between a small FastSpeech and LightSpeech? @nmfisher

@dathudeptrai (Collaborator, Author) commented:

> Sorry, I haven't studied LightSpeech in detail, and I have a question: what are the detailed differences between a small FastSpeech and LightSpeech? @nmfisher

@luan78zaoha LightSpeech uses SeparableConvolution :D.

@luan78zaoha (Contributor) commented Mar 1, 2021

@dathudeptrai I used TF-Lite for inference on an x86 Linux platform. The result: the RTF of the 45M and 10M models was 0.018 and 0.01, respectively.
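
For anyone who wants to reproduce such a number, here is a minimal sketch of how RTF can be measured with the TF-Lite interpreter. The model path, input shape, and the 22050 Hz sample rate / 256 hop size are assumptions to be replaced with your own export; a model exported with extra inputs (speaker ID, speed/F0/energy ratios) needs those tensors set as well:

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="fastspeech2.tflite")
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy phoneme IDs; in practice use a real utterance of typical length.
phoneme_ids = np.random.randint(1, 100, size=(1, 50), dtype=np.int32)
interpreter.resize_tensor_input(input_details[0]["index"], phoneme_ids.shape)
interpreter.allocate_tensors()
interpreter.set_tensor(input_details[0]["index"], phoneme_ids)

start = time.perf_counter()
interpreter.invoke()
elapsed = time.perf_counter() - start

mel = interpreter.get_tensor(output_details[0]["index"])  # assumed [1, frames, n_mels]
audio_seconds = mel.shape[1] * 256 / 22050                # frames * hop_size / sample_rate
print("RTF:", elapsed / audio_seconds)
```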

@dathudeptrai (Collaborator, Author) commented:

> @luan78zaoha: I used TF-Lite for inference on an x86 Linux platform […]

Let's wait for @luan78zaoha to report the LightSpeech RTF :D.

@nmfisher commented Mar 1, 2021

> @luan78zaoha: I also tried to reduce the model size of FastSpeech2 (not including the PostNet module) […]
>
> @dathudeptrai: Yeah, PostNet is only there for faster convergence; we can drop it after training.
>
> @luan78zaoha: what are the detailed differences between a small FastSpeech and LightSpeech?

As @dathudeptrai mentioned, LightSpeech uses SeparableConvolution in place of regular convolution, but it also passes various FastSpeech2 configurations through neural architecture search to determine the best combination of kernel sizes, attention heads, and attention dimensions. Basically, they use NAS to find the smallest configuration that performs as well as FastSpeech2.
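
A minimal sketch of the substitution, using the position-wise feed-forward part of an FFT block as an example; the kernel and hidden sizes below are placeholders, not the NAS-searched values from the paper:

```python
import tensorflow as tf


def fastspeech_ffn(hidden_size=256, filter_size=1024, kernel_size=9):
    """Regular-convolution feed-forward block, as in FastSpeech 2."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(filter_size, kernel_size, padding="same", activation="relu"),
        tf.keras.layers.Conv1D(hidden_size, kernel_size, padding="same"),
    ])


def lightspeech_ffn(hidden_size=256, filter_size=1024, kernel_size=9):
    """Same block with depthwise-separable convolutions, which is the main
    per-layer parameter saving in LightSpeech."""
    return tf.keras.Sequential([
        tf.keras.layers.SeparableConv1D(filter_size, kernel_size, padding="same", activation="relu"),
        tf.keras.layers.SeparableConv1D(hidden_size, kernel_size, padding="same"),
    ])
```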

@debasish-mihup commented Mar 9, 2021

@dathudeptrai @xuefeng Can you help me implement HiFi-GAN with FastSpeech2 on Android? I have tried to implement it using the pretrained model from https://github.com/tulasiram58827/TTS_TFLite/tree/main/models and changing the linked line to handle the input model's data shape, but the output is pure noise.

@StuartIanNaylor commented Mar 18, 2021

Not really a request, just wondering about the use of librosa.
I have been playing around with https://github.com/google-research/google-research/tree/master/kws_streaming, which uses internal methods for MFCC computation.
The one it uses is the python.ops implementation, but tf.signal also gave quite a performance boost over librosa.
Is there any reason to use librosa over, say, tf.signal.stft and tf.signal.linear_to_mel_weight_matrix? They seem extremely performant.
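
For comparison, here is a minimal sketch of a log-mel front end built purely from tf.signal; the STFT and mel settings are illustrative and would have to be matched to the values in the repo's preprocessing configs:

```python
import tensorflow as tf


def log_mel_spectrogram(audio, sample_rate=22050, fft_length=1024,
                        frame_step=256, n_mels=80, fmin=80.0, fmax=7600.0):
    """audio: float32 waveform tensor of shape [samples]."""
    stft = tf.signal.stft(audio, frame_length=fft_length,
                          frame_step=frame_step, fft_length=fft_length)
    magnitude = tf.abs(stft)                       # [frames, fft_length // 2 + 1]
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=fft_length // 2 + 1,
        sample_rate=sample_rate,
        lower_edge_hertz=fmin,
        upper_edge_hertz=fmax,
    )
    mel = tf.matmul(magnitude, mel_matrix)         # [frames, n_mels]
    return tf.math.log(tf.maximum(mel, 1e-10))
```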

@Collin-Budrick commented:

> @dathudeptrai What do you think of voice cloning?

I have no doubt this project would work wonders for voice cloning.

@StuartIanNaylor commented:

Is it possible to convert the FastSpeech TFLite model to run on an Edge TPU? If so, are there any examples of how?

@zero15 commented Apr 14, 2021

Will Tacotron 2 support full integer quantization in TFLite?
The current model fails full-integer quantization with "pybind11::init(): factory function returned nullptr." It's likely because the model has multiple subgraphs.
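
For reference, this is the kind of full-integer conversion I'm attempting - a minimal sketch where the saved-model path and the representative input shape are placeholders, and whether a multi-subgraph model like Tacotron 2 converts at all is exactly the open question:

```python
import numpy as np
import tensorflow as tf


def representative_dataset():
    # Placeholder phoneme-ID batches; real text inputs should be used here.
    for _ in range(100):
        yield [np.random.randint(1, 100, size=(1, 50)).astype(np.int32)]


converter = tf.lite.TFLiteConverter.from_saved_model("tacotron2_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()  # this is the step that currently fails

with open("tacotron2_int8.tflite", "wb") as f:
    f.write(tflite_model)
```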

@ZDisket (Collaborator) commented Apr 30, 2021

@dathudeptrai Can you help with implementing a forced-alignment attention loss for Tacotron 2, as in this paper? I've managed to turn MFA durations into alignments and put them in the dataloader, but replacing the regular guided attention loss only makes the attention learning worse, both when fine-tuning and when training from scratch, according to eval results after 1k steps, whereas in the paper the PAG version should be winning.

@dathudeptrai (Collaborator, Author) commented:

@ZDisket let me read the paper first :D.

@ZDisket (Collaborator) commented May 4, 2021

@dathudeptrai Since that post I discovered that an MAE loss between the generated and forced attention does work to guide it, but it's so strong that it ends up hurting performance. That could be fixed with a low enough multiplier, like 0.01, although I haven't tested it extensively, as I abandoned it in favor of training a universal vocoder with a trick.
[attached: attention alignment plots - 1_alignment, 2_alignment]
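
For clarity, the loss I'm describing is just this (a minimal sketch; the 0.01 weight and the attention tensor shapes are my assumptions, not repo defaults):

```python
import tensorflow as tf


def forced_attention_loss(predicted_attention, forced_attention, weight=0.01):
    """MAE between the decoder's predicted attention weights and alignments
    derived from MFA durations, scaled down so it only nudges the attention."""
    # both tensors: [batch, decoder_steps, encoder_steps]
    return weight * tf.reduce_mean(tf.abs(predicted_attention - forced_attention))
```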

@alexdemartos commented:

This looks really interesting:

https://arxiv.org/pdf/2106.03167v2.pdf

@ZDisket (Collaborator) commented Jun 19, 2021

@tts-nlp That looks like an implementation of Algorithm 1. For the second and third, they mention a shift-time transform:

> In order to obtain the shift time transform, the convolution technique was applied after obtaining a DFT matrix or a Fourier basis matrix in most implementations. OLA was applied to obtain the inverse transform.

@rgzn-aiyun commented:

> @dathudeptrai What do you think of voice cloning?

Hey, I've seen a project about voice cloning recently.

@rgzn-aiyun commented:

This looks really interesting:

https://github.com/KuangDD/zhrtvc

@stale bot commented Sep 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

The stale bot added the wontfix label on Sep 16, 2021
The stale bot closed this issue as completed on Sep 23, 2021
@abylouw (Contributor) commented Apr 28, 2022

Anybody working on VQTTS?

@qxde01 commented Nov 29, 2022

I have tried FastSpeech2 voice cloning based on AISHELL-3 and other data, 200 speakers in total, but it didn't work well. Maybe I couldn't train a good speaker-embedding model, so I then used a WeNet/WeSpeaker pretrained model (Chinese) to extract the speaker embedding vectors, but that also worked badly. Has anyone tried this?

In addition, the TensorFlowTTS project is not very active; it has not been updated for more than a year.

@StuartIanNaylor commented:

> @qxde01: I have tried FastSpeech2 voice cloning based on AISHELL-3 and other data, 200 speakers in total, but it didn't work well […]

I've just been looking at WeNet but haven't really made an appraisal; so far it seems 'very Kaldi' :)
