Feature Request Thread #467

Closed
dathudeptrai opened this issue Jan 20, 2021 · 36 comments
Labels
enhancement 🚀 New feature or request · Feature Request 🤗 Feature support · help wanted 🧐 Extra attention is needed · wontfix

Comments

@dathudeptrai (Collaborator) commented Jan 20, 2021

Don't hesitate to tell me what features you want in this repo :)))

@dathudeptrai added the enhancement, Feature Request, and help wanted labels on Jan 20, 2021
@dathudeptrai pinned this issue on Jan 20, 2021
@unparalleled-ysj commented:

@dathudeptrai What do you think of voice cloning?

@mikus commented Jan 21, 2021

I would like to see better componentization. There are similar blocks (groups of layers) implemented multiple times, such as positional encoding, speaker encoding, or the postnet. Others rely on configuration specific to one particular network, like the self-attention block used in FastSpeech. With a little rework to make those blocks more generic, it would be easier to create new network types. The same goes for losses: for example, the HiFi-GAN training contains a lot of code duplicated from MB-MelGAN. Moreover, most of the training and inference scripts look quite similar, and I believe they could also be refactored so that, once again, the final solution is composed from more generic components.
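
To make the idea concrete, here is a minimal sketch of the kind of shared block I mean: one sinusoidal positional-encoding layer that every model could import instead of re-implementing it. The class name and defaults below are illustrative, not taken from the current code:

```python
import numpy as np
import tensorflow as tf


class SinusoidalPositionalEncoding(tf.keras.layers.Layer):
    """A single shared sinusoidal positional-encoding layer that any model
    (Tacotron-2, FastSpeech, ...) could reuse instead of re-implementing."""

    def __init__(self, max_length=2048, hidden_size=384, **kwargs):
        super().__init__(**kwargs)
        position = np.arange(max_length)[:, np.newaxis]
        div_term = np.exp(np.arange(0, hidden_size, 2) * -(np.log(10000.0) / hidden_size))
        table = np.zeros((max_length, hidden_size), dtype=np.float32)
        table[:, 0::2] = np.sin(position * div_term)
        table[:, 1::2] = np.cos(position * div_term)
        self.table = tf.constant(table[np.newaxis, :, :])  # [1, max_length, hidden]

    def call(self, inputs):
        # inputs: [batch, time, hidden]; add the encoding for the first `time` steps.
        return inputs + self.table[:, : tf.shape(inputs)[1], :]
```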

And BTW, I really appreciate your work and think you did a great job! :)

@dathudeptrai (Collaborator, Author) commented:

> the HiFi-GAN training contains a lot of code duplicated from MB-MelGAN

Hmm, in this case users just need to read and understand HiFi-GAN without reading MB-MelGAN.

@ZDisket (Collaborator) commented Jan 22, 2021

@unparalleled-ysj What do you mean by voice cloning? You mean zero-shot?

@unparalleled-ysj commented:

> @unparalleled-ysj What do you mean by voice cloning? You mean zero-shot?

For example, given a short segment of the target speaker's voice, the model should synthesize speech in that speaker's timbre without being retrained - for instance, by using voiceprint technology to extract a speaker embedding and training a multi-speaker TTS model on it.

@ZDisket (Collaborator) commented Jan 22, 2021

@unparalleled-ysj That's what I was thinking about. Relatedly, @dathudeptrai, I saw https://github.com/dipjyoti92/SC-WaveRNN - could an SC-MB-MelGAN be possible?

@luan78zaoha (Contributor) commented:

@unparalleled-ysj @ZDisket That is also what I'm doing. I'm trying to train a multi-speaker FastSpeech2 model, replacing the current hard-coded speaker ID with a bottleneck feature extracted by a voiceprint model. The continuous, soft-coded bottleneck feature represents a speaker-related space: if an unknown voice is similar to a voice in the training space, voice cloning may be possible. But judging from the results of current open-source projects, it is a difficult problem and certainly not as simple as I have described. Do you have any good ideas?
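
For illustration, the change I have in mind is roughly the following. This is only a minimal sketch assuming a Keras-style encoder output and a fixed-size voiceprint vector; the layer and argument names are hypothetical, not the repo's actual API:

```python
import tensorflow as tf


class SpeakerConditioning(tf.keras.layers.Layer):
    """Condition the decoder on a continuous speaker (voiceprint) embedding
    by projecting it and adding it to the encoder output, instead of looking
    up a hard-coded speaker ID in an embedding table."""

    def __init__(self, hidden_size=384, **kwargs):
        super().__init__(**kwargs)
        self.projection = tf.keras.layers.Dense(hidden_size)

    def call(self, encoder_output, speaker_embedding):
        # encoder_output: [batch, time, hidden]; speaker_embedding: [batch, emb_dim]
        speaker_bias = self.projection(speaker_embedding)       # [batch, hidden]
        return encoder_output + speaker_bias[:, tf.newaxis, :]  # broadcast over time
```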

@mikus commented Feb 15, 2021

One possible option for better support of multiple speakers or styles would be to add a Variational Auto-Encoder (VAE) that automatically extracts this voice/style "fingerprint".
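
To sketch what I mean (the layer sizes and latent dimension below are illustrative assumptions, not values from any paper or from this repo):

```python
import tensorflow as tf


class StyleVAE(tf.keras.layers.Layer):
    """Reference encoder that compresses a mel spectrogram into a latent
    voice/style vector using the VAE reparameterization trick."""

    def __init__(self, latent_dim=16, **kwargs):
        super().__init__(**kwargs)
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Conv1D(128, 3, strides=2, activation="relu"),
            tf.keras.layers.Conv1D(128, 3, strides=2, activation="relu"),
            tf.keras.layers.GlobalAveragePooling1D(),
        ])
        self.to_mu = tf.keras.layers.Dense(latent_dim)
        self.to_logvar = tf.keras.layers.Dense(latent_dim)

    def call(self, mel):
        # mel: [batch, frames, n_mels]
        h = self.encoder(mel)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + tf.exp(0.5 * logvar) * tf.random.normal(tf.shape(mu))  # sampled style vector
        kl = -0.5 * tf.reduce_mean(1.0 + logvar - tf.square(mu) - tf.exp(logvar))
        return z, kl  # add kl (scaled by a small weight) to the training loss
```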

@abylouw (Contributor) commented Feb 24, 2021

LightSpeech https://arxiv.org/abs/2102.04040

@nmfisher commented Feb 28, 2021

@abylouw an early version of LightSpeech is here: https://github.com/nmfisher/TensorFlowTTS/tree/lightspeech

It's training pretty well on a Mandarin dataset so far (~30k steps), but I haven't validated it formally against LJSpeech (to be honest, I don't think I'll get the time, so I'd prefer someone else to help out).

This is just the final architecture mentioned in the paper (so I haven't implemented any NAS).

Also, the paper only gives the final per-layer SeparableConvolution kernel sizes, not the number of attention heads, so I've emailed one of the authors to ask if he can provide that too.

Some samples at 170k steps (decoded with the pre-trained MB-MelGAN):

https://github.com/nmfisher/lightspeech_samples/tree/main/v1_170k

The quality is noticeably worse than FastSpeech 2 at the same number of training steps, and it falls apart on longer sequences.

@dathudeptrai (Collaborator, Author) commented:

> @nmfisher: @abylouw an early version of LightSpeech is here: https://github.com/nmfisher/TensorFlowTTS/tree/lightspeech […]

Great! :D How about the number of parameters in LightSpeech?

@nmfisher commented Mar 1, 2021

My early version of LightSpeech is:
[screenshot: parameter-count summary of the LightSpeech model]

By comparison, FastSpeech 2 (v1) is:

[screenshot: parameter-count summary of the FastSpeech 2 (v1) model]

But given the paper claims 1.8M parameters for LightSpeech (vs 27M for FastSpeech 2), my implementation obviously still isn't 100% accurate. Feedback from the authors will help clarify the number of attention heads (and also the hidden size of each head).

Also, I think the paper didn't implement a PostNet, so removing that layer immediately eliminates ~4.3M parameters.

@luan78zaoha (Contributor) commented:

@dathudeptrai @nmfisher I also tried to reduce the model size of FastSpeech2 (not including the PostNet module), with the parameters ordered by impact: encoder dim > 1D CNN > attention = number of stacks. Reducing the encoder dim is the most effective way to shrink the model. For the fastspeech2.baker.v2.yaml config, the model size was reduced from 64M to 28M, and the proportion of the PostNet module in the total model size increased from 27% to 62%. Interestingly, on the Baker dataset the quality does not get worse after deleting PostNet at inference. Thus, the final model size is only 10M. Based on these experiments, the model size may have the potential to be reduced further.

@dathudeptrai (Collaborator, Author) commented:

> @luan78zaoha: I also tried to reduce the model size of FastSpeech2 (not including the PostNet module) […]

Yeah, PostNet is only there for faster convergence; we can drop it after training.
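
To make that concrete, synthesis can simply take the pre-PostNet mel output. A minimal sketch, assuming the inference signature of the FastSpeech2 models in this repo (adjust the unpacking if your checkpoint returns different outputs):

```python
import tensorflow as tf


def synthesize_without_postnet(fastspeech2, input_ids):
    """Run FastSpeech2 inference and return the pre-PostNet mel, so the
    PostNet weights can be dropped from the deployed model entirely."""
    batch = tf.shape(input_ids)[0]
    mel_before, mel_after, durations, f0, energy = fastspeech2.inference(
        input_ids=input_ids,
        speaker_ids=tf.zeros([batch], dtype=tf.int32),
        speed_ratios=tf.ones([batch], dtype=tf.float32),
        f0_ratios=tf.ones([batch], dtype=tf.float32),
        energy_ratios=tf.ones([batch], dtype=tf.float32),
    )
    return mel_before  # feed this to the vocoder instead of mel_after
```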

@dathudeptrai (Collaborator, Author) commented:

@nmfisher 6M params is small enough. Did you get a good result with LightSpeech? How fast is it?

@luan78zaoha (Contributor) commented:

> @luan78zaoha: I also tried to reduce the model size of FastSpeech2 (not including the PostNet module) […]
>
> @dathudeptrai: Yeah, PostNet is only there for faster convergence; we can drop it after training.

Sorry, I haven't studied LightSpeech in detail, and I have a question: what are the detailed differences between a small FastSpeech and LightSpeech? @nmfisher

@dathudeptrai (Collaborator, Author) commented:

> Sorry, I haven't studied LightSpeech in detail, and I have a question: what are the detailed differences between a small FastSpeech and LightSpeech? @nmfisher

@luan78zaoha LightSpeech uses SeparableConvolution :D.

@luan78zaoha (Contributor) commented Mar 1, 2021

@dathudeptrai I used TF-Lite for inference on an x86 Linux platform. The result: the RTF of the 45M and 10M models was 0.018 and 0.01, respectively.
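
For anyone who wants to reproduce such a number, here is a minimal sketch of how RTF can be measured with the TF-Lite interpreter. The model path, input shape, and the 22050 Hz sample rate / 256 hop size are assumptions to be replaced with your own export; a model exported with extra inputs (speaker ID, speed/F0/energy ratios) needs those tensors set as well:

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="fastspeech2.tflite")
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy phoneme IDs; in practice use a real utterance of typical length.
phoneme_ids = np.random.randint(1, 100, size=(1, 50), dtype=np.int32)
interpreter.resize_tensor_input(input_details[0]["index"], phoneme_ids.shape)
interpreter.allocate_tensors()
interpreter.set_tensor(input_details[0]["index"], phoneme_ids)

start = time.perf_counter()
interpreter.invoke()
elapsed = time.perf_counter() - start

mel = interpreter.get_tensor(output_details[0]["index"])  # assumed [1, frames, n_mels]
audio_seconds = mel.shape[1] * 256 / 22050                # frames * hop_size / sample_rate
print("RTF:", elapsed / audio_seconds)
```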

@dathudeptrai (Collaborator, Author) commented:

> @luan78zaoha: I used TF-Lite for inference on an x86 Linux platform […]

Let's wait for @luan78zaoha to report the LightSpeech RTF :D.

@nmfisher commented Mar 1, 2021

> @luan78zaoha: I also tried to reduce the model size of FastSpeech2 (not including the PostNet module) […]
>
> @dathudeptrai: Yeah, PostNet is only there for faster convergence; we can drop it after training.
>
> @luan78zaoha: what are the detailed differences between a small FastSpeech and LightSpeech?

As @dathudeptrai mentioned, LightSpeech uses SeparableConvolution in place of regular convolution, but it also passes various FastSpeech2 configurations through neural architecture search to determine the best combination of kernel sizes, attention heads, and attention dimensions. Basically, they use NAS to find the smallest configuration that performs as well as FastSpeech2.
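
A minimal sketch of the substitution, using the position-wise feed-forward part of an FFT block as an example; the kernel and hidden sizes below are placeholders, not the NAS-searched values from the paper:

```python
import tensorflow as tf


def fastspeech_ffn(hidden_size=256, filter_size=1024, kernel_size=9):
    """Regular-convolution feed-forward block, as in FastSpeech 2."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(filter_size, kernel_size, padding="same", activation="relu"),
        tf.keras.layers.Conv1D(hidden_size, kernel_size, padding="same"),
    ])


def lightspeech_ffn(hidden_size=256, filter_size=1024, kernel_size=9):
    """Same block with depthwise-separable convolutions, which is the main
    per-layer parameter saving in LightSpeech."""
    return tf.keras.Sequential([
        tf.keras.layers.SeparableConv1D(filter_size, kernel_size, padding="same", activation="relu"),
        tf.keras.layers.SeparableConv1D(hidden_size, kernel_size, padding="same"),
    ])
```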

@debasish-mihup commented Mar 9, 2021

@dathudeptrai @xuefeng Can you help me implement HiFi-GAN with FastSpeech2 on Android? I have tried to implement it using the pretrained model from https://github.com/tulasiram58827/TTS_TFLite/tree/main/models and changing the linked line to handle the input model's data shape, but the output is pure noise.

@StuartIanNaylor commented Mar 18, 2021

Not really a request, just wondering about the use of librosa.
I have been playing around with https://github.com/google-research/google-research/tree/master/kws_streaming, which uses internal methods for MFCC computation.
The one it uses is the python.ops implementation, but tf.signal also gave quite a performance boost over librosa.
Is there any reason to use librosa over, say, tf.signal.stft and tf.signal.linear_to_mel_weight_matrix? They seem extremely performant.
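
For comparison, here is a minimal sketch of a log-mel front end built purely from tf.signal; the STFT and mel settings are illustrative and would have to be matched to the values in the repo's preprocessing configs:

```python
import tensorflow as tf


def log_mel_spectrogram(audio, sample_rate=22050, fft_length=1024,
                        frame_step=256, n_mels=80, fmin=80.0, fmax=7600.0):
    """audio: float32 waveform tensor of shape [samples]."""
    stft = tf.signal.stft(audio, frame_length=fft_length,
                          frame_step=frame_step, fft_length=fft_length)
    magnitude = tf.abs(stft)                       # [frames, fft_length // 2 + 1]
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=fft_length // 2 + 1,
        sample_rate=sample_rate,
        lower_edge_hertz=fmin,
        upper_edge_hertz=fmax,
    )
    mel = tf.matmul(magnitude, mel_matrix)         # [frames, n_mels]
    return tf.math.log(tf.maximum(mel, 1e-10))
```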

@Collin-Budrick commented:

> @dathudeptrai What do you think of voice cloning?

I have no doubt this project would work wonders for voice cloning.

@StuartIanNaylor commented:

Is it possible to convert the FastSpeech TFLite model to run on an Edge TPU? If so, are there any examples of how?

@zero15 commented Apr 14, 2021

Will Tacotron 2 support full integer quantization in TFLite?
The current model fails full-integer quantization with "pybind11::init(): factory function returned nullptr." It's likely because the model has multiple subgraphs.
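
For reference, this is the kind of full-integer conversion I'm attempting - a minimal sketch where the saved-model path and the representative input shape are placeholders, and whether a multi-subgraph model like Tacotron 2 converts at all is exactly the open question:

```python
import numpy as np
import tensorflow as tf


def representative_dataset():
    # Placeholder phoneme-ID batches; real text inputs should be used here.
    for _ in range(100):
        yield [np.random.randint(1, 100, size=(1, 50)).astype(np.int32)]


converter = tf.lite.TFLiteConverter.from_saved_model("tacotron2_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()  # this is the step that currently fails

with open("tacotron2_int8.tflite", "wb") as f:
    f.write(tflite_model)
```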

@ZDisket (Collaborator) commented Apr 30, 2021

@dathudeptrai Can you help with implementing a forced-alignment attention loss for Tacotron 2, as in this paper? I've managed to turn MFA durations into alignments and put them in the dataloader, but replacing the regular guided attention loss only makes the attention learning worse, both when fine-tuning and when training from scratch, according to eval results after 1k steps, whereas in the paper the PAG version should be winning.

@dathudeptrai (Collaborator, Author) commented:

@ZDisket let me read the paper first :D.

@ZDisket (Collaborator) commented May 4, 2021

@dathudeptrai Since that post I discovered that an MAE loss between the generated and forced attention does work to guide it, but it's so strong that it ends up hurting performance. That could be fixed with a low enough multiplier, like 0.01, although I haven't tested it extensively, as I abandoned it in favor of training a universal vocoder with a trick.
[attached: attention alignment plots - 1_alignment, 2_alignment]
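
For clarity, the loss I'm describing is just this (a minimal sketch; the 0.01 weight and the attention tensor shapes are my assumptions, not repo defaults):

```python
import tensorflow as tf


def forced_attention_loss(predicted_attention, forced_attention, weight=0.01):
    """MAE between the decoder's predicted attention weights and alignments
    derived from MFA durations, scaled down so it only nudges the attention."""
    # both tensors: [batch, decoder_steps, encoder_steps]
    return weight * tf.reduce_mean(tf.abs(predicted_attention - forced_attention))
```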

@alexdemartos commented:

This looks really interesting:

https://arxiv.org/pdf/2106.03167v2.pdf

@ZDisket (Collaborator) commented Jun 19, 2021

@tts-nlp That looks like an implementation of Algorithm 1. For the second and third, they mention a shift-time transform:

> In order to obtain the shift time transform, the convolution technique was applied after obtaining a DFT matrix or a Fourier basis matrix in most implementations. OLA was applied to obtain the inverse transform.

@rgzn-aiyun commented:

> @dathudeptrai What do you think of voice cloning?

Hey, I've seen a project about voice cloning recently.

@rgzn-aiyun commented:

This looks really interesting:

https://github.com/KuangDD/zhrtvc

@stale bot commented Sep 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

The stale bot added the wontfix label on Sep 16, 2021
The stale bot closed this issue as completed on Sep 23, 2021
@abylouw (Contributor) commented Apr 28, 2022

Anybody working on VQTTS?

@qxde01 commented Nov 29, 2022

I have tried FastSpeech2 voice cloning based on AISHELL-3 and other data, 200 speakers in total, but it didn't work well. Maybe I couldn't train a good speaker-embedding model, so I then used a WeNet/WeSpeaker pretrained model (Chinese) to extract the speaker embedding vectors, but that also worked badly. Has anyone tried this?

In addition, the TensorFlowTTS project is not very active; it has not been updated for more than a year.

@StuartIanNaylor commented:

> @qxde01: I have tried FastSpeech2 voice cloning based on AISHELL-3 and other data, 200 speakers in total, but it didn't work well […]

I've just been looking at WeNet but haven't really made an appraisal; so far it seems 'very Kaldi' :)
