
Quick (Singing) Voice Conversion #200

Open
wants to merge 3 commits into base: main
Conversation

CuiLvYing
Copy link

✨ Description

This is an implementation of a simple WebUI that provides quick, text-free, one-shot voice conversion for the uninitiated. Theoretically, the user only needs two short audio clips (source and target) and a few minutes to receive the VC result.
It uses a base model (checkpoint) trained on the VCTK and M4Singer datasets (or other supported datasets) as a foundation, then fine-tunes that base model with the input audio to perform the voice conversion and produce the output. It currently supports MultipleContentSVC and VITS.

🚧 Related Issues

None

πŸ‘¨β€πŸ’» Changes Proposed

Please refer to the commits.

πŸ§‘β€πŸ€β€πŸ§‘ Who Can Review?

@zhizhengwu @RMSnow @Adorable-Qin

πŸ›  TODO

βœ… Checklist

  • Code has been reviewed
  • Code complies with the project's code standards and best practices
  • Code has passed all tests
  • Code does not affect the normal use of existing features
  • Code has been commented properly
  • Documentation has been updated (if applicable)
  • Demo/checkpoint has been attached (if applicable)

@RMSnow
Copy link
Collaborator

RMSnow commented May 7, 2024

Hi @CuiLvYing, thanks for your efforts! Would you please attach some demos (such as the generated voices or your WebUI's video) like PR #56?

@CuiLvYing
Copy link
Author

Of course! Here are some demo videos and audio clips from testing.

1.mp4
2.mp4
source.mp4
result.5.mp4

https://github.com/open-mmlab/Amphion/assets/166400963/f752ea9d-a950-4831-bd30-ffd9fb6fd6f5

You can also try our running demo WebUI now: https://24a8ca30d15dff216c.gradio.live
This test uses MultipleContentSVC and takes at least 200 seconds to produce output. However, I think our pre-trained model checkpoint has some flaws (it is not trained enough) and may not perform well; sorry about that.

@CuiLvYing
Copy link
Author

Sorry, I found that the target audio we used was not uploaded. Here it is:

target.mp4

@RMSnow
Copy link
Collaborator

RMSnow commented May 7, 2024

Hi @CuiLvYing, I'm confused about your samples. For VC, the converted audio should speak the source's content with the target's timbre. Please use your model to convert the samples of PR #201. Then we can compare yours :)
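The convention described above can be pinned down with a toy sketch: the output keeps the source's linguistic content and the target's timbre. The `Utterance` type and field names here are purely illustrative, not anything from the Amphion codebase.

```python
# Toy illustration of the VC convention: output = source content + target timbre.
from dataclasses import dataclass

@dataclass
class Utterance:
    content: str  # what is said (linguistic content)
    timbre: str   # who it sounds like (speaker identity)

def voice_convert(source: Utterance, target: Utterance) -> Utterance:
    # Correct direction: take the content from the SOURCE,
    # and the timbre from the TARGET.
    return Utterance(content=source.content, timbre=target.timbre)

src = Utterance(content="hello world", timbre="speaker_A")
tgt = Utterance(content="goodbye", timbre="speaker_B")
out = voice_convert(src, tgt)
print(out)  # Utterance(content='hello world', timbre='speaker_B')
```

Swapping the two arguments reproduces the reversed behavior discussed below, where the source speaker ends up saying the target's content.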

@CuiLvYing
Copy link
Author

I see: we were attempting to make the person from the "Infsource" speak the content of the "target", which is the opposite of your definition. We'll amend this soon.
Here are some audio samples after correcting the WebUI:

source1.mp4
target1.mp4
result1.mp4
source2.mp4
target2.mp4
result2.mp4
source3.mp4
target3.mp4
result3.mp4

@RMSnow
Copy link
Collaborator

RMSnow commented May 8, 2024

The naturalness, especially the intelligibility, sounds poor to me. So I recommend not merging this PR unless there is a substantial improvement. @Adorable-Qin Please review the code and documentation carefully.
