
Add FreeVC implementation #201

Open: wants to merge 8 commits into main

@Nugine commented May 7, 2024

✨ Description

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

This PR is part of the AIR6063 final project.

FYI, we also have another repo that refactors the training pipeline. Both the PR code and the code in that repo can produce good checkpoints.

Here are our checkpoints, trained with the PR code on a single NVIDIA RTX 4090:

  • 120000 steps, 183 epochs, 14.53 hours for each ckpt (freevc, freevc-s, freevc-nosr)
  • 290000 steps, 443 epochs, 35.98 hours for each ckpt (freevc, freevc-s, freevc-nosr)
  • 300000 steps, 458 epochs, 36.53 hours for each ckpt (freevc)
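The figures above can be cross-checked with a little arithmetic. The snippet below reuses only the listed numbers and assumes the hours are wall-clock time for the whole run; `RUNS` and `throughput` are illustrative names, not part of the PR.

```python
# (steps, epochs, hours) per checkpoint, copied from the list above.
RUNS = {
    "120k": (120_000, 183, 14.53),
    "290k": (290_000, 443, 35.98),
    "300k": (300_000, 458, 36.53),
}


def throughput(steps: int, epochs: int, hours: float) -> dict:
    """Derive per-epoch and per-step figures, assuming `hours` is total wall-clock time."""
    return {
        "steps_per_epoch": steps / epochs,
        "steps_per_hour": steps / hours,
        "sec_per_step": hours * 3600 / steps,
    }


for name, run in RUNS.items():
    t = throughput(*run)
    print(
        f"{name}: {t['steps_per_epoch']:.0f} steps/epoch, "
        f"{t['steps_per_hour']:.0f} steps/h, {t['sec_per_step']:.3f} s/step"
    )
```

All three runs work out to roughly 655 steps per epoch and about 0.44 s per step, so the listed numbers are mutually consistent.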

🚧 Related Issues

During the project, we opened several issues and another PR to help improve Amphion.

👨‍💻 Changes Proposed

  • Add FreeVC to models
  • Add FreeVC to egs

🧑‍🤝‍🧑 Who Can Review?

@zhizhengwu @RMSnow @Adorable-Qin

✅ Checklist

  • Code has been reviewed
  • Code complies with the project's code standards and best practices
  • Code has passed all tests
  • Code does not affect the normal use of existing features
  • Code has been commented properly
  • Documentation has been updated (if applicable)
  • Demo/checkpoint has been attached (if applicable)

@SeanYouLaw commented May 7, 2024

Here are some examples of our results:

1_src.mp4
1_dst.mp4
1_output.mp4

2_src.mp4
2_dst.mp4
2_output.mp4

3_src.mp4
3_dst.mp4
3_output.mp4

@RMSnow (Collaborator) commented May 7, 2024

The samples sound good. @Adorable-Qin, please check the code and documentation carefully.

@RMSnow mentioned this pull request May 7, 2024
@SeanYouLaw commented May 8, 2024

Here are some examples of our results using our own checkpoint trained for 183 epochs (120k steps); the examples above came from the pretrained checkpoint:

1_src.mp4
1_tgt.mp4
1_output.mp4
2_src.mp4
2_tgt.mp4
2_output.mp4
3_src.mp4
3_tgt.mp4
3_output.mp4

@Nugine (Author) commented May 8, 2024

Our AutoDL server expires tomorrow, so here is a demo video recording the training status.

demo-video.mp4

@lmxue requested review from ArkhamImp, HarryHe11, and Adorable-Qin, and removed the request for HarryHe11 (May 10, 2024, 16:03)

@torch.no_grad()
def load_sample(self, filename):
    filepath = os.path.join(self.vctk_16k_dir, filename)
Collaborator:

Why is this path hard-coded? Could the dataset be selected in the config instead?

Author:

The original implementation trains only on the VCTK dataset:

  • Data preprocessing relies on the VCTK directory structure to retrieve speaker tags.
  • When splitting the train/val/test sets, each speaker's samples are split randomly, which ensures that every speaker appears in all three sets.
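A minimal sketch of the two preprocessing steps described above, assuming a VCTK-style layout in which each utterance sits in a directory named after its speaker (e.g. `p225/p225_001.wav`). The function names, the `n_val`/`n_test` counts, and the seed are hypothetical, not taken from the PR or the original FreeVC scripts.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def speaker_of(path: str) -> str:
    """Speaker tag from a VCTK-style path '<speaker>/<utterance>.wav' (assumed layout)."""
    return PurePosixPath(path).parent.name


def split_per_speaker(paths, n_val=2, n_test=10, seed=1234):
    """Shuffle and split each speaker's utterances independently, so every
    speaker ends up in train, val, and test. Counts and seed are illustrative."""
    by_speaker = defaultdict(list)
    for p in paths:
        by_speaker[speaker_of(p)].append(p)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for spk in sorted(by_speaker):  # fixed speaker order keeps the split reproducible
        files = sorted(by_speaker[spk])
        if len(files) < n_val + n_test + 1:
            raise ValueError(f"speaker {spk} has too few utterances for all three splits")
        rng.shuffle(files)
        val.extend(files[:n_val])
        test.extend(files[n_val:n_val + n_test])
        train.extend(files[n_val + n_test:])
    return train, val, test
```

With two speakers of 20 utterances each, this yields 16 training, 4 validation, and 20 test files, and both speakers appear in every split. Splitting per speaker rather than over the pooled file list is what guarantees the "every speaker in every set" property.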

It's possible to support other datasets if we can perform the same operations on them.
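One way the hard-coded `vctk_16k_dir` could become config-driven, sketched under assumptions: `DatasetConfig`, `SampleLocator`, and the `SPEAKER_RESOLVERS` registry are hypothetical names, not Amphion's actual config schema. Supporting another dataset would mean registering a function that recovers the speaker tag from a relative path, which is the VCTK-specific operation identified above.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict


def vctk_speaker(relpath: str) -> str:
    """VCTK-style relative path '<speaker>/<utterance>.wav' -> '<speaker>'."""
    return relpath.split("/")[0]


# Hypothetical registry: a new dataset adds an entry mapping its name to a
# function that extracts the speaker tag from a relative file path.
SPEAKER_RESOLVERS: Dict[str, Callable[[str], str]] = {"vctk": vctk_speaker}


@dataclass
class DatasetConfig:
    # Hypothetical fields, not Amphion's config schema.
    name: str
    root_dir: str


class SampleLocator:
    def __init__(self, cfg: DatasetConfig):
        if cfg.name not in SPEAKER_RESOLVERS:
            raise ValueError(f"no speaker resolver registered for dataset {cfg.name!r}")
        self.root_dir = cfg.root_dir
        self.speaker_of = SPEAKER_RESOLVERS[cfg.name]

    def path_of(self, filename: str) -> str:
        # Stands in for the hard-coded os.path.join(self.vctk_16k_dir, filename).
        return os.path.join(self.root_dir, filename)
```

For example, `SampleLocator(DatasetConfig("vctk", "/data/vctk-16k")).path_of("p225/p225_001.wav")` resolves the file under the configured root instead of a fixed directory.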

Collaborator:

Thank you for your explanation!
@RMSnow, for this implementation, do we expect a universal model that can be trained on any dataset?

Collaborator:

What's the purpose of this file?

Author:

The whole directory (models/vc/FreeVC/speaker_encoder) is copied from

We keep it unchanged to match the original implementation.
However, copying this much code and a pretrained checkpoint from another repo may be a problem. I'm not sure what the best practice is.

Collaborator:

Thank you for your explanation.
@RMSnow Any advice about this?

Collaborator:

Is it possible to support multi-GPU training with an external library such as Accelerate, which Amphion already uses?

Author:

We tried multi-GPU training in our other repo, using the Lightning framework to enable DDP training automatically, but it exits with an error soon after starting. Single-GPU training works well.


4 participants