Wav2Lip-HQ: high quality lip-sync

❗❗ This repository is deprecated and is no longer maintained. A lot has changed in the field since it was created, and many new tools have emerged. Please do not try to use this repository for practical purposes. ❗❗

This is an unofficial extension of the Wav2Lip: Accurately Lip-syncing Videos In The Wild repository. We use image super-resolution and face segmentation to improve the visual quality of lip-synced videos.

Acknowledgements

Our work is to a great extent based on the code from the following repositories:

The Wav2Lip repository provides the core model of our algorithm, which performs the lip-sync.
The face-parsing.PyTorch repository provides the model for face segmentation.
The BasicSR repository is used for super-resolution.
Finally, Wav2Lip depends heavily on the face_alignment repository for face detection.

The algorithm

Our algorithm consists of the following steps:

Pretrain ESRGAN on a video of the target person speaking.
Apply the Wav2Lip model to the source video and target audio, as is done in the official Wav2Lip repository.
Upsample the output of Wav2Lip with ESRGAN.
Use BiSeNet to change only the relevant pixels in the video.
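
The last step, changing only the relevant pixels, can be illustrated with a minimal NumPy sketch. It assumes a segmentation model has already produced a per-pixel mask with values in [0, 1] that marks the region to replace. The function name `composite_with_mask`, the array shapes, and the toy data are hypothetical and are not taken from this repository's code.

```python
import numpy as np


def composite_with_mask(original, generated, mask):
    """Blend `generated` into `original` where `mask` is non-zero.

    original, generated: uint8 arrays of shape (H, W, 3).
    mask: float array of shape (H, W) with values in [0, 1],
          e.g. derived from a face-segmentation output (assumption).
    """
    if original.shape != generated.shape:
        raise ValueError("frames must have the same shape")
    if mask.shape != original.shape[:2]:
        raise ValueError("mask must match the frame height and width")

    # Add a channel axis so the mask broadcasts over the colour channels.
    alpha = np.clip(mask.astype(np.float32), 0.0, 1.0)[..., None]
    blended = (alpha * generated.astype(np.float32)
               + (1.0 - alpha) * original.astype(np.float32))
    return np.rint(blended).astype(np.uint8)


# Toy example: a black "original" frame, a white "generated" frame,
# and a mask that selects only the lower half of the image.
original = np.zeros((4, 4, 3), dtype=np.uint8)
generated = np.full((4, 4, 3), 255, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.float32)
mask[2:, :] = 1.0

result = composite_with_mask(original, generated, mask)
```

In this toy case the upper half of `result` keeps the original pixels and the lower half takes the generated ones. A soft mask (values between 0 and 1) would give a gradual transition at the boundary.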

You can learn more about the method in this article (in Russian).

Results

Our approach is definitely not flawless, and some of the frames it produces contain artifacts or odd mistakes. However, it can be used to lip-sync high-quality videos with plausible output.

Running the model

The simplest way is to use our Google Colab demo. However, if you want to test the algorithm on your own machine, run the following commands. Note that you need Python 3 and CUDA installed.

Clone this repository and install requirements:

git clone https://github.com/Markfryazino/wav2lip-hq.git
cd wav2lip-hq
pip3 install -r requirements.txt

Download all the .pth files from here and place them in the checkpoints folder.

Apart from that, download the face detection model checkpoint:
```
wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth"
```
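
Before running inference, it can help to confirm that the checkpoint files are where the scripts expect them. The following is a minimal sketch, assuming the four paths used in the commands in this section are the ones that matter. The list `REQUIRED_CHECKPOINTS` and the helper `missing_checkpoints` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Paths taken from the commands in this README (assumption: these are the
# files the inference script needs); adjust if your layout differs.
REQUIRED_CHECKPOINTS = [
    "checkpoints/wav2lip_gan.pth",
    "checkpoints/face_segmentation.pth",
    "checkpoints/esrgan_yunying.pth",
    "face_detection/detection/sfd/s3fd.pth",
]


def missing_checkpoints(repo_root="."):
    """Return the required checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in REQUIRED_CHECKPOINTS if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for path in missing:
            print("  " + path)
    else:
        print("All checkpoint files found.")
```

Run it from the repository root; any path it lists still needs to be downloaded.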

Run the inference script:

python inference.py \
    --checkpoint_path "checkpoints/wav2lip_gan.pth" \
    --segmentation_path "checkpoints/face_segmentation.pth" \
    --sr_path "checkpoints/esrgan_yunying.pth" \
    --face <path to source video> \
    --audio <path to source audio> \
    --outfile <desired path to output>
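
If you prefer to call the script from Python, for example to process several videos in a loop, the command above can be assembled programmatically. This is only a sketch: the helper names `build_inference_command` and `run_inference` are hypothetical, and the only flags used are the ones shown in the command above.

```python
import subprocess
import sys


def build_inference_command(face, audio, outfile,
                            checkpoint_path="checkpoints/wav2lip_gan.pth",
                            segmentation_path="checkpoints/face_segmentation.pth",
                            sr_path="checkpoints/esrgan_yunying.pth"):
    """Assemble the argument list for inference.py using the flags shown above."""
    return [
        sys.executable, "inference.py",
        "--checkpoint_path", checkpoint_path,
        "--segmentation_path", segmentation_path,
        "--sr_path", sr_path,
        "--face", str(face),
        "--audio", str(audio),
        "--outfile", str(outfile),
    ]


def run_inference(face, audio, outfile, **kwargs):
    """Run inference.py from the repository root and raise if it fails."""
    cmd = build_inference_command(face, audio, outfile, **kwargs)
    subprocess.run(cmd, check=True)


# Example call (file names are placeholders, not files shipped with the repo):
# run_inference("videos/input.mp4", "audio/speech.wav", "results/output.mp4")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.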

Finetuning the super-resolution model

Although we provide a checkpoint of a pre-trained ESRGAN, its training dataset was quite modest, so the results may be insufficient. Hence, it can be useful to finetune the model on your target video. One or two minutes of speech is usually enough.

To simplify finetuning, we provide a Colab notebook. You can also run the commands listed there on your own machine: download the models, run inference while saving all the frames on the fly, resize the frames, and train ESRGAN.

Bear in mind that the procedure is quite time- and memory-consuming.
