Whisper to Normal Speech Conversion with SC-MelGAN and SC-VQ-VAE

This repository contains the source code for the paper Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech. The goal was to adapt MelGAN and VQ-VAE systems to convert whispered speech into normal speech.

The MelGAN code used as the basis for this project can be found here.

The VQ-VAE model is based on DeepMind's VQ-VAE implementation (see here), Andrej Karpathy's implementation, and this repo.

The WaveGlow system is a slightly adapted version of the code provided by NVIDIA.

Please visit our demo website for samples.

Structure of this Repository

The repo is structured as follows:

    .
    └── speech-conversion
        ├── melgan             -> Sources for training MelGAN models
        │   └── mel2wav
        ├── vqvae              -> Sources for training VQ-VAE models
        └── waveglow           -> Sources for training WaveGlow models
            └── tacotron2

Dataset

The code is designed to be used with the wTIMIT corpus. The corpus can be downloaded here (note: requires authentication). The wTIMIT dataset is sampled at 44 kHz and needs to be resampled to 16 kHz. The 16 kHz setting is hardcoded in several places in this project, so using a different sample rate without source code modifications will likely lead to errors.
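
The resampling can be done with any standard tool. A minimal sketch using sox, assuming the raw 44 kHz files sit in a placeholder directory wtimit_raw/, could look like this:

# Resample every WAV from the raw corpus directory to 16 kHz
# (wtimit_raw/ and wavs/ are placeholder paths).
mkdir -p wavs
for f in wtimit_raw/*.WAV; do
    sox "$f" -r 16000 "wavs/$(basename "$f")"
done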

Preparing the Dataset

Create a directory containing all samples, stored for example in a wavs/ subfolder. You'll need to provide filelists containing your training and test data. A simple way to create these filelists looks as follows:

ls wavs/*n.WAV | tail -n+10 > train_files.txt
ls wavs/*w.WAV | head -n10 > test_files.txt
ls wavs/*n.WAV | head -n10 > normal_test_files.txt   # normal test data for WaveGlow

Note that we only grab the whispered utterances (the ones with "w" at the end) for the test set.

Training

See the following scripts for examples of how to train MelGAN, VQ-VAE, and WaveGlow models:

  • train_melgan.sh
    • Add your own paths to the variables SAVE_PATH, DATA_PATH, and LOAD_PATH
  • train_vqvae.sh
    • Add your own paths to the variables SAVE_PATH, DATA_PATH, LOAD_PATH, and WG_PATH
  • train_waveglow.sh
    • Create your own config file for the WaveGlow model or use an existing one and point to it via the --config flag
    • Note that the original WaveGlow model is incompatible with the Mel spectrogram features generated for MelGAN and VQ-VAE training
    • Hence, using a pretrained WaveGlow model will not yield good results when spectrograms generated by the VQ-VAE are used as input
    • A training script that provides a compatible model can be found in speech-conversion/waveglow/train_melgan_comapt.py and is also referenced in train_waveglow.sh.

Note: The Python scripts need to be run with the -m command-line flag and without the .py extension (e.g. python -m app.sub1.mod1) because of the relative imports used across the sub-packages.
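
For instance, a training module located at speech-conversion/melgan/train.py (a hypothetical file name, used here only to illustrate the invocation pattern) would be started from the repository root like this:

cd speech-conversion
# hypothetical module name; see the train_*.sh scripts for the actual entry points
python -m melgan.train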

Inference

Inference can be done with the following scripts:

  • inference_melgan.sh
  • inference_vqvae.sh
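
As with the training scripts, any checkpoint and data paths inside these scripts will need to point to your own setup before they are run, for example:

# run from the repository root after adjusting the paths inside the script
bash inference_melgan.sh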