# SIDEKIT ASV VoxCeleb 1
To run the recipe:
```bash
# Activate your miniconda env
. ./path.sh
# Download the dataset to ./data
./local/data_prep.py --save-path ./data --download
# Create train data (recursive search for wavs); change `--from ./data` if you already downloaded the wavs elsewhere
./local/data_prep.py --from ./data --make-train-data # set --filter-dir if your data dir structure differs from the '--download' one (e.g. voxceleb1/wav/)
# Create test data
./local/data_prep.py --from ./data --make-test-data # set --filter-dir if your data dir structure differs from the '--download' one (e.g. voxceleb1_test/wav/)
# Train
./local/train.py --config configs/...
```
| Test Voxceleb-0 | Exp | Config |
| --- | --- | --- |
| EER / min cllr: 2.593 ± 0.0 / 0.106 | exp/asv_eval_vox1_ecapa_tdnn | configs/ecapa_tdnn |
| EER / min cllr: 2.089 ± 0.408 / 0.105 | exp/asv_eval_vox1_ecapa_tdnn_ft | configs/ecapa_tdnn_fine_tune |
| EER / min cllr: 2.413 ± 0.101 / 0.101 | exp/asv_eval_vox1_resnet | configs/resnet |
Note: on VCTK, the ResNet model seems to perform better.
Note: `ecapa_tdnn` converges faster.
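For reference, the EER reported above is the operating point where the false-rejection and false-acceptance rates are equal. A minimal sketch of computing it from target and non-target trial scores (the scores here are toy values, not from the recipe):

```python
import numpy as np

def eer(target, nontarget):
    """Equal Error Rate: sweep the threshold over all observed scores and
    return the point where false rejection and false acceptance meet."""
    thresholds = np.sort(np.concatenate([target, nontarget]))
    fr = np.array([(target < t).mean() for t in thresholds])      # false-rejection rate
    fa = np.array([(nontarget >= t).mean() for t in thresholds])  # false-acceptance rate
    i = np.argmin(np.abs(fr - fa))
    return (fr[i] + fa[i]) / 2

# Toy example with perfectly separated scores
tgt = np.array([2.0, 1.8, 1.5, 1.9])
non = np.array([0.1, 0.3, -0.2, 0.5])
print(eer(tgt, non))  # → 0.0
```

Real evaluations interpolate the ROC between thresholds; this coarse sweep is enough to illustrate the metric.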
Extracting an x-vector with the exported TorchScript model:

```python
import torch
import torchaudio

# Load one utterance from LibriSpeech dev-clean
waveform, _, text_gt, speaker, chapter, utterance = torchaudio.datasets.LIBRISPEECH(
    "/tmp", "dev-clean", download=True
)[0]

# Load the exported TorchScript model and extract the x-vector
model = torch.jit.load("__Exp_Path__/final.jit")
model = model.eval()
with torch.no_grad():
    _, x_vector = model(waveform)
```
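The extracted `x_vector` can then be scored against an enrollment embedding, typically with cosine similarity. A minimal sketch with random tensors standing in for real x-vectors (the 256-dim size is an assumption; the actual dimension depends on the model config):

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings for one trial; in practice these come from
# the model call above, one per utterance.
enroll = torch.randn(1, 256)  # x-vector of the enrollment utterance
test = torch.randn(1, 256)    # x-vector of the test utterance

score = F.cosine_similarity(enroll, test, dim=1).item()
print(score)  # in [-1, 1]; higher means more likely the same speaker
```

Thresholding these scores (or calibrating them into log-likelihood ratios) is what the EER and min cllr figures above summarize.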