awesome-speaker-embedding

A curated list of speaker embedding/verification resources

Must-read papers

[01] Deep Speaker: an End-to-End Neural Speaker Embedding System, Baidu Inc., 2017
[02] Text-Independent Speaker Verification Using 3D Convolutional Neural Networks, 2017
[03] Speaker Recognition from Raw Waveform with SincNet, Bengio team, raw waveform, 2018
[04] VoxCeleb2: Deep Speaker Recognition, VGG group, Interspeech 2018
[05] Generalized End-to-End Loss for Speaker Verification, Google, ICASSP 2018
[06] VoxCeleb: Large-scale speaker verification in the wild, VGG group, 2019
[07] Deep neural network embeddings for text-independent speaker verification, Interspeech 2017, original TDNN paper from Johns Hopkins, MFCC/frame-based/time-delay/multi-class, softmax + cross-entropy loss
[08] Robust DNN Embeddings for Speaker Recognition, ICASSP 2018, the x-vector paper from Johns Hopkins, based on TDNN, improved by adding noise and reverberation for augmentation
[09] Front-end factor analysis for speaker verification, 2011, IEEE TASLP, the 'i-vector' paper from Johns Hopkins
[10] TDNN-UBM: Time delay deep neural network-based universal background models for speaker recognition, 2015
[11] Deep neural networks for small footprint text-dependent speaker verification, the 'd-vector' paper from Johns Hopkins
[12] Analysis of Score Normalization in Multilingual Speaker Recognition, Interspeech 2017, the S-norm paper, useful for score normalization
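
To make the score normalization mentioned in [12] concrete, here is a minimal sketch of symmetric normalization (S-norm) on top of cosine scoring. It is an illustration only, not code from any of the listed papers: the function names, the toy cohort and the choice of population standard deviation are all assumptions, and the adaptive variants used in practice (top-N cohort selection) are left out.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def s_norm(enroll, test, cohort):
    """Symmetric score normalization (hypothetical helper).

    The raw score is z-normalized twice, once with the statistics of the
    enrollment embedding scored against the cohort and once with those of
    the test embedding, and the two results are averaged.
    """
    raw = cosine(enroll, test)
    enroll_scores = [cosine(enroll, c) for c in cohort]
    test_scores = [cosine(test, c) for c in cohort]
    z_enroll = (raw - mean(enroll_scores)) / pstdev(enroll_scores)
    z_test = (raw - mean(test_scores)) / pstdev(test_scores)
    return 0.5 * (z_enroll + z_test)


# Toy 2-dimensional embeddings, for illustration only.
cohort = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = s_norm([1.0, 0.2], [0.9, 0.3], cohort)
```

Because both sides are normalized and averaged, swapping the enrollment and test embeddings gives the same score, which is where the "symmetric" in the name comes from.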

Benchmarks (not very accurate)

Results reported (by the authors) on VoxCeleb1, VoxCeleb1-E and VoxCeleb1-H.

VoxCeleb1 public results (continuously updated)

Name	Feature, model, activation/loss	VoxCeleb1 EER	VoxCeleb1-E EER	VoxCeleb1-H EER	Link	Affiliation	Year
X205	DPN68,Res2Net50	0.7712%	0.8968%	1.637%	report	AISpeech	2020
Veridas	ResNet152	1.08%	-	-	report	das-nano	2020
DKU-DukeECE	ResNet,ECAPA-TDNN	0.888%	1.133%	2.008%	report	Duke University	2020
IDLAB	ResNet,ECAPA-TDNN	-	-	-	report	Ghent University	2020
speechbrain	ECAPA-TDNN	0.69%	-	-	link	-	2021
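
Assuming the percentages above are equal error rates (EER), the usual metric on the VoxCeleb1 trial lists, the following sketch shows one simple way to estimate an EER from trial scores. The function name and the toy scores are hypothetical. Toolkits typically interpolate the ROC/DET curve instead of picking the closest threshold, so their numbers can differ slightly.

```python
def compute_eer(target_scores, nontarget_scores):
    """Estimate the equal error rate by sweeping every observed score.

    A trial is accepted when its score is >= the threshold. The EER is
    taken at the threshold where the false acceptance rate (FAR) and the
    false rejection rate (FRR) are closest, as the mean of the two.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for th in thresholds:
        frr = sum(s < th for s in target_scores) / len(target_scores)
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: same-speaker trials should score higher than impostor trials.
targets = [0.9, 0.6, 0.4, 0.8]
nontargets = [0.1, 0.5, 0.3, 0.7]
print(compute_eer(targets, nontargets))  # 0.25
```

This brute-force sweep is quadratic in the number of trials, which is fine for a sanity check but slow for the full VoxCeleb1-E and VoxCeleb1-H lists.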

Must-read technical reports

VoxSRC 2019 reports

Datasets

Commonly-used speaker datasets:

TIMIT: A small dataset for speaker recognition and ASR, non-free
Free ST: Mandarin speech corpus for speaker recognition and ASR, free
NIST SRE: NIST Speaker Recognition Evaluation, non-free
AIShell-1: Mandarin speech corpus, divided into train/dev/test, free
AIShell-2: free for education, non-free for commercial use
AIShell-3: free, for speaker recognition, ASR and TTS
AIShell-4: will be released soon
HI-MIA: free, for far-field text-dependent speaker verification and keyword spotting
SITW: Speakers in the Wild
VoxCeleb 1&2: Celebrity interview video/audio extracted from YouTube
CN-Celeb 1&2: Multi-genre speaker dataset in the wild; utterances are from Chinese celebrities

Challenges

Great Talks / Tutorials

X-vectors: Neural Speech Embeddings for Speaker Recognition, Daniel Garcia-Romero, 2020
2020 Academic Symposium on Voiceprint Recognition Research and Applications (2020声纹识别研究与应用学术讨论会)

Code/Tools/Frameworks/Libraries

VGGVox: The first baseline system for the VoxCeleb dataset, originally implemented in MATLAB.
DeepSpeaker: An End-to-End Neural Speaker Embedding System.
SincNet: also available in SpeechBrain.
3D CNN: TensorFlow implementation of 3D Convolutional Neural Networks for Speaker Verification.
GE2E: also implemented in TensorFlow.
asv-subtools: An open-source toolkit based on PyTorch and Kaldi for speaker recognition/language identification, from XMU Speech Lab.
Resemblyzer: high-level representation of a voice through a deep learning model (referred to as the voice encoder).
voxceleb: audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.
Triplet-loss: Triplet Loss and Online Triplet Mining in TensorFlow.
Res2Net: The Res2Net architecture commonly used in the VoxCeleb Speaker Recognition Challenge.
voxceleb_trainer: A speaker recognition training framework written in PyTorch, with pretrained models.
SpeechBrain: VoxCeleb recipe.
Kaldi: Kaldi recipe for VoxCeleb.
pytorch_xvectors: PyTorch implementation of x-vectors.

More-recent papers

Attention Back-end: compares PLDA and cosine scoring with the proposed attention back-end; models: TDNN, ResNet; data: CN-Celeb

Winning solutions of challenges

VoxSRC2019

Rank 1: FBank, "r-vectors" using ResNet, AAM loss. From Brno University of Technology, REPORT
Rank 2: 80-dim FBank features, E-TDNN/F-TDNN models, various classification losses including softmax/AM-softmax/PLDA-softmax. From Johns Hopkins University, REPORT
Rank 3: FBank, ResNet + attentive pooling + phonetic attention, BLSTM + ResNet, loss not clearly stated. From Microsoft, REPORT

VoxSRC2020

Rank 1: 60-dim log-FBank, ECAPA-TDNN/SE-ResNet34, S-Norm, AAM-Softmax. From IDLab, REPORT
Rank 2: 40-dim FBank/mean-normalized, no VAD, ResNet/Res2Net, S-Norm, CM-Softmax. From AISpeech, REPORT, Kaldi recipe for data augmentation
Rank 3: Report not available
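
Several of the systems above train with AAM-Softmax (additive angular margin softmax). As a rough illustration of the idea, and not the code of any listed system, the sketch below computes the loss for a single embedding in plain Python. The function name and the margin and scale defaults are assumptions. Real implementations are batched and also handle the case where the angle plus the margin exceeds pi, which this sketch ignores.

```python
import math


def aam_softmax_loss(embedding, class_weights, label, margin=0.2, scale=30.0):
    """Cross-entropy over cosine logits with an angular margin on the target.

    Every logit is scale * cos(theta_k), where theta_k is the angle between
    the embedding and the weight vector of class k. For the true class the
    angle is enlarged by `margin` first, which makes the target harder to hit
    and pushes same-speaker embeddings closer together during training.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = unit(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, unit(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if k == label:
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)

    # Numerically stable log-sum-exp, then negative log-likelihood of the label.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[label]


# Toy example: two speaker classes, embedding halfway between them.
weights = [[1.0, 0.0], [0.0, 1.0]]
loss = aam_softmax_loss([1.0, 1.0], weights, label=0)
```

With `margin=0.0` the function reduces to ordinary softmax cross-entropy on scaled cosine similarities, which is a convenient way to check it.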

Please let me know if your code/repo is not listed here (ranchlai at 163.com)
