Speech Recognition Experiments

Experiments to check out different ASR/STT systems and evaluate their integration into the SEPIA STT-Server.

ASR engines:

  • Whisper org - The original Whisper version by OpenAI
  • Whisper TFlite - A TensorFlow Lite compatible Whisper port
  • Whisper Cpp - A small C++ port of Whisper
  • Whisper CT2 - An efficient and fast CTranslate2 port of Whisper
  • Sherpa ncnn - Next-gen Kaldi implementation for streaming ASR
  • Nvidia NeMo - A toolkit for various end-to-end ASR models and languages
  • Vosk - Fast, small, accurate (for clear audio), easy to customize. Works with classic Kaldi models. One of the core engines of SEPIA STT Server.

Wake-Word detection:

  • OpenWakeWord - A robust, open-source, neural-network-based wake-word detection framework with a focus on performance and simplicity (see the sketch below).
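
For orientation, here is a minimal detection-loop sketch assuming the openwakeword Python package with its bundled pre-trained models; the WAV file name, chunk size and score threshold are illustrative and not part of this repo's scripts (exact model-loading options differ between openWakeWord versions).

```python
import wave
import numpy as np
from openwakeword.model import Model

# Load the bundled pre-trained wake-word models (loading options differ between versions)
oww = Model()

# openWakeWord expects 16 kHz, 16-bit mono audio fed in small chunks (e.g. 80 ms = 1280 samples)
wf = wave.open("mic_recording_16k_mono.wav", "rb")  # hypothetical recording
CHUNK = 1280
while True:
    data = wf.readframes(CHUNK)
    if len(data) < CHUNK * 2:  # 2 bytes per 16-bit sample
        break
    frame = np.frombuffer(data, dtype=np.int16)
    scores = oww.predict(frame)  # dict: wake-word model name -> activation score
    for name, score in scores.items():
        if score > 0.5:  # illustrative threshold
            print(f"wake-word '{name}' detected (score {score:.2f})")
```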

Other great ASR engines already included in SEPIA:

  • Coqui STT - Successor of Mozilla's DeepSpeech project. End-to-end ASR with a CTC decoder and "optional" LMs.

Installation

  • Each ASR experiment folder has an install bash script; simply run bash install.sh.
  • Some folders contain additional scripts to download models; they are mentioned during installation.
  • After a successful installation use bash run-test.sh to run a default test. If the script uses Python, you need to activate the right virtual environment first: source venv/bin/activate.

Comments and Impressions

  • Whisper:
    • Whisper, in any form, is very accurate, but the missing streaming support is its biggest drawback.
    • Processing time does not scale linearly with audio length: unfortunately short files (<4s) need almost as much time to transcribe as longer ones (>10s), so the RTF (processing time divided by audio duration) is much worse for short inputs.
    • For Raspberry Pi 4 based voice assistants this usually means waiting >3s after finishing your input to get a result (bad UX).
    • An Orange Pi 5 with an optimized Whisper build is fast enough to run the 'tiny' model with good UX (usually <1.5s inference time for every input <30s).
    • Whisper CT2 currently seems to be the best version for Arm64/Aarch64 systems (RPi4 etc.): it is as fast as the TFlite version or even faster, smaller in size, works better with non-English languages and has a cleaner API (see the Whisper CT2 sketch after this list).
  • Sherpa ncnn:
    • Sherpa is very fast and supports streaming audio, but without a language model the WER is a bit high at the moment. Results look very promising though.
    • Example result (file 1, JFK speech): "AND SAW MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY".
    • UPDATED 2023.04.29: Included better English model.
  • Nvidia NeMo:
    • Nvidia NeMo's small models (e.g. 'en_conformer_ctc_small') are very fast and precise for clear and simple audio files (see the NeMo sketch after this list).
    • Unfortunately NeMo has no pre-trained models for streaming Conformer yet (2023.03.07).
    • Non-streaming inference is a bit faster than Sherpa-ncnn, but much more precise.
    • The test results below currently indicate that the quality is as good as Whisper, but more complicated vocabulary and noisy audio quickly show that Whisper still performs much better, even when compared to larger NeMo models.
    • NeMo can be tuned easily using (phoneme-free!) language models. Depending on your beam parameters (width, alpha, beta), accuracy for your LM vocabulary can increase dramatically, while it will drop for out-of-vocabulary words.
  • Vosk:
    • Vosk is very small, fast, supports streaming audio, and you can convert most of the classic Kaldi models to work with it (see the Vosk sketch after this list).
    • The small models are only ~50MB and surprisingly good, even for general dictation tasks ... if your input audio isn't too noisy and your vocabulary isn't too complicated.
    • The larger models are solid, but I rarely use them, because they are much slower, need more RAM and don't offer much better results in my everyday tests with the SEPIA assistant.
    • If you want good accuracy in a specific domain you should train your own language model. The Vosk homepage has some documentation, but for SEPIA I use the kaldi-adapt-lm repo.
    • Vosk with a custom LM is probably your best open-source ASR choice on low-end hardware.
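
To illustrate the Whisper CT2 comments above and the RTF numbers below, here is a minimal Whisper CT2 sketch assuming the faster-whisper package (one CTranslate2 port of Whisper); the model size, thread count and audio file mirror the benchmark setup, but the exact scripts in this repo may differ.

```python
import time
from faster_whisper import WhisperModel

# 'tiny' model, int8 quantization, CPU only, 4 threads - roughly the benchmark setup below
model = WhisperModel("tiny", device="cpu", compute_type="int8", cpu_threads=4)

start = time.time()
# Language is set explicitly, i.e. no language detection (as in the benchmarks)
segments, info = model.transcribe("en_speech_jfk_11s.wav", language="en", beam_size=1)
text = " ".join(seg.text.strip() for seg in segments)  # segments is a generator; consuming it runs inference
elapsed = time.time() - start

print(text)
print(f"time: {elapsed:.1f}s, RTF: {elapsed / info.duration:.2f}")  # RTF = processing time / audio duration
```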
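Similarly, a minimal NeMo sketch; the pre-trained NGC model name 'stt_en_conformer_ctc_small' and the transcribe call are assumptions based on the NeMo ASR collection, not taken from this repo's run scripts.

```python
import nemo.collections.asr as nemo_asr

# Download/load the pre-trained small English Conformer-CTC model from NGC
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_small")

# Offline (non-streaming) transcription of one or more 16 kHz mono WAV files
transcripts = asr_model.transcribe(["en_speech_jfk_11s.wav"])
print(transcripts[0])
```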
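And a minimal Vosk sketch that feeds a WAV file chunk by chunk, the way a live audio stream would be processed; the model directory path is an assumption (any downloaded Vosk model works).

```python
import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("models/vosk-model-small-en-us-0.15")  # path to a downloaded Vosk model (assumption)
wf = wave.open("en_speech_jfk_11s.wav", "rb")        # expects 16-bit mono PCM
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio in chunks, as one would with a live microphone stream
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result())["text"])      # finalized segment
print(json.loads(rec.FinalResult())["text"])         # flush remaining audio
```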

Benchmarks

Test notes:

  • File 1 is en_speech_jfk_11s.wav
  • File 2 is en_sh_lights_70pct_4s.wav
  • All Whisper tests are done without language detection!
  • Whisper TFlite (slim) uses the tflite_runtime package built with Bazel (faster than the default build!)
  • Whisper Cpp is built with default settings ('NEON = 1', 'BLAS = 0'), Whisper Cpp (BLAS) with OpenBLAS
  • Whisper CT2 uses the 'int8' model
  • Quality is a subjective impression of the transcribed result (TODO: replace with WER)
  • 'File' refers to the test file number above, 'Stream' marks streaming (+) vs. whole-file (-) transcription, 'Time' is the processing time and RTF is the real-time factor (processing time divided by audio duration)
  • Sherpa model small-2023-01-09 full name is conv-emformer-transducer-small-2023-01-09

Raspberry Pi 400 - Aarch64 - Debian Bullseye

Test date: 2023.02.17

| Engine                | Model            | File | Threads | Stream | Time | RTF  | Quality |
|-----------------------|------------------|------|---------|--------|------|------|---------|
| Whisper original      | tiny             | 1    | 4       | -      | 5.9s | 0.54 | perfect |
| Whisper original      | tiny             | 2    | 4       | -      | 4.3s | 1.19 | perfect |
| Whisper TFlite        | tiny.en          | 1    | 4       | -      | 4.1s | 0.37 | perfect |
| Whisper TFlite        | tiny.en          | 2    | 4       | -      | 3.4s | 0.94 | perfect |
| Whisper TFlite (slim) | tiny.en          | 1    | 4       | -      | 3.9s | 0.36 | perfect |
| Whisper TFlite (slim) | tiny.en          | 2    | 4       | -      | 3.2s | 0.90 | perfect |
| Whisper TFlite (slim) | tiny             | 1    | 4       | -      | 4.7s | 0.43 | perfect |
| Whisper TFlite (slim) | tiny             | 2    | 4       | -      | 3.8s | 1.06 | perfect |
| Whisper Cpp           | ggml-tiny        | 1    | 4       | -      | 9.1s | 0.83 | perfect |
| Whisper Cpp           | ggml-tiny        | 2    | 4       | -      | 8.6s | 2.39 | perfect |
| Whisper Cpp (BLAS)    | ggml-tiny        | 1    | 4       | -      | 8.4s | 0.76 | perfect |
| Whisper Cpp (BLAS)    | ggml-tiny        | 2    | 4       | -      | 8.0s | 2.22 | perfect |
| Whisper CT2           | whisper-tiny-ct2 | 1    | 4       | -      | 3.9s | 0.36 | perfect |
| Whisper CT2           | whisper-tiny-ct2 | 2    | 4       | -      | 3.2s | 0.90 | perfect |
| Sherpa ncnn           | small-2023-01-09 | 1    | 4       | +      | 2.0s | 0.18 | okayish |
| Sherpa ncnn           | small-2023-01-09 | 2    | 4       | +      | 0.6s | 0.18 | low     |

Test date: 2023.03.07

| Engine      | Model                  | File | Threads | Stream | Time | RTF  | Quality |
|-------------|------------------------|------|---------|--------|------|------|---------|
| Nvidia NeMo | en_conformer_ctc_small | 1    | 4       | -      | 1.1s | 0.10 | perfect |
| Nvidia NeMo | en_conformer_ctc_small | 2    | 4       | -      | 0.5s | 0.14 | perfect |

Orange Pi 5 8GB - Aarch64 - Armbian Bullseye (Kernel 5.10.110-rockchip-rk3588)

Test date: 2023.02.19

| Engine                | Model            | File | Threads | Stream | Time | RTF  | Quality |
|-----------------------|------------------|------|---------|--------|------|------|---------|
| Whisper original      | tiny             | 1    | 4       | -      | 3.0s | 0.27 | perfect |
| Whisper original      | tiny             | 2    | 4       | -      | 1.9s | 0.53 | perfect |
| Whisper TFlite (slim) | tiny             | 1    | 4       | -      | 1.4s | 0.13 | perfect |
| Whisper TFlite (slim) | tiny             | 2    | 4       | -      | 1.4s | 0.39 | perfect |
| Whisper Cpp (BLAS)    | ggml-tiny        | 1    | 4       | -      | 3.7s | 0.34 | perfect |
| Whisper Cpp (BLAS)    | ggml-tiny        | 2    | 4       | -      | 3.5s | 0.97 | perfect |
| Whisper CT2           | whisper-tiny-ct2 | 1    | 4       | -      | 1.3s | 0.12 | perfect |
| Whisper CT2           | whisper-tiny-ct2 | 2    | 4       | -      | 1.4s | 0.39 | perfect |

Test date: 2023.03.07

| Engine      | Model                  | File | Threads | Stream | Time | RTF  | Quality |
|-------------|------------------------|------|---------|--------|------|------|---------|
| Sherpa ncnn | small-2023-01-09       | 1    | 4       | +      | 0.6s | 0.05 | okayish |
| Sherpa ncnn | small-2023-01-09       | 2    | 4       | +      | 0.2s | 0.06 | low     |
| Nvidia NeMo | en_conformer_ctc_small | 1    | 4       | -      | 0.4s | 0.03 | perfect |
| Nvidia NeMo | en_conformer_ctc_small | 2    | 4       | -      | 0.2s | 0.06 | perfect |