The first version of Yodi is built on umini_speech, a mini speech dataset developed and currently being improved by the Umbaji team. It contains more than 800 augmented recordings from various speakers, in different settings, saying the same words as in the mini speech dataset but in Ewè.
The Yodi model was developed by applying the short-time Fourier transform (STFT) to the speech waveforms, yielding spectrograms that carry more meaningful, embeddable features for training. For the audio inputs, we consider the vector space of Ewè words.
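The exact STFT parameters used for Yodi are not given here; a minimal NumPy sketch of the waveform-to-spectrogram step, assuming a hypothetical 256-sample Hann window with 50% overlap and a synthetic tone standing in for an Ewè recording:

```python
import numpy as np

def stft_magnitude(waveform, frame_length=256, frame_step=128):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(waveform) - frame_length) // frame_step
    frames = np.stack([
        waveform[i * frame_step : i * frame_step + frame_length] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins (frame_length // 2 + 1 of them)
    return np.abs(np.fft.rfft(frames, axis=-1))

# Synthetic 1-second, 16 kHz sine wave as a stand-in for a real recording
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440.0 * t)
spec = stft_magnitude(wave)
print(spec.shape)  # (n_frames, n_freq_bins)
```

Each row of `spec` is one time frame; the per-frame magnitudes are what a model can embed more usefully than raw samples.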
For recognition, we consider the vector space of the word labels (transcripts).
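One common way to realize a label vector space is one-hot encoding over the vocabulary. A small sketch with placeholder English labels (the actual umini_speech vocabulary is Ewè and may differ):

```python
import numpy as np

# Placeholder vocabulary; the real dataset uses Ewè words
vocab = ["yes", "no", "up", "down"]
label_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Map a transcript label to its one-hot vector in the label space."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[label_to_index[word]] = 1.0
    return vec

print(one_hot("up"))  # [0. 0. 1. 0.]
```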
Consult the Yodi package here: https://github.com/Umbaji/Yodi