An emotion classifier built using standard audio data processing and deep learning algorithms. It draws on 4 different datasets with 12,000+ audio files in total and a wide range of voice actors, which helps the model generalize and avoids overfitting to any one accent. Due to the sheer complexity of SER (Speech Emotion Recognition), expect an accuracy of only about 60-70%. However, we've tried to give you a brief comparison of how various design decisions affect the accuracy.
Datasets used:
- Crowd-sourced Emotional Multimodal Actors Dataset (Crema-D)
- Ryerson Audio-Visual Database of Emotional Speech and Song (Ravdess)
- Surrey Audio-Visual Expressed Emotion (Savee)
- Toronto emotional speech set (Tess)
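
Combining the four datasets means mapping each one's file naming onto a shared set of emotion labels. The sketch below shows one way this could be done. It assumes the file naming used in the commonly distributed copies of these datasets, so check it against your own download. The function name, dataset keys, and unified label names are hypothetical and not taken from this project.

```python
import re
from pathlib import PurePath

# Assumed filename conventions (verify against your copy of each dataset):
#   Ravdess: 03-01-06-01-02-01-12.wav -> third dash-separated field is the emotion code
#   Crema-D: 1001_DFA_ANG_XX.wav      -> third underscore-separated field
#   Tess:    YAF_dog_ps.wav           -> last underscore-separated field
#   Savee:   JE_sa01.wav              -> letters before the take number
RAVDESS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fear", "07": "disgust", "08": "surprise"}
CREMA_D = {"ANG": "angry", "DIS": "disgust", "FEA": "fear",
           "HAP": "happy", "NEU": "neutral", "SAD": "sad"}
TESS = {"angry": "angry", "disgust": "disgust", "fear": "fear", "happy": "happy",
        "neutral": "neutral", "ps": "surprise", "sad": "sad"}
SAVEE = {"a": "angry", "d": "disgust", "f": "fear", "h": "happy",
         "n": "neutral", "sa": "sad", "su": "surprise"}


def emotion_from_filename(dataset, filename):
    """Return a unified emotion label for one audio file (hypothetical helper)."""
    stem = PurePath(filename).stem  # drop directories and the .wav extension
    if dataset == "ravdess":
        return RAVDESS[stem.split("-")[2]]
    if dataset == "crema-d":
        return CREMA_D[stem.split("_")[2]]
    if dataset == "tess":
        return TESS[stem.split("_")[-1].lower()]
    if dataset == "savee":
        code = re.match(r"[a-z]+", stem.split("_")[-1]).group()
        return SAVEE[code]
    raise ValueError(f"unknown dataset: {dataset}")


print(emotion_from_filename("ravdess", "03-01-06-01-02-01-12.wav"))
print(emotion_from_filename("savee", "JE_sa01.wav"))
```

Keeping one lookup table per dataset makes it easy to drop or merge labels (for example, folding "calm" into "neutral") when comparing how such decisions affect accuracy.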
Algorithm used: a Sequential model with 1D convolution (Conv1D) and max-pooling layers
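
To make the two layer types concrete, here is a rough sketch of what a single 1D convolution filter followed by max pooling computes on a feature sequence. It is plain Python for illustration only, not this project's model code. The ReLU activation, the kernel values, and the pool size are assumptions. A real model would use a deep learning library's layers, with many filters whose weights are learned.

```python
def conv1d_relu(signal, kernel, bias=0.0):
    """Slide one filter over the sequence ("valid" padding, stride 1), then apply ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        window_sum = sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        out.append(max(0.0, window_sum))  # ReLU: negative responses become 0
    return out


def max_pool1d(values, pool_size=2):
    """Keep the largest value in each non-overlapping window of pool_size."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


# Toy feature sequence (stand-in for features extracted from an audio clip).
signal = [0, 1, 3, 2, 0, 1, 4, 1]
kernel = [1, 0, -1]  # illustrative fixed filter; real filters are learned

features = conv1d_relu(signal, kernel)
pooled = max_pool1d(features, pool_size=2)
print(features)  # six values: one per filter position
print(pooled)    # three values: the sequence length is halved
```

The convolution picks out local patterns in the sequence, and pooling shortens it while keeping the strongest responses, which is why the two are typically stacked in alternation.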