Studying the outcomes of employing automatic segmentation strategies on end-to-end and cascaded speech translation models

mihaitudor9/Segmentation_Strategies_Speech_Translation

Segmentation Strategies

This project compares three popular voice activity detection (VAD) toolkits and studies how automatic segmentation affects a state-of-the-art multilingual end-to-end translation model compared with a cascaded one.

Manual segmentations and reference translations can be found in the corresponding sets. The automatic segmentations were produced with the three toolkits on a local machine. MFCC features were then extracted on the same machine with the Kaldi toolkit. The resulting features (covering all the different segmentations) were uploaded to Google Drive. Finally, in a GPU-accelerated Google Colab session, the end-to-end and cascaded models were run on each segmentation.
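To illustrate what a segmentation is in this pipeline, here is a minimal Python sketch that turns per-frame speech/non-speech decisions into time segments. It is a toy stand-in and not the logic of any of the three toolkits; the function name `frames_to_segments` and the parameters `frame_s` and `max_gap_frames` are hypothetical, chosen only for this example.

```python
from typing import List, Tuple


def frames_to_segments(
    is_speech: List[bool],
    frame_s: float = 0.03,      # assumed frame length in seconds (hypothetical default)
    max_gap_frames: int = 0,    # assumed merge tolerance, in frames (hypothetical)
) -> List[Tuple[float, float]]:
    """Group consecutive speech frames into (start, end) segments in seconds."""
    runs = []
    start = None
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i                      # a speech run begins
        elif not flag and start is not None:
            runs.append((start, i))        # a speech run ends (end is exclusive)
            start = None
    if start is not None:
        runs.append((start, len(is_speech)))

    # Merge runs separated by a silence of at most max_gap_frames frames.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] <= max_gap_frames:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    return [(round(s * frame_s, 3), round(e * frame_s, 3)) for s, e in merged]


flags = [False, True, True, False, True, False, False, True]
strict = frames_to_segments(flags, frame_s=0.5)
lenient = frames_to_segments(flags, frame_s=0.5, max_gap_frames=1)
print(len(strict), len(lenient))
```

Bridging short silences lowers the number of segments for the same audio, which is why a single toolkit parameter can change the segment count that the translation models receive.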


Results

Dominating segmentation strategy when employing the end-to-end translation model



| Language pair | Best segmentation | Segments count | Segments difference* | BLEU score difference* |
|---|---|---|---|---|
| pt-es_test | voxseg -s 0.90 | 1294 | 23.5% | 17.7% |
| pt-es_valid | voxseg -s 0.95 | 1139 | 11.7% | 18.1% |
| it-en_test | voxseg -s 0.95 | 1223 | 22.2% | 14.9% |
| it-en_valid | webrtcvad -p 2 | 1075 | 14.4% | 9.8% |
| it-es_test | voxseg -s 0.95 / inaspeech -r 0.15 / inaspeech -r 0.20 | 1223 / 1228 / 1335 | 22.2% / 22.6% / 30.8% | 15.9% |
| it-es_valid | webrtcvad -p 2 | 1075 | 14.4% | 11.7% |
| es-en_test | webrtcvad -p 0 | 1116 | 11.4% | 6.5% |
| es-en_valid | webrtcvad -p 0 / webrtcvad -p 1 | 1082 / 1117 | 11.4% / 14.6% | 9.5% |
| pt-en_test | voxseg -s 0.90 | 1294 | 23.5% | 15.1% |
| pt-en_valid | voxseg -s 0.95 / inaspeech -r 0.05 | 1139 / 1199 | 11.7% / 16.8% | 16.5% |

Table displaying the best-found segmentation toolkit, the corresponding parameter, and the number of segments created. *The table also shows the percentage difference in segment count and BLEU score compared with the values obtained by the end-to-end translation model when using the manual segmentation.
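The README does not state which formula lies behind the "Difference" columns. As a hedged sketch, the two common definitions of a percentage difference can be written as follows. Both function names are hypothetical, and the counts in the usage lines are made-up illustration values, not data from this project.

```python
def symmetric_pct_diff(a: float, b: float) -> float:
    """Percentage difference relative to the mean of the two values."""
    return abs(a - b) / ((a + b) / 2) * 100


def relative_pct_diff(value: float, reference: float) -> float:
    """Signed percentage change of `value` relative to `reference`."""
    return (value - reference) / reference * 100


# Made-up counts for illustration only (not taken from the tables).
automatic_segments = 1100
manual_segments = 1000

print(round(symmetric_pct_diff(automatic_segments, manual_segments), 1))
print(round(relative_pct_diff(automatic_segments, manual_segments), 1))
```

The two definitions diverge as the gap between the values grows, so the choice matters when comparing rows with large differences.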

Dominating segmentation strategy when employing the cascaded translation model




| Language pair | Best segmentation | Segments count | Segments difference* | BLEU score difference* |
|---|---|---|---|---|
| pt-es_test | voxseg -s 0.90 | 1294 | 23.5% | 19.3% |
| pt-es_valid | inaspeech -r 0.05 | 1199 | 16.8% | 19.0% |
| it-en_test | inaspeech -r 0.15 / inaspeech -r 0.20 | 1228 / 1335 | 22.6% / 30.8% | 13.1% |
| it-en_valid | webrtcvad -p 2 | 1075 | 14.4% | 11.0% |
| it-es_test | inaspeech -r 0.15 | 1228 | 22.6% | 16.6% |
| it-es_valid | voxseg -s 0.90 / webrtcvad -p 2 | 991 / 1075 | 6.2% / 14.4% | 12.8% |
| es-en_test | webrtcvad -p 0 | 1116 | 11.4% | 6.0% |
| es-en_valid | webrtcvad -p 1 | 1117 | 14.6% | 7.7% |
| pt-en_test | voxseg -s 0.90 | 1294 | 23.5% | 16.6% |
| pt-en_valid | inaspeech -r 0.05 | 1199 | 16.8% | 17.4% |

Table displaying the best-found segmentation toolkit, the corresponding parameter, and the number of segments created. *The table also shows the percentage difference in segment count and BLEU score compared with the values obtained by the cascaded translation model when using the manual segmentation.
