SentencePiece unigram vs BPE for Citrinet-like models #2129
jprobichaud started this conversation in General
I'm curious what people think makes more sense: for CTC-based models like Citrinet, would a SentencePiece BPE tokenizer be better than a unigram tokenizer?

I have some evidence suggesting that unigram tokenizers pair well with attention decoders but make life much harder for a CTC decoder.

Has anyone noticed this too? Is there any potential justification for it?
Reply:

Good question! We use WPE for the LibriSpeech models, but unigram for most of the NeMo checkpoint releases. We did experiment with SentencePiece BPE and found no significant difference in WER on LibriSpeech compared to either unigram or WPE. However, those weren't extensive analyses, so there is some margin for noise. Overall, it would be interesting to know whether SentencePiece unigram does better than BPE for a particular dataset or use case.
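For anyone who wants to run this comparison themselves, here is a minimal sketch using the sentencepiece Python package directly. The corpus file name and vocabulary size are illustrative assumptions, not values from this thread, and NeMo's own tokenizer-building scripts may wrap these calls differently:

```python
import sentencepiece as spm

# Train two tokenizers on the same corpus, varying only the model type.
# "transcripts.txt" (one utterance per line) and vocab_size=1024 are
# illustrative assumptions, not values taken from this discussion.
for model_type in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="transcripts.txt",
        model_prefix=f"tokenizer_{model_type}",
        vocab_size=1024,
        model_type=model_type,
        character_coverage=1.0,  # typical setting for English ASR text
    )

# Inspect how the two schemes segment the same sentence.
for model_type in ("unigram", "bpe"):
    sp = spm.SentencePieceProcessor(model_file=f"tokenizer_{model_type}.model")
    print(model_type, sp.encode("the quick brown fox", out_type=str))
```

Trained on the same text with the same vocabulary size, the unigram and BPE models will generally segment the same sentence differently, which is exactly the variable being compared when measuring WER differences between the two tokenizer types.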