Wav2Vec2CTCTokenizer adds random unknown tokens to encoded input #30561

Open
tshmak opened this issue Apr 30, 2024 · 2 comments
Labels
Core: Tokenization (Internals of the library; Tokenization)

Comments

@tshmak

tshmak commented Apr 30, 2024

System Info

  • transformers version: 4.29.2
  • Platform: Linux-5.15.0-92-generic-x86_64-with-glibc2.35
  • Python version: 3.11.8
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • PyTorch version (GPU?): 2.2.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Example script:

import json
from transformers import Wav2Vec2CTCTokenizer

phones = 'ɔː,(en),oɜ,inɡɜ,o4,b,f,enɡ1,oi2,aa7,eɪ,eː,au7,aaiɜ,onɡ4,oe6,uiɜ,ɒ,iə,c,aa2,oenɡ1,ei7,oenɡ6,au1,ŋ5,iu5,aɪə,ou4,d,ai7,k,i2,eoi5,aai2,j,oenɡɜ,u1,ŋ4,i,m,oi6,unɡɜ,ou2,au2,p,yu1,a,yu4,onɡ1,ɛ,e5,əʊ,ou6,yu5,aɜ,oi1,onɡ5,ai5,aau5,inɡ5,ai1,eɜ,ei5,uɜ,o2,i5,nɡ6,enɡ4,ɐ,l,o1,iu4,enɡ6,ou5,onɡ7,anɡ1,tʃ,aau2,eo6,aa6,iː,enɡ7,oenɡ5,ŋ,aau1,u5,eo5,yu7,oi7,aaɜ,oiɜ,yu2,aa5,ɑː,oe1,n,eoi2,ui2,oenɡ2,inɡ1,anɡ4,t,au4,ei4,u2,aanɡ2,ui4,dʒ,[PAD],a1,e,oenɡ7,aau4,onɡɜ,eoi6,unɡ5,ɹ,e6,yu6,ɪ,ʃ,ei2,aauɜ,enɡɜ,unɡ1,aɪ,i6,eiɜ,aanɡ1,inɡ6,iu1,o5,ui1,inɡ2,unɡ4,eoi4,eo4,uː,ei1,oenɡ4,aa4,aanɡ7,a2,e4,enɡ2,a5,auɜ,iɜ,əl,ai6,iu2,a4,e2,ouɜ,eoi1,anɡ2,[UNK],h,onɡ6,aau6,nɡ5,nɡ4,enɡ5,oeɜ,inɡ4,a6,eoiɜ,e1,ʊ,i1,o7,z,au6,ai4,anɡ6,aai1,oi5,aʊ,v,iu6,unɡ7,au5,eoɜ,aanɡ6,ou1,aanɡ5,(zhy),anɡɜ,oi4,onɡ2,a7,w,ui5,ui6,oe5,unɡ6,aanɡ4,ɔɪ,inɡ7,ɡ,s,o6,aa1,u6,aai4,ʌ,ou7,yuɜ,ɜː,ei6,aiɜ,ə,anɡ7,ai2,u4,iu7,iuɜ,eo1,aai6,eo2,i4,i7,aai5,unɡ2'.split(',')

# Build a vocab file mapping each phone to an integer id
phones_dict = {x: i for i, x in enumerate(phones)}
with open('test.json', 'w') as f:
    json.dump(phones_dict, f, indent=4, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer('test.json', unk_token='[UNK]', pad_token='[PAD]')

# Encode a space-separated phone sequence, then decode it back
text = 'ɡ ei2 j a4 n ɡ eɜ p anɡ4 j au5 t aaɜ n z o2 j a1 t h au2 h eiɜ'
print(tokenizer(text))
print(tokenizer.decode(tokenizer(text)['input_ids'], spaces_between_special_tokens=True))

Output:

{'input_ids': [200, 122, 35, 152, 96, 157, 200, 62, 45, 101, 35, 182, 102, 90, 96, 157, 172, 65, 35, 110, 102, 157, 158, 44, 158, 128], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
ɡ ei2 j a4 n [UNK] ɡ eɜ p anɡ4 j au5 t aaɜ n [UNK] z o2 j a1 t [UNK] h au2 h eiɜ

Expected behavior

The unknown token (id 157, i.e. [UNK]) should not be part of the encoded input, since every phone in text is in the vocabulary.

@ArthurZucker
Collaborator

ArthurZucker commented Apr 30, 2024

>>> print(tokenizer.tokenize(text))
['ɡ', 'ei2', 'j', 'a4', 'n', '|', 'ɡ', 'eɜ', 'p', 'anɡ4', 'j', 'au5', 't', 'aaɜ', 'n', '|', 'z', 'o2', 'j', 'a1', 't', '|', 'h', 'au2', 'h', 'eiɜ']

The [UNK] comes from the '|' (the tokenizer's default word_delimiter_token), which does not seem to be part of your vocab:

>>> '|' in tokenizer.get_vocab()
False
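
A possible workaround (a minimal sketch, reusing phones_dict and text from the reproduction script above) is to make the delimiter a real vocabulary entry, so that any surviving spaces map to '|' instead of [UNK]:

# Workaround sketch: add the default word delimiter '|' to the vocab
# before building the tokenizer.
phones_dict['|'] = len(phones_dict)
with open('test.json', 'w') as f:
    json.dump(phones_dict, f, indent=4, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer('test.json', unk_token='[UNK]', pad_token='[PAD]')
print('|' in tokenizer.get_vocab())   # True
print(tokenizer(text)['input_ids'])   # the [UNK] id (157) no longer appears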

ArthurZucker added the Core: Tokenization label on Apr 30, 2024
@tshmak
Author

tshmak commented May 7, 2024

But why does it insert '|' at seemingly random locations? As you can see, there are no word boundaries in text.
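
Looking at the tokenized output above, the '|' actually seems to appear exactly where both neighbouring phones are single-character vocab entries; my guess (unverified) is that the multi-character entries are registered as added tokens that strip the spaces around them, so only the remaining spaces get mapped to the word delimiter. For example:

# Observation sketch, reusing the tokenizer built in the reproduction script:
# multi-character entries appear to absorb the surrounding spaces, while a space
# between two single-character tokens survives and becomes '|'.
print(tokenizer.tokenize('j a4 n'))    # ['j', 'a4', 'n']        -> no '|'
print(tokenizer.tokenize('n ɡ eɜ'))    # ['n', '|', 'ɡ', 'eɜ']   -> '|' kept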
