Wav2Vec2CTCTokenizer adds random unknown tokens to encoded input #30561

Open
tshmak opened this issue Apr 30, 2024 · 2 comments
Labels
Core: Tokenization (Internals of the library; Tokenization)

Comments

@tshmak

tshmak commented Apr 30, 2024

System Info

  • transformers version: 4.29.2
  • Platform: Linux-5.15.0-92-generic-x86_64-with-glibc2.35
  • Python version: 3.11.8
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • PyTorch version (GPU?): 2.2.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Example script:

import json
from transformers import Wav2Vec2CTCTokenizer

phones = 'ɔː,(en),oɜ,inɡɜ,o4,b,f,enɡ1,oi2,aa7,eɪ,eː,au7,aaiɜ,onɡ4,oe6,uiɜ,ɒ,iə,c,aa2,oenɡ1,ei7,oenɡ6,au1,ŋ5,iu5,aɪə,ou4,d,ai7,k,i2,eoi5,aai2,j,oenɡɜ,u1,ŋ4,i,m,oi6,unɡɜ,ou2,au2,p,yu1,a,yu4,onɡ1,ɛ,e5,əʊ,ou6,yu5,aɜ,oi1,onɡ5,ai5,aau5,inɡ5,ai1,eɜ,ei5,uɜ,o2,i5,nɡ6,enɡ4,ɐ,l,o1,iu4,enɡ6,ou5,onɡ7,anɡ1,tʃ,aau2,eo6,aa6,iː,enɡ7,oenɡ5,ŋ,aau1,u5,eo5,yu7,oi7,aaɜ,oiɜ,yu2,aa5,ɑː,oe1,n,eoi2,ui2,oenɡ2,inɡ1,anɡ4,t,au4,ei4,u2,aanɡ2,ui4,dʒ,[PAD],a1,e,oenɡ7,aau4,onɡɜ,eoi6,unɡ5,ɹ,e6,yu6,ɪ,ʃ,ei2,aauɜ,enɡɜ,unɡ1,aɪ,i6,eiɜ,aanɡ1,inɡ6,iu1,o5,ui1,inɡ2,unɡ4,eoi4,eo4,uː,ei1,oenɡ4,aa4,aanɡ7,a2,e4,enɡ2,a5,auɜ,iɜ,əl,ai6,iu2,a4,e2,ouɜ,eoi1,anɡ2,[UNK],h,onɡ6,aau6,nɡ5,nɡ4,enɡ5,oeɜ,inɡ4,a6,eoiɜ,e1,ʊ,i1,o7,z,au6,ai4,anɡ6,aai1,oi5,aʊ,v,iu6,unɡ7,au5,eoɜ,aanɡ6,ou1,aanɡ5,(zhy),anɡɜ,oi4,onɡ2,a7,w,ui5,ui6,oe5,unɡ6,aanɡ4,ɔɪ,inɡ7,ɡ,s,o6,aa1,u6,aai4,ʌ,ou7,yuɜ,ɜː,ei6,aiɜ,ə,anɡ7,ai2,u4,iu7,iuɜ,eo1,aai6,eo2,i4,i7,aai5,unɡ2'.split(',')

# Build a vocab file mapping each phone to an integer id
phones_dict = {x: i for i, x in enumerate(phones)}
with open('test.json', 'w') as f:
    json.dump(phones_dict, f, indent=4, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer('test.json', unk_token='[UNK]', pad_token='[PAD]')

# Encode a space-separated phone sequence, then decode it back
text = 'ɡ ei2 j a4 n ɡ eɜ p anɡ4 j au5 t aaɜ n z o2 j a1 t h au2 h eiɜ'
print(tokenizer(text))
print(tokenizer.decode(tokenizer(text)['input_ids'], spaces_between_special_tokens=True))

Output:

{'input_ids': [200, 122, 35, 152, 96, 157, 200, 62, 45, 101, 35, 182, 102, 90, 96, 157, 172, 65, 35, 110, 102, 157, 158, 44, 158, 128], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
ɡ ei2 j a4 n [UNK] ɡ eɜ p anɡ4 j au5 t aaɜ n [UNK] z o2 j a1 t [UNK] h au2 h eiɜ

Expected behavior

The unknown token (id 157, i.e. [UNK]) should not be part of the encoded input, since every phone in text is in the vocabulary.

@ArthurZucker
Collaborator

ArthurZucker commented Apr 30, 2024

>>> print(tokenizer.tokenize(text))
['ɡ', 'ei2', 'j', 'a4', 'n', '|', 'ɡ', 'eɜ', 'p', 'anɡ4', 'j', 'au5', 't', 'aaɜ', 'n', '|', 'z', 'o2', 'j', 'a1', 't', '|', 'h', 'au2', 'h', 'eiɜ']

The [UNK] comes from the '|' (the tokenizer's default word_delimiter_token), which does not seem to be part of your vocab:

>>> '|' in tokenizer.get_vocab()
False
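
A possible workaround (a minimal sketch, reusing phones_dict and text from the reproduction script above) is to make the delimiter a real vocabulary entry, so that any surviving spaces map to '|' instead of [UNK]:

# Workaround sketch: add the default word delimiter '|' to the vocab
# before building the tokenizer.
phones_dict['|'] = len(phones_dict)
with open('test.json', 'w') as f:
    json.dump(phones_dict, f, indent=4, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer('test.json', unk_token='[UNK]', pad_token='[PAD]')
print('|' in tokenizer.get_vocab())   # True
print(tokenizer(text)['input_ids'])   # the [UNK] id (157) no longer appears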

ArthurZucker added the Core: Tokenization label on Apr 30, 2024
@tshmak
Author

tshmak commented May 7, 2024

But why does it insert '|' at seemingly random locations? As you can see, there are no word boundaries in text.
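
Looking at the tokenized output above, the '|' actually seems to appear exactly where both neighbouring phones are single-character vocab entries; my guess (unverified) is that the multi-character entries are registered as added tokens that strip the spaces around them, so only the remaining spaces get mapped to the word delimiter. For example:

# Observation sketch, reusing the tokenizer built in the reproduction script:
# multi-character entries appear to absorb the surrounding spaces, while a space
# between two single-character tokens survives and becomes '|'.
print(tokenizer.tokenize('j a4 n'))    # ['j', 'a4', 'n']        -> no '|'
print(tokenizer.tokenize('n ɡ eɜ'))    # ['n', '|', 'ɡ', 'eɜ']   -> '|' kept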
