
Question about retraining/fine-tuning EncoderModel with new words in `t5.get_tokenizer()` #358

Open

Kevin7720 opened this issue Jul 25, 2023 · 2 comments

Kevin7720 commented Jul 25, 2023

I have added some new words to `t5.get_tokenizer()`, as shown below:

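def get_tokenizer(name):
    # T5Tokenizer and MAX_LENGTH come from the surrounding t5 module
    tokenizer = T5Tokenizer.from_pretrained(name, model_max_length=MAX_LENGTH)
    new_words = ['XXX', 'OOO']  # ...... (further words elided)
    # Register the new words as additional tokens in the tokenizer vocabulary
    tokenizer.add_tokens(new_words)
    return tokenizer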

I would like to understand if I need to retrain or fine-tune the EncoderModel after adding these new words to the tokenizer. How will this modification affect the model's performance or behavior?

This question is related to the Imagen project, and I want to ensure that I am following the correct approach when incorporating new words into the tokenizer.


Contributor

jacobwjs commented Aug 24, 2023

I'm not exactly sure what you mean, but adding the words to the tokenizer alone, as you proposed, won't get you there.

See here:
https://github.com/huggingface/transformers/blob/70b49f023c9f6579c516671604468a491227b4da/src/transformers/tokenization_utils_base.py#L863

When you add new tokens to the vocabulary (and add the corresponding entries to the embedding layer), the embeddings for the new tokens are randomly initialized, so the encoder has learned nothing about them until it is fine-tuned.
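
To make that concrete, here is a minimal, library-free sketch of what growing an embedding table amounts to. It is only an illustration: the toy `vocab`, the helper names `make_embedding_table` and `grow_embedding_table`, the dimension, and the initialization scale are all assumptions made up for this example, not the code used by transformers or by this repo. In Hugging Face transformers the corresponding step is, as far as I know, `model.resize_token_embeddings(len(tokenizer))`, and the exact initialization of the new rows may differ between versions.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Build a vocab_size x dim table of floats (stand-in for trained weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

def grow_embedding_table(table, num_new_tokens, seed=1):
    """Append randomly initialized rows for new tokens; existing rows are kept."""
    dim = len(table[0])
    rng = random.Random(seed)
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(num_new_tokens)]
    return table + new_rows

# Hypothetical toy vocabulary standing in for the pretrained tokenizer.
vocab = {"a": 0, "cat": 1, "sat": 2}
table = make_embedding_table(len(vocab), dim=4)

# Adding words assigns them fresh ids at the end of the vocabulary...
new_words = ["XXX", "OOO"]
for word in new_words:
    vocab[word] = len(vocab)

# ...so the embedding table has to grow by the same number of rows.
grown = grow_embedding_table(table, len(new_words))

print(len(grown))          # 5 rows: 3 original + 2 new
print(grown[:3] == table)  # True: the original rows are unchanged
```

The rows for `XXX` and `OOO` are just noise at this point. Nothing downstream has been trained on them, which is why some fine-tuning is needed before the new tokens mean anything to the model.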

Author

Kevin7720 commented Aug 30, 2023

Thank you for your reply! Your answer has been incredibly helpful to me.
