Question about retraining/fine-tuning EncoderModel with new words in t5.get_tokenizer() #358

Open
Kevin7720 opened this issue Jul 25, 2023 · 2 comments


@Kevin7720

I have added some new words to the tokenizer returned by t5.get_tokenizer(), as shown below:

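from transformers import T5Tokenizer

def get_tokenizer(name):
    # MAX_LENGTH is assumed to be defined elsewhere in the module
    tokenizer = T5Tokenizer.from_pretrained(name, model_max_length=MAX_LENGTH)
    new_words = ['XXX', 'OOO']  # placeholders; remaining words elided in the original post
    # Registers the words in the tokenizer only; the model's embedding table is not changed
    tokenizer.add_tokens(new_words)
    return tokenizer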

I would like to understand if I need to retrain or fine-tune the EncoderModel after adding these new words to the tokenizer. How will this modification affect the model's performance or behavior?

This question is related to the Imagen project, and I want to ensure that I am following the correct approach when incorporating new words into the tokenizer.

@jacobwjs
Contributor

I'm not exactly sure what you mean, but what you proposed won't get you there.

See here:
https://github.com/huggingface/transformers/blob/70b49f023c9f6579c516671604468a491227b4da/src/transformers/tokenization_utils_base.py#L863

When you add new tokens to the vocabulary (and add the entry in the embedding layer), you'll end up with randomly initialized values corresponding to the new token(s).
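
To make the point above concrete, here is a minimal sketch using a toy pure-Python embedding table, not the actual T5 encoder. The helper names (`make_embedding`, `resize_embedding`), the table sizes, and the initialization scale are all assumptions chosen for illustration. In Hugging Face transformers the corresponding step is typically `model.resize_token_embeddings(len(tokenizer))`; the sketch only mimics its effect: existing rows are kept, and rows for the new token ids start out random.

```python
import random


def make_embedding(vocab_size, dim, rng):
    """Toy stand-in for a trained embedding table (vocab_size x dim)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def resize_embedding(table, new_vocab_size, rng):
    """Copy existing rows unchanged; append randomly initialized rows for new ids."""
    dim = len(table[0])
    resized = [row[:] for row in table[:new_vocab_size]]
    while len(resized) < new_vocab_size:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized


rng = random.Random(0)
old_vocab_size, dim = 8, 4  # toy sizes, not T5's real dimensions
pretrained = make_embedding(old_vocab_size, dim, rng)

num_new_tokens = 2  # e.g. the placeholder words 'XXX' and 'OOO'
resized = resize_embedding(pretrained, old_vocab_size + num_new_tokens, rng)

# Rows for the original vocabulary are untouched.
assert resized[:old_vocab_size] == pretrained

# Rows for the new ids exist now, but they hold random values with no
# learned meaning until the model is trained further on data using them.
new_rows = resized[old_vocab_size:]
print(len(resized), len(new_rows))
```

The random rows are why adding words to the tokenizer alone is not enough: whatever consumes the encoder output has never seen those vectors, so some fine-tuning on text containing the new words would be needed before they carry useful information.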

@Kevin7720
Author

Thank you for your reply! Your answer has been incredibly helpful to me.
