
Improve HF tokenization hack to cover multiple special tokens #649

Closed

Conversation

@shawnz (Contributor) commented Feb 21, 2024

Like some others (#620, #609, #581, #556, #555, #494, #479), I have been running into an assertion failure when my prompt contains special tokens. In particular, the existing workaround doesn't seem to account for more than one special token being present in the prompt, but handling that is necessary, for example, when using the chat template shipped with models such as HuggingFaceH4/zephyr-7b-beta, which includes special tokens as part of the template.
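For illustration, a minimal sketch of what I mean; the rendered output shown in the comments is my approximation of the zephyr template, not copied from the model card:

```python
# Rendering the chat template yields a prompt that already contains special
# tokens (at minimum the </s> end-of-turn markers appear several times).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Roughly:
# <|system|>
# You are a helpful assistant.</s>
# <|user|>
# Hello!</s>
# <|assistant|>
```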

I don't fully understand the nature of the workaround or why it is necessary, but after some experimentation the changes here made it work for my use case. I am posting them in case they help anyone experiencing the same issue, and also in the hope of starting a discussion that might lead to a more robust solution. Perhaps @slundberg, who I believe implemented the original workaround, can chime in and let me know whether I am on the right track with this change.

Thanks, Shawn

@slundberg (Contributor) commented

Thanks! This is because if you take the string "my text", tokenize it with some HF models, and then untokenize it, you get " my text" back, so the tokenizers are not invertible: they add spaces after special tokens. It is annoying and seems wrong, but it is not something we can change, hence the hack. Your update does not seem wrong, but since it appears to change the token count unit test, I suspect it may be brittle across different model types. Do you mind adding a few tests that validate this for a few common models, to make sure it does not break anything?
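As a minimal illustration of that round-trip asymmetry (the model name and the exact decoded string here are assumptions for the example, not taken from this thread):

```python
# Encode a plain string, then decode it again: with some HF tokenizers the
# result is not the original string, because a space is inserted after the
# special (BOS) token that encode() prepends.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

text = "my text"
ids = tok.encode(text)       # typically prepends the BOS token <s>
roundtrip = tok.decode(ids)

print(repr(text))            # 'my text'
print(repr(roundtrip))       # something like '<s> my text'
```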

Long term, I am hoping we can get around much of this by using a speculative decoder model. If we integrate with a speculative decoder, we won't need to clean up tokens after forcing them (since the speculative decoder should follow the constraints), and so we won't have to use the function at all. But of course that all depends on having access to a spec decoder, which won't always be the case :)

@shawnz (Contributor, Author) commented Feb 22, 2024

You are right, this change is brittle and seems to have broken several tests! I am working on making it handle more cases right now, and I think I might have a viable solution that fixes the existing tests while also covering my use case. In the meantime, if anyone has suggestions or ideas for different approaches, I'm all ears!

@shawnz marked this pull request as draft on February 22, 2024
@shawnz (Contributor, Author) commented Feb 27, 2024

Closing this, as I ultimately couldn't find a better solution that accounts for all the possible edge cases. That said, I was able to get my original use case working by setting use_fast=False and legacy=False when calling AutoTokenizer.from_pretrained(...). Others facing similar issues might want to experiment with those options.
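For anyone who wants to try this, a sketch of the tokenizer setup I mean (the model name is just an example, and how you then hand the tokenizer to your model wrapper depends on your setup):

```python
# Load the slow (SentencePiece-backed) tokenizer with the legacy space
# handling disabled, which avoided the special-token issue in my case.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    use_fast=False,   # use the slow tokenizer implementation
    legacy=False,     # opt out of the legacy behaviour around special tokens
)
```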

@shawnz closed this on Feb 27, 2024
@shawnz (Contributor, Author) commented Mar 1, 2024

Note that a competing project, "outlines", is implementing a similar feature and seems to have settled on a method that might avoid some of these hacks. Perhaps it could be a source of inspiration for a future look at this; see outlines-dev/outlines#161 and outlines-dev/outlines#531.
