Tokenizer for multiple encodings #213
Comments
Hi, thanks for creating the issue. Both solutions (porting the tokenizer or using a different library) are okay for me, but first I need to do a bit of research on it.
Is your feature request related to a problem? Please describe.
I need to calculate the number of tokens, but TokenizerGpt3 miscounts them for GPT-3.5 and newer models.
TokenizerGpt3 is largely ported from openai-tools. Reading the source code, its implementation follows data_gym_to_mergeable_bpe_ranks, which requires an encoder.json and a vocab.bpe file at runtime. According to openai_public, this method applies mainly to gpt-2, and my test results show it also works for r50k_base and p50k_base. However, it does not work for cl100k_base (GPT-3.5 and GPT-4).
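For context, the GPT-2-style data_gym format splits the vocabulary across two files: encoder.json maps token strings to ids, and vocab.bpe lists BPE merge rules in priority order. A minimal sketch of how such files could be parsed (the file contents below are made-up stand-ins, not real vocabulary data):

```python
import json

# Illustrative stand-ins for encoder.json and vocab.bpe contents.
encoder_json = '{"h": 0, "i": 1, "hi": 2}'
vocab_bpe = "#version: 0.2\nh i\n"

# encoder.json: token string -> token id.
encoder = json.loads(encoder_json)

# vocab.bpe: one merge pair per line, in priority order (skip the header line).
merges = [tuple(line.split()) for line in vocab_bpe.splitlines()[1:] if line]
bpe_ranks = {pair: rank for rank, pair in enumerate(merges)}

print(encoder["hi"])          # id of the merged token
print(bpe_ranks[("h", "i")])  # merge priority 0
```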
Starting from r50k_base, the tokenizer implementation changed to load_tiktoken_bpe, which relies on a .tiktoken file at runtime. Currently, two tokenizer projects support GPT-3.5, TiktokenSharp and SharpToken, and both are implemented this way.
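The .tiktoken format consumed by load_tiktoken_bpe is simpler: each line is a base64-encoded token followed by its integer rank. A rough sketch of a parser assuming that layout (the sample data and helper name are illustrative, not taken from any library):

```python
import base64

# Illustrative .tiktoken contents: "<base64 token> <rank>" per line.
sample = "\n".join(
    base64.b64encode(tok).decode() + f" {rank}"
    for rank, tok in enumerate([b"h", b"i", b"hi"])
)

def parse_tiktoken_text(text: str) -> dict[bytes, int]:
    """Parse base64-token/rank lines into a mergeable-ranks dict."""
    ranks = {}
    for line in text.splitlines():
        if not line:
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

ranks = parse_tiktoken_text(sample)
print(ranks[b"hi"])  # 2
```

Because the ranks double as merge priorities, a single file replaces the encoder.json/vocab.bpe pair.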
Describe the solution you'd like
It would be difficult to modify the current TokenizerGpt3 to support cl100k_base; a rewrite is probably the only way. Do you think it's necessary? If so, I'm willing to undertake the rewrite. Please let me know your opinion.
Describe alternatives you've considered
Alternatively, we could just adopt TiktokenSharp directly.