Tokenizer for multiple encodings #213
Comments
Hi, thanks for creating the issue. Both solutions (porting the tokenizer or using a different library) are okay for me, but first I need to do a bit of research on it.
Is your feature request related to a problem? Please describe.
I need to calculate the number of tokens, but TokenizerGpt3 miscounts them for GPT-3.5 and newer models.
TokenizerGpt3 is largely ported from openai-tools. Reading the source code, its implementation follows data_gym_to_mergeable_bpe_ranks, which requires an encoder.json and a vocab.bpe file at runtime. According to openai_public, this method applies mainly to gpt-2, and my test results show it also works for r50k_base and p50k_base. However, it does not work for cl100k_base (GPT-3.5 and GPT-4).
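For context, the GPT-2-style data_gym format splits the vocabulary across two files: encoder.json maps token strings to ids, and vocab.bpe lists BPE merge rules in priority order. A minimal sketch of how such files could be parsed (the file contents below are made-up stand-ins, not real vocabulary data):

```python
import json

# Illustrative stand-ins for encoder.json and vocab.bpe contents.
encoder_json = '{"h": 0, "i": 1, "hi": 2}'
vocab_bpe = "#version: 0.2\nh i\n"

# encoder.json: token string -> token id.
encoder = json.loads(encoder_json)

# vocab.bpe: one merge pair per line, in priority order (skip the header line).
merges = [tuple(line.split()) for line in vocab_bpe.splitlines()[1:] if line]
bpe_ranks = {pair: rank for rank, pair in enumerate(merges)}

print(encoder["hi"])          # id of the merged token
print(bpe_ranks[("h", "i")])  # merge priority 0
```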
Starting from r50k_base, the tokenizer implementation changed to load_tiktoken_bpe, which relies on a .tiktoken file at runtime. Currently, two tokenizer projects support GPT-3.5, TiktokenSharp and SharpToken, and both are implemented this way.
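The .tiktoken format consumed by load_tiktoken_bpe is simpler: each line is a base64-encoded token followed by its integer rank. A rough sketch of a parser assuming that layout (the sample data and helper name are illustrative, not taken from any library):

```python
import base64

# Illustrative .tiktoken contents: "<base64 token> <rank>" per line.
sample = "\n".join(
    base64.b64encode(tok).decode() + f" {rank}"
    for rank, tok in enumerate([b"h", b"i", b"hi"])
)

def parse_tiktoken_text(text: str) -> dict[bytes, int]:
    """Parse base64-token/rank lines into a mergeable-ranks dict."""
    ranks = {}
    for line in text.splitlines():
        if not line:
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

ranks = parse_tiktoken_text(sample)
print(ranks[b"hi"])  # 2
```

Because the ranks double as merge priorities, a single file replaces the encoder.json/vocab.bpe pair.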
Describe the solution you'd like
It would be difficult to modify the current TokenizerGpt3 to support cl100k_base; a rewrite is probably the only way. Do you think it's necessary? If so, I'm willing to undertake the rewrite. Please let me know your opinion.
Describe alternatives you've considered
Alternatively, we could just adopt TiktokenSharp directly.