Add Text Tokenizer #47

Open · Vijay-Nirmal opened this issue Mar 16, 2023 · 20 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@Vijay-Nirmal

Feature Request

Add a way to tokenize text so that it can be passed as an input (like logit_bias) to models.

Is your feature request related to a problem? Please describe.

I am trying to use OpenAI APIs such as completions. They accept a "logit_bias" option, but there is currently no way to generate the proper tokens for a piece of text in order to pass them in.

Describe the solution you'd like

A .NET implementation of OpenAI's tokenizer.

Describe alternatives you've considered

There is an existing MIT-licensed NuGet package called GPT-3-Encoder-Sharp that does this.
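
To illustrate the intended use (a minimal sketch; the `TextTokenizer` type and its methods are hypothetical placeholders for whatever API this package would expose, and the token values are illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical tokenizer API -- not an existing type in this package.
var tokenizer = TextTokenizer.GetTokenizer("gpt-3");
IReadOnlyList<int> tokens = tokenizer.Encode(" banana");

// OpenAI's completion API accepts logit_bias as a map of
// token id (as a string) -> bias value in the range -100..100.
Dictionary<string, int> logitBias = tokens.ToDictionary(
    t => t.ToString(), // keys are token ids as strings
    _ => -100);        // -100 effectively bans the token
```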

Vijay-Nirmal added the enhancement (New feature or request) label on Mar 16, 2023
StephenHodgson added the help wanted (Extra attention is needed) and good first issue (Good for newcomers) labels on Mar 16, 2023
@StephenHodgson (Member)

Is there anything preventing you from using GPT-3-Encoder-Sharp with this package?

@StephenHodgson (Member) commented Mar 16, 2023

Honestly, I would rather OpenAI add an endpoint specifically to do this. They have their own tokenizer utility page that gives you an idea of how many tokens a given text uses, but the encoder differs per model.

I may not pick up this issue, only because it's a moving target and there are other NuGet packages that can handle this task.

StephenHodgson changed the title from "Add Text tokenizer" to "Add Text Tokenizer" on Mar 16, 2023
@Vijay-Nirmal (Author) commented Mar 16, 2023

Even OpenAI is recommending a third-party package called gpt-3-encoder:

[Screenshot: OpenAI documentation recommending the gpt-3-encoder package]

> I may not pick up this issue only because it's a moving target

We don't have to worry about the encoder logic changing or evolving: OpenAI released the original encoder four years ago, and there have been no changes to it since. The encoding logic of GPT-2 and GPT-3 is the same.

@StephenHodgson (Member)

But the encoder for GPT-4 is different.

@Vijay-Nirmal (Author)

I'm not sure about GPT-4 (though I don't think so), but my point is that it won't change or evolve. If GPT-4 has a different encoding, we can write one-time encoding logic for GPT-4, and it will never change.

@StephenHodgson (Member)

That's not what I heard:
https://news.ycombinator.com/item?id=34008839

@StephenHodgson (Member)

In either case, like I said, I won't be picking up this task, but PRs are always welcome.

@Vijay-Nirmal (Author)

Sure, I will do it over the weekend.

@StephenHodgson (Member)

I still don't understand why the package you referenced before isn't a sufficient substitute.

@Vijay-Nirmal (Author)

Just want one OpenAI package to do everything related to OpenAI, that's all. It's up to you, feel free to close the issue. 🙂

@StephenHodgson (Member)

I'll leave it open if you plan to open a PR; I was just curious more than anything.

@StephenHodgson (Member)

https://github.com/aiqinxuancai/TiktokenSharp

Here's another good reference. I like that they're also pulling the tiktoken data.
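
For reference, TiktokenSharp usage is roughly the following (a sketch based on its README at the time; verify the exact names against the current package):

```csharp
using TiktokenSharp;

// Pick the encoding by model name; the BPE rank data is fetched on first use.
TikToken tikToken = TikToken.EncodingForModel("gpt-3.5-turbo");

var tokens = tikToken.Encode("hello world"); // e.g. [15339, 1917] under cl100k_base
var text = tikToken.Decode(tokens);          // "hello world"
```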

@Vijay-Nirmal (Author)

@StephenHodgson I referred to OpenAI's implementation; they also pull the tokens from blob storage.

In their code I found one interesting comment: "# TODO: these will likely be replaced by an API endpoint". So my question is: are you still open to having our own custom implementation, or should we wait for the API endpoint?
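
For anyone picking this up: the files in that blob are plain text with one `<base64-encoded token bytes> <rank>` pair per line, so loading them is straightforward. A sketch (the file name is illustrative, and a real implementation needs a content-based comparer for the `byte[]` keys):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Each line of e.g. cl100k_base.tiktoken is "<base64 token bytes> <rank>".
var ranks = new Dictionary<byte[], int>(new ByteArrayComparer());
foreach (var line in File.ReadLines("cl100k_base.tiktoken"))
{
    if (string.IsNullOrWhiteSpace(line)) continue;
    var parts = line.Split(' ');
    ranks[Convert.FromBase64String(parts[0])] = int.Parse(parts[1]);
}

// Compares byte[] keys by content rather than by reference.
class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[]? a, byte[]? b) =>
        a != null && b != null && a.AsSpan().SequenceEqual(b);
    public int GetHashCode(byte[] a) =>
        a.Aggregate(17, (h, x) => h * 31 + x);
}
```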

@StephenHodgson (Member)

Nice, looks like they took my suggestion seriously

@StephenHodgson (Member)

I guess it doesn't hurt to do it, and then replace it when the API becomes available.
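
If we do, it's probably worth hiding the implementation behind a small seam so the local BPE code can be swapped for an endpoint later. A sketch only; the names are illustrative, not this package's actual API:

```csharp
// Illustrative abstraction, not an existing type in this package.
public interface ITextTokenizer
{
    IReadOnlyList<int> Encode(string text);
    string Decode(IReadOnlyList<int> tokens);
}

// Today this could be backed by a local BPE implementation; later, by a
// client that calls the hoped-for OpenAI tokenizer endpoint instead.
```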

@StephenHodgson (Member)

@logankilpatrick any internal support on adding API for tokenizer?

@HavenDV commented Jun 22, 2023

For the time being, I think it's best to use either the Microsoft version or, if optimization is needed, the slightly faster Tiktoken:
https://github.com/microsoft/Tokenizer
https://github.com/tryAGI/Tiktoken
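
For reference, the Microsoft package (TokenizerLib) is used roughly like this; a sketch from memory of its README, so treat the exact signatures as unverified:

```csharp
using System.Collections.Generic;
using Microsoft.DeepDev;

// Build a tokenizer for a given model name; the BPE rank data
// for that model is resolved for you.
ITokenizer tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-3.5-turbo");

// Encode takes the text plus a set of special tokens allowed to pass through.
var tokens = tokenizer.Encode("hello world", new HashSet<string>());
```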

henkin pushed a commit to henkin/chatgpt that referenced this issue Oct 6, 2023
@StephenHodgson (Member)

I agree; I think the msft package should be easy to integrate. I may consider adding it as a dependency.

@r-Larch commented Apr 4, 2024

I recommend using SharpToken: it is the fastest, with the lowest memory consumption, thanks to my latest PR to that repository.

Benchmark Code

Benchmark results:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3296/23H2/2023Update/SunValley3)
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.200
  [Host]               : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  .NET 6.0             : .NET 6.0.16 (6.0.1623.17311), X64 RyuJIT AVX2
  .NET 8.0             : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  .NET Framework 4.7.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
| Method        | Job                  | Runtime              | Mean     | Error    | StdDev   | Gen0       | Gen1      | Allocated |
|---------------|----------------------|----------------------|---------:|---------:|---------:|-----------:|----------:|----------:|
| SharpToken    | .NET 8.0             | .NET 8.0             | 100.4 ms |  1.95 ms |  1.91 ms |  2000.0000 |         - |  22.13 MB |
| SharpToken    | .NET 6.0             | .NET 6.0             | 169.9 ms |  2.42 ms |  2.15 ms | 24333.3333 | 1000.0000 |  196.3 MB |
| SharpToken    | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 455.3 ms |  8.34 ms |  6.97 ms | 34000.0000 | 1000.0000 | 204.39 MB |
| TiktokenSharp | .NET 8.0             | .NET 8.0             | 211.4 ms |  1.83 ms |  1.53 ms | 42000.0000 | 1000.0000 | 338.98 MB |
| TiktokenSharp | .NET 6.0             | .NET 6.0             | 258.6 ms |  5.09 ms |  6.25 ms | 39000.0000 | 1000.0000 | 313.26 MB |
| TiktokenSharp | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 638.3 ms | 12.47 ms | 16.21 ms | 63000.0000 | 1000.0000 | 378.31 MB |
| TokenizerLib  | .NET 8.0             | .NET 8.0             | 124.4 ms |  1.81 ms |  1.60 ms | 27250.0000 | 1000.0000 | 217.82 MB |
| TokenizerLib  | .NET 6.0             | .NET 6.0             | 165.5 ms |  1.38 ms |  1.16 ms | 27000.0000 | 1000.0000 | 217.82 MB |
| TokenizerLib  | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 499.7 ms |  9.81 ms | 14.07 ms | 40000.0000 | 1000.0000 | 243.79 MB |
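
For completeness, SharpToken usage looks roughly like this (per its README; verify against the current package):

```csharp
using SharpToken;

// Get an encoding by name, or by model via GptEncoding.GetEncodingForModel("gpt-4").
var encoding = GptEncoding.GetEncoding("cl100k_base");

var tokens = encoding.Encode("hello world");
var text = encoding.Decode(tokens);
```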
