lighttransport/nanotokenizer

Nanoscale tokenizer in C++

A nanoscale tokenizer written in C++. Currently, the RWKV world tokenizer is implemented.

Features

  • Easy to embed
  • Reads the vocabulary from JSON (through minijson)

Variants

  • Naive trie implementation: rwkv_world_tokenizer_trie.hh
  • Efficient version using hat-trie : rwkv_world_tokenizer_hat.hh
  • Efficient version using cedar : rwkv_world_tokenizer_cedar.hh
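
To illustrate the idea behind the naive variant, here is a minimal sketch of greedy longest-match tokenization over a byte-level trie. It is not the code from rwkv_world_tokenizer_trie.hh: the names `TrieNode`, `NaiveTrieTokenizer`, `add_token` and `encode` are hypothetical and chosen for this example only. Failure is reported through a `bool` return value rather than an exception, on the assumption that this suits exception-free builds.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical minimal trie node; not the class used in this repository.
struct TrieNode {
  std::map<unsigned char, std::unique_ptr<TrieNode>> children;
  int token_id = -1;  // -1 means no token ends at this node
};

class NaiveTrieTokenizer {
 public:
  // Register a token given as a raw byte string.
  void add_token(const std::string &bytes, int id) {
    TrieNode *node = &root_;
    for (unsigned char c : bytes) {
      auto &child = node->children[c];
      if (!child) child = std::make_unique<TrieNode>();
      node = child.get();
    }
    node->token_id = id;
  }

  // Greedy longest match. Returns false (no exception) when some
  // position in the input is not covered by any token.
  bool encode(const std::string &text, std::vector<int> &out) const {
    out.clear();
    std::size_t pos = 0;
    while (pos < text.size()) {
      const TrieNode *node = &root_;
      int best_id = -1;
      std::size_t best_len = 0;
      // Walk the trie as far as possible, remembering the last token seen.
      for (std::size_t i = pos; i < text.size(); ++i) {
        auto it = node->children.find(static_cast<unsigned char>(text[i]));
        if (it == node->children.end()) break;
        node = it->second.get();
        if (node->token_id >= 0) {
          best_id = node->token_id;
          best_len = i - pos + 1;
        }
      }
      if (best_id < 0) return false;
      out.push_back(best_id);
      pos += best_len;
    }
    return true;
  }

 private:
  TrieNode root_;
};
```

With the tokens "a", "ab", "abc" and "c" registered, the input "abcab" is split into "abc" followed by "ab", because the longest matching token wins at each position.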

If you want to run the tokenizer without C++ exceptions (e.g. in WASM), the naive or cedar version is recommended.

Additional feature over the original RWKV world tokenizer:

  • UTF-8 byte fallback
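
As a rough illustration of what a byte fallback can look like, the sketch below emits one token per raw byte whenever no vocabulary entry matches at the current position, so any UTF-8 input can be encoded. Everything here is an assumption for the example: the function `encode_with_byte_fallback`, the `max_token_len` parameter, and especially `byte_token_base` (the ID assigned to byte 0x00) are hypothetical, and the ID layout actually used by this library may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: longest-match lookup with a per-byte fallback.
// byte_token_base is an assumed ID offset, not taken from this repository.
std::vector<int> encode_with_byte_fallback(
    const std::string &text,
    const std::unordered_map<std::string, int> &vocab,
    std::size_t max_token_len, int byte_token_base) {
  std::vector<int> out;
  std::size_t pos = 0;
  while (pos < text.size()) {
    // Try the longest candidate first, then shrink.
    std::size_t len = std::min(max_token_len, text.size() - pos);
    bool matched = false;
    for (; len > 0; --len) {
      auto it = vocab.find(text.substr(pos, len));
      if (it != vocab.end()) {
        out.push_back(it->second);
        pos += len;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // No token covers this position: emit the raw byte as its own token.
      out.push_back(byte_token_base + static_cast<unsigned char>(text[pos]));
      ++pos;
    }
  }
  return out;
}
```

For example, with a vocabulary containing only "he" and "llo", the two UTF-8 bytes of "é" in "héllo" are each emitted as a separate byte token instead of causing the encode to fail.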

TODO

  • Make all variants free of C++ exceptions

Third party libraries
