tokenizer
A grammar describes the syntax of a programming language, and might be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes tokens and builds a data structure like an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? A compiler combines a lexer and parser with further stages, such as semantic analysis and code generation, all built for a specific grammar.
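The lexer-then-parser pipeline above can be sketched in a few lines of Python. The grammar, token names, and AST shape below are illustrative assumptions for a tiny arithmetic language, not taken from any of the repositories listed on this page:

```python
import re

# Hypothetical grammar for the sketch:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("OP",     r"[+\-*/()]"),
    ("SKIP",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Lexical analysis: turn raw text into (kind, value) tokens."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(text)
            if m.lastgroup != "SKIP"]

def parse(tokens):
    """Recursive-descent parser: turn tokens into a nested-tuple AST."""
    pos = 0

    def peek():
        return tokens[pos][1] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() == "(":
            eat()                      # consume '('
            node = expr()
            eat()                      # consume ')'
            return node
        return ("num", int(eat()[1]))  # NUMBER leaf

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = eat()[1]
            node = (op, node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = eat()[1]
            node = (op, node, term())
        return node

    return expr()

ast = parse(lex("2 + 3 * (4 - 1)"))
print(ast)  # ('+', ('num', 2), ('*', ('num', 3), ('-', ('num', 4), ('num', 1))))
```

Note how the concerns separate: `lex` only recognizes token shapes, while `parse` enforces the context of the grammar (operator precedence via the `expr`/`term`/`factor` layering).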
Here are 1,071 public repositories matching this topic...

- DOM-aware tokenization for Hugging Face language models (HTML, updated May 22, 2024)
- An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation (Python, updated May 22, 2024)
- 🧪 N-gram tools for 🙃 Phony Language, with features like sanitizing, tokenization, n-gram extraction, and frequency mapping (PHP, updated May 22, 2024)
- [READ ONLY] Locate available classes by parent, interface or trait. Subtree split of the Spiral Tokenizer component (see spiral/framework) (PHP, updated May 22, 2024)
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency (Python, updated May 22, 2024)
- DadmaTools is a Persian NLP toolkit developed by Dadmatech Co. (Python, updated May 22, 2024)
- Taiwanese Hokkien Transliterator and Tokeniser (JavaScript, updated May 22, 2024)
- ⛄ Possibly the smallest Lua compiler ever (Lua, updated May 22, 2024)
- Taiwanese Hokkien Transliterator and Tokeniser (Python, updated May 22, 2024)
- An elegant math parser written in Lua, featuring support for adding custom operators and functions (Lua, updated May 21, 2024)
- Tokenization utilities for building parsers in Rust (Rust, updated May 21, 2024)
- Lua Compiler, (De)Obfuscator, Minifier, Beautifier, and more (Lua, updated May 21, 2024)
- Oxide is a hybrid database and streaming messaging system (think Kafka + MySQL), supporting data access via REST and SQL (Rust, updated May 20, 2024)