
chore: Add model vocab support #7117

Closed
wants to merge 70 commits

Conversation

teleprint-me
Contributor

teleprint-me commented May 7, 2024

ref #6920 comment

supersedes #7018

Adds the following models:

  • phi
  • stablelm
  • qwen (qwen2 supersedes qwen)
  • mistral
  • mixtral

Adds the following extras:

  • Adds the stablelm vocab
  • Adds the generate-vocab.sh script
  • Adds ability to generate the generate-vocab.sh script

Collaborator

Galunid left a comment


Is this work in progress? It seems to be missing the implementation in llama.cpp. Those tokenizers won't be recognized, leading to a crash with a runtime error.

@teleprint-me
Contributor Author

@Galunid Yes, it's still a work in progress. I was passing out while still implementing because I wanted to get it out of the way, so I decided to pause for a bit.

ggerganov marked this pull request as draft May 7, 2024 17:49
teleprint-me changed the title from "chore: Add stablelm vocab" to "chore: Add model vocab support" May 7, 2024
Contributor

github-actions bot commented May 8, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 547 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8559.1ms p(95)=21749.4ms fails=, finish reason: stop=492 truncated=55
  • Prompt processing (pp): avg=102.49tk/s p(95)=412.6tk/s
  • Token generation (tg): avg=34.05tk/s p(95)=46.61tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=add-stablelm-hash commit=9269594919bb9952b176c70606185f805a932ed7

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations.]

@teleprint-me
Contributor Author

teleprint-me commented May 8, 2024

Possible regression with llama-spm?

      Start 11: test-tokenizer-1-llama-spm
11/24 Test #11: test-tokenizer-1-llama-spm .......Subprocess aborted***Exception:   0.86 sec

Looking into it.

@CISC
Contributor

CISC commented May 8, 2024

@teleprint-me Shouldn't qwen be removed since #7114 got merged? They are the same.

@teleprint-me
Contributor Author

teleprint-me commented May 8, 2024

@CISC

I had already committed it and was waiting to be able to merge the PRs into this branch. I can remove it after some testing if it really isn't needed, but the pattern is useful for seeing the potential variety of implementations we'll be dealing with. The original Qwen repos use BPE with tiktoken and a build script that is hooked into the transformers tokenizer. While the vocab might be the same, the process for getting it is completely different. It's good to know about, even if the tokenizer itself isn't useful in this context.

@teleprint-me
Contributor Author

teleprint-me commented May 17, 2024

Hm. I messed up the merge. It usually auto-rebases downstream merges. That's no good.

@aahouzi
Contributor

aahouzi commented May 17, 2024

@teleprint-me Your PR is not working for stablelm models. Why is the pre-tokenizer stablelm? If I remember correctly, it should be gpt-2, right?


@teleprint-me
Contributor Author

teleprint-me commented May 17, 2024

@aahouzi Yes, you're correct. I think I know why this is happening.

I'm wondering if it's because of the _set_vocab_gpt2 method.

    def _set_vocab_gpt2(self) -> None:
        tokens, toktypes, tokpre = self.get_vocab_base()
        self.gguf_writer.add_tokenizer_model("gpt2")
        self.gguf_writer.add_tokenizer_pre(tokpre)
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_types(toktypes)

add_tokenizer_pre is called, but tokpre is just whatever value was detected and passed in.

I fixed phi-2 by directly adding self.gguf_writer.add_tokenizer_pre("gpt-2") to it, though I'm unsure whether that change belongs here.

Overriding on a model-by-model basis as needed might be a better approach?

Need feedback on how this might affect other models. @compilade
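For reference, a rough sketch of what such a per-model override could look like, following the class pattern in convert-hf-to-gguf.py; the decorator, class name, and body here are illustrative assumptions, not the actual patch:

    # Hypothetical sketch: pin the pre-tokenizer for a specific model class
    # instead of relying on the hash-detected value.
    @Model.register("PhiForCausalLM")
    class Phi2Model(Model):
        model_arch = gguf.MODEL_ARCH.PHI2

        def set_vocab(self):
            tokens, toktypes, _tokpre = self.get_vocab_base()
            self.gguf_writer.add_tokenizer_model("gpt2")
            # Override the detected pre-tokenizer with a known-good id.
            self.gguf_writer.add_tokenizer_pre("gpt-2")
            self.gguf_writer.add_token_list(tokens)
            self.gguf_writer.add_token_types(toktypes)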

@teleprint-me
Contributor Author

teleprint-me commented May 17, 2024

Okay, I get it now. This confirms my initial intuition about how the mapping was set up in the update script. This needs to be refactored somehow. We can't rely on a name as an id for the model's pre-tokenizer.

    # used for GPT-2 BPE and WordPiece vocabs
    def get_vocab_base(self) -> tuple[list[str], list[int], str]:
        tokens: list[str] = []
        toktypes: list[int] = []

        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
        vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab))
        assert max(tokenizer.vocab.values()) < vocab_size

        tokpre = self.get_vocab_base_pre(tokenizer)

        # omitting for brevity

        return tokens, toktypes, tokpre

The model's vocab may be modified downstream, and conversions will then fail even though the architecture is clearly supported. This creates name conflicts. It turns out the phi-2 bug in #7219 and #7300 is a symptom of a more deeply rooted issue. @ggerganov
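For context, the detection that makes this fragile works roughly as follows (a simplified sketch of the get_vocab_base_pre logic generated by convert-hf-to-gguf-update.py; the hash strings below are placeholders, not real values): it encodes a fixed test string, hashes the resulting token ids, and maps that hash to a name, so any downstream change to the vocab produces an unknown hash even for a supported architecture.

    # Simplified sketch, not the exact generated code.
    from hashlib import sha256

    def get_vocab_base_pre(tokenizer) -> str:
        chktxt = "..."  # the fixed test string shared with convert-hf-to-gguf-update.py
        chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()

        if chkhsh == "<hash recorded for llama-bpe>":
            return "llama-bpe"
        if chkhsh == "<hash recorded for gpt-2>":
            return "gpt-2"

        # A fine-tune that modified the vocab hashes differently, so conversion
        # fails here even though the architecture itself is supported.
        raise NotImplementedError("BPE pre-tokenizer was not recognized")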

@teleprint-me
Contributor Author

I think I can fully automate this entire process and reduce the complexity. Not sure yet. Need to experiment.

@compilade
Collaborator

compilade commented May 17, 2024

I fixed phi-2 by directly adding self.gguf_writer.add_tokenizer_pre("gpt-2") to it, unsure if this should be changed here though.

Phi-2 has this pre-tokenizer (from its tokenizer.json):

  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
  }

The "use_regex" here means to use the GPT-2 regex, so using the gpt-2 is correct.

Then override on a model-by-model basis as needed might be a better approach?

Hmm, no, I don't think this will work for architectures that are used with different pre-tokenizers. For example, the StableLMModel uses at least 2 different pre-tokenizers:

  • StableLM 3b uses GPT-2's regex:
    "pre_tokenizer": {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": true
    }
  • StableLM2 1.6B uses something else, which is similar to (but not quite the same as) Llama 3's:
    "pre_tokenizer": {
      "type": "Sequence",
      "pretokenizers": [
        {
          "type": "Split",
          "pattern": {
            "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
          },
          "behavior": "Removed",
          "invert": true
        },
        {
          "type": "ByteLevel",
          "add_prefix_space": false,
          "trim_offsets": true,
          "use_regex": false
        }
      ]
    }

I think I can fully automate this entire process and reduce the complexity. Not sure yet. Need to experiment.

Yes, this is definitely possible. A starting point would be to compare the pre_tokenizer field of all the tokenizer.json files fetched by convert-hf-to-gguf-update.py, then figure out which ones are the same, find a way to normalize the same-but-different ones (like \r\n vs \\r\\n in the regex).

The question probably is then: can pre-tokenizers entirely be identified by the pre_tokenizer field from tokenizer.json?

(EDIT: maybe this would be problematic with some models which use a custom tokenizer like in _set_vocab_qwen with trust_remote_code=True... In that case a pre-tokenizer could be hardcoded for these models, maybe.)
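A quick sketch of what that comparison could look like, assuming the tokenizer.json files have already been downloaded by convert-hf-to-gguf-update.py into models/tokenizers/ (the path and the grouping approach are assumptions, not an agreed design):

    import json
    from pathlib import Path

    def canonical_pre_tokenizer(tokenizer_json: Path) -> str:
        """Return a comparable form of the pre_tokenizer field."""
        with open(tokenizer_json, encoding="utf-8") as f:
            data = json.load(f)
        # Sorting keys makes structurally identical configs compare equal;
        # normalizing same-but-different regexes (\r\n vs \\r\\n) needs extra handling.
        return json.dumps(data.get("pre_tokenizer"), sort_keys=True)

    groups: dict[str, list[str]] = {}
    for path in Path("models/tokenizers").glob("*/tokenizer.json"):
        groups.setdefault(canonical_pre_tokenizer(path), []).append(path.parent.name)

    for pre, names in groups.items():
        print(names, "->", pre[:80])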

@aahouzi
Contributor

aahouzi commented May 17, 2024

@compilade for StableLM3B, the picked hash is from olmo's pre-tokenizer

@teleprint-me
Contributor Author

teleprint-me commented May 17, 2024

maybe this would be problematic with some models which use a custom tokenizer like in _set_vocab_qwen with trust_remote_code=True... In that case a pre-tokenizer could be hardcoded for these models, maybe.

@compilade Yeah, I tried this already, and it proved to be incredibly complicated. Many of us have already come to the conclusion that there is no reliable way to do this, so I'm thinking maybe we lean into that instead of veering away from it. A weakness can be utilized as a strength just as a perceived strength can be a weakness.

for StableLM3B, the picked hash is from olmo's pre-tokenizer

@aahouzi I didn't implement it yet. I've been observing PRs, attempting to identify a useful pattern.

@compilade
Collaborator

@compilade for StableLM3B, the picked hash is from olmo's pre-tokenizer

@aahouzi Which seems appropriate, considering OLMo also uses GPT-2's regex for its pre-tokenizer:

From the pre_tokenizer field of OLMo in https://huggingface.co/allenai/OLMo-7B-Instruct/raw/main/tokenizer.json:

"pre_tokenizer": {
  "type": "ByteLevel",
  "add_prefix_space": false,
  "trim_offsets": true,
  "use_regex": true
}

And OLMo uses GPT-2's regex in llama.cpp:

llama.cpp/llama.cpp, lines 12351 to 12356 at 0fc1e82:

    case LLAMA_VOCAB_PRE_TYPE_GPT2:
    case LLAMA_VOCAB_PRE_TYPE_OLMO:
        word_collection = unicode_regex_split(text, {
            "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        });
        break;

@teleprint-me
Contributor Author

I think I figured out how to automate the tokenizer, model, checksum, and conversions all in one go. Will close this PR and open a new PR when I'm ready to post for huggingface related tasks.

@ggerganov
Owner

The question probably is then: can pre-tokenizers entirely be identified by the pre_tokenizer field from tokenizer.json?

There is also the "normalizer" section that I think plays some role in some tokenizers - seems to be mainly utilized for embeddings models.

@teleprint-me
Contributor Author

There is also the "normalizer" section that I think plays some role in some tokenizers - seems to be mainly utilized for embeddings models.

The issue is determining what the normalizer is. Can we just assume it's NFD? It seems to have been working so far. Specify something else if it isn't?

The metadata for the tokenizer.json itself is inconsistent. It seems that the metadata in aggregate is more useful than it is individually, but even then, there are still missing pieces of information.

For example, the llama bpe normalizer is defined as "normalizer": null, and ends up not providing any useful information. The llama spm normalizer is not null and is defined as "type": "Sequence".

Same for other types of related metadata such as added tokens, special tokens, model type, etc.

It seems that the AutoTokenizer will already have most of the relevant information if it is available. The caveat is that this varies on a case-by-case basis.

@teleprint-me
Contributor Author

teleprint-me commented May 18, 2024

Hm.

>>> tokenizer.backend_tokenizer.normalizer
>>> type(tokenizer.backend_tokenizer.normalizer)
<class 'NoneType'>
>>> from tokenizers import normalizers
>>> from tokenizers.normalizers import NFD, StripAccents
>>> normalizer = normalizers.Sequence([NFD(), StripAccents()])
>>> tokenizer.backend_tokenizer.normalizer = normalizer
>>> tokenizer.backend_tokenizer.normalizer
<tokenizers.normalizers.Sequence object at 0x76840738e1f0>
>>> tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?")
'Hello how are u?'

Not sure how reliable this would be? Plus, this is only really needed for the conversion? The vocab is already pre-existing, extracted, and then written to the model file. So, what's the plan here?

I guess it depends on what we're looking to do.

  • Does the text need to be normalized, e.g. "cleaned"?
  • Does the text need to be pre-tokenized, e.g. "split on boundaries"?
  • Is the model's integrity intact, e.g. "hashsum"?
  • Determine the model's tokenizer type, e.g. "BPE".

These are just some off-the-cuff check boxes.
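As a rough illustration only (the function name and return shape are made up, and this assumes a tokenizer.json is present in the model directory), those checks could start from something like:

    import hashlib
    import json
    from pathlib import Path

    def inspect_tokenizer(model_dir: str) -> dict:
        path = Path(model_dir) / "tokenizer.json"
        data = json.loads(path.read_text(encoding="utf-8"))
        return {
            # "hashsum": integrity check of the tokenizer definition itself
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            # tokenizer type, e.g. "BPE", "WordPiece", "Unigram"
            "model_type": (data.get("model") or {}).get("type"),
            # may be null (as with llama-bpe) or a Sequence (as with llama-spm)
            "normalizer": data.get("normalizer"),
            # the field discussed above for identifying the pre-tokenizer
            "pre_tokenizer": data.get("pre_tokenizer"),
        }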

TBH, it's confusing because tokenizer.ggml.pre should be tokenizer.ggml.type instead, which would more clearly communicate what we're looking for.

e.g., What kind of tokenizer does the model depend on?

I'm just rolling with it right now.

The remaining metadata really just depends on the intent and purpose, most of which is already utilized.

@teleprint-me
Contributor Author

Superseded by PR #7379

teleprint-me deleted the add-stablelm-hash branch May 20, 2024 18:53
Labels
enhancement (New feature or request), review complexity: medium (generally requires more time to grok but manageable by beginner to medium expertise level)