
chore: Add model vocab support #7117

Closed
wants to merge 70 commits

Conversation

teleprint-me
Contributor

teleprint-me commented May 7, 2024

ref #6920 comment

supersedes #7018

Adds the following models:

  • phi
  • stablelm
  • qwen (qwen2 supersedes qwen)
  • mistral
  • mixtral

Adds the following extras:

  • Adds the stablelm vocab
  • Adds the generate-vocab.sh script
  • Adds ability to generate the generate-vocab.sh script

Collaborator

Galunid left a comment


Is this work in progress? It seems to be missing the implementation in llama.cpp. Those tokenizers won't be recognized, leading to a crash with a runtime error.

@teleprint-me
Contributor Author

@Galunid Yes, it's still a work in progress. I was passing out while still implementing because I wanted to get it out of the way, so I decided to pause for a bit.

ggerganov marked this pull request as draft May 7, 2024 17:49
teleprint-me changed the title from "chore: Add stablelm vocab" to "chore: Add model vocab support" May 7, 2024
Contributor

github-actions bot commented May 8, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 547 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8559.1ms p(95)=21749.4ms fails=, finish reason: stop=492 truncated=55
  • Prompt processing (pp): avg=102.49tk/s p(95)=412.6tk/s
  • Token generation (tg): avg=34.05tk/s p(95)=46.61tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=add-stablelm-hash commit=9269594919bb9952b176c70606185f805a932ed7

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations.]

@teleprint-me
Contributor Author

teleprint-me commented May 8, 2024

Possible regression with llama-spm?

      Start 11: test-tokenizer-1-llama-spm
11/24 Test #11: test-tokenizer-1-llama-spm .......Subprocess aborted***Exception:   0.86 sec

Looking into it.

@CISC
Contributor

CISC commented May 8, 2024

@teleprint-me Shouldn't qwen be removed since #7114 got merged? They are the same.

@teleprint-me
Contributor Author

teleprint-me commented May 8, 2024

@CISC

I had already committed it and was waiting to be able to merge the PRs into this branch. I can remove it after some testing if it really isn't needed, but the pattern is useful for seeing the potential variety of implementations we'll be dealing with. The original Qwen repos use BPE with tiktoken and a build script that is hooked into the transformers tokenizer. While the vocab might be the same, the process for getting it is completely different. It's good to know about, even if the tokenizer itself isn't useful in this context.

@teleprint-me
Contributor Author

teleprint-me commented May 17, 2024

Hm. I messed up the merge. It usually auto-rebases downstream merges. That's no good.

@aahouzi
Contributor

aahouzi commented May 17, 2024

@teleprint-me Your PR is not working for stablelm models. Why is the pre-tokenizer stablelm? If I remember correctly, it should be gpt-2, right?


@teleprint-me
Contributor Author

teleprint-me commented May 17, 2024

@aahouzi Yes, you're correct. I think I know why this is happening.

I'm wondering if it's because of the _set_vocab_gpt2 method.

    def _set_vocab_gpt2(self) -> None:
        tokens, toktypes, tokpre = self.get_vocab_base()
        self.gguf_writer.add_tokenizer_model("gpt2")
        self.gguf_writer.add_tokenizer_pre(tokpre)
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_types(toktypes)

add_tokenizer_pre is called, but tokpre is just whatever value was detected and passed in.

I fixed phi-2 by directly adding self.gguf_writer.add_tokenizer_pre("gpt-2") to it, though I'm unsure whether that change belongs here.

Overriding on a model-by-model basis as needed might be a better approach?

Need feedback on how this might affect other models. @compilade
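For reference, a rough sketch of what such a per-model override could look like, following the class pattern in convert-hf-to-gguf.py; the decorator, class name, and body here are illustrative assumptions, not the actual patch:

    # Hypothetical sketch: pin the pre-tokenizer for a specific model class
    # instead of relying on the hash-detected value.
    @Model.register("PhiForCausalLM")
    class Phi2Model(Model):
        model_arch = gguf.MODEL_ARCH.PHI2

        def set_vocab(self):
            tokens, toktypes, _tokpre = self.get_vocab_base()
            self.gguf_writer.add_tokenizer_model("gpt2")
            # Override the detected pre-tokenizer with a known-good id.
            self.gguf_writer.add_tokenizer_pre("gpt-2")
            self.gguf_writer.add_token_list(tokens)
            self.gguf_writer.add_token_types(toktypes)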

@teleprint-me
Contributor Author

teleprint-me commented May 17, 2024

Okay, I get it now. This confirms my initial intuition about how the mapping was set up in the update script. This needs to be refactored somehow. We can't rely on a name as an id for the model's pre-tokenizer.

    # used for GPT-2 BPE and WordPiece vocabs
    def get_vocab_base(self) -> tuple[list[str], list[int], str]:
        tokens: list[str] = []
        toktypes: list[int] = []

        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
        vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab))
        assert max(tokenizer.vocab.values()) < vocab_size

        tokpre = self.get_vocab_base_pre(tokenizer)

        # omitting for brevity

        return tokens, toktypes, tokpre

The model's vocab may be modified downstream, and conversions will then fail even though the architecture is clearly supported. This creates name conflicts. It turns out the phi-2 bug in #7219 and #7300 is a symptom of a more deeply rooted issue. @ggerganov
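For context, the detection that makes this fragile works roughly as follows (a simplified sketch of the get_vocab_base_pre logic generated by convert-hf-to-gguf-update.py; the hash strings below are placeholders, not real values): it encodes a fixed test string, hashes the resulting token ids, and maps that hash to a name, so any downstream change to the vocab produces an unknown hash even for a supported architecture.

    # Simplified sketch, not the exact generated code.
    from hashlib import sha256

    def get_vocab_base_pre(tokenizer) -> str:
        chktxt = "..."  # the fixed test string shared with convert-hf-to-gguf-update.py
        chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()

        if chkhsh == "<hash recorded for llama-bpe>":
            return "llama-bpe"
        if chkhsh == "<hash recorded for gpt-2>":
            return "gpt-2"

        # A fine-tune that modified the vocab hashes differently, so conversion
        # fails here even though the architecture itself is supported.
        raise NotImplementedError("BPE pre-tokenizer was not recognized")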

@teleprint-me
Contributor Author

I think I can fully automate this entire process and reduce the complexity. Not sure yet. Need to experiment.

@compilade
Collaborator

compilade commented May 17, 2024

I fixed phi-2 by directly adding self.gguf_writer.add_tokenizer_pre("gpt-2") to it, unsure if this should be changed here though.

Phi-2 has this pre-tokenizer (from its tokenizer.json):

  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
  }

The "use_regex" here means to use the GPT-2 regex, so using the gpt-2 is correct.

Then override on a model-by-model basis as needed might be a better approach?

Hmm, no, I don't think this will work for architectures that are used with different pre-tokenizers. For example, the StableLMModel uses at least 2 different pre-tokenizers:

  • StableLM 3b uses GPT-2's regex:
    "pre_tokenizer": {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": true
    }
  • StableLM2 1.6B uses something else, which is similar to (but not quite the same as) Llama 3's:
    "pre_tokenizer": {
      "type": "Sequence",
      "pretokenizers": [
        {
          "type": "Split",
          "pattern": {
            "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
          },
          "behavior": "Removed",
          "invert": true
        },
        {
          "type": "ByteLevel",
          "add_prefix_space": false,
          "trim_offsets": true,
          "use_regex": false
        }
      ]
    }

I think I can fully automate this entire process and reduce the complexity. Not sure yet. Need to experiment.

Yes, this is definitely possible. A starting point would be to compare the pre_tokenizer field of all the tokenizer.json files fetched by convert-hf-to-gguf-update.py, then figure out which ones are the same, find a way to normalize the same-but-different ones (like \r\n vs \\r\\n in the regex).

The question probably is then: can pre-tokenizers entirely be identified by the pre_tokenizer field from tokenizer.json?

(EDIT: maybe this would be problematic with some models which use a custom tokenizer like in _set_vocab_qwen with trust_remote_code=True... In that case a pre-tokenizer could be hardcoded for these models, maybe.)
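A quick sketch of what that comparison could look like, assuming the tokenizer.json files have already been downloaded by convert-hf-to-gguf-update.py into models/tokenizers/ (the path and the grouping approach are assumptions, not an agreed design):

    import json
    from pathlib import Path

    def canonical_pre_tokenizer(tokenizer_json: Path) -> str:
        """Return a comparable form of the pre_tokenizer field."""
        with open(tokenizer_json, encoding="utf-8") as f:
            data = json.load(f)
        # Sorting keys makes structurally identical configs compare equal;
        # normalizing same-but-different regexes (\r\n vs \\r\\n) needs extra handling.
        return json.dumps(data.get("pre_tokenizer"), sort_keys=True)

    groups: dict[str, list[str]] = {}
    for path in Path("models/tokenizers").glob("*/tokenizer.json"):
        groups.setdefault(canonical_pre_tokenizer(path), []).append(path.parent.name)

    for pre, names in groups.items():
        print(names, "->", pre[:80])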

@aahouzi
Contributor

aahouzi commented May 17, 2024

@compilade for StableLM3B, the picked hash is from olmo's pre-tokenizer

@teleprint-me
Contributor Author

teleprint-me commented May 17, 2024

maybe this would be problematic with some models which use a custom tokenizer like in _set_vocab_qwen with trust_remote_code=True... In that case a pre-tokenizer could be hardcoded for these models, maybe.

@compilade Yeah, I tried this already, and it proved to be incredibly complicated. Many of us have already come to the conclusion that there is no reliable way to do this, so I'm thinking maybe we lean into that instead of veering away from it. A weakness can be utilized as a strength just as a perceived strength can be a weakness.

for StableLM3B, the picked hash is from olmo's pre-tokenizer

@aahouzi I didn't implement it yet. I've been observing PRs, attempting to identify a useful pattern.

@compilade
Collaborator

@compilade for StableLM3B, the picked hash is from olmo's pre-tokenizer

@aahouzi Which seems appropriate, considering OLMo also uses GPT-2's regex for its pre-tokenizer:

From the pre_tokenizer field of OLMo in https://huggingface.co/allenai/OLMo-7B-Instruct/raw/main/tokenizer.json:

"pre_tokenizer": {
  "type": "ByteLevel",
  "add_prefix_space": false,
  "trim_offsets": true,
  "use_regex": true
}

And OLMo uses GPT-2's regex in llama.cpp:

llama.cpp/llama.cpp, lines 12351 to 12356 at 0fc1e82:

    case LLAMA_VOCAB_PRE_TYPE_GPT2:
    case LLAMA_VOCAB_PRE_TYPE_OLMO:
        word_collection = unicode_regex_split(text, {
            "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        });
        break;

@teleprint-me
Contributor Author

I think I figured out how to automate the tokenizer, model, checksum, and conversions all in one go. Will close this PR and open a new PR when I'm ready to post for huggingface related tasks.

@ggerganov
Owner

The question probably is then: can pre-tokenizers entirely be identified by the pre_tokenizer field from tokenizer.json?

There is also the "normalizer" section that I think plays some role in some tokenizers - seems to be mainly utilized for embeddings models.

@teleprint-me
Contributor Author

There is also the "normalizer" section that I think plays some role in some tokenizers - seems to be mainly utilized for embeddings models.

The issue is determining what the normalizer is. Can we just assume it's NFD? It seems to have been working so far. Specify something else if it isn't?

The metadata for the tokenizer.json itself is inconsistent. It seems that the metadata in aggregate is more useful than it is individually, but even then, there are still missing pieces of information.

For example, the llama bpe normalizer is defined as "normalizer": null, and ends up not providing any useful information. The llama spm normalizer is not null and is defined as "type": "Sequence".

Same for other types of related metadata such as added tokens, special tokens, model type, etc.

It seems that the AutoTokenizer will already have most of the relevant information if it is available. The caveat is that this varies on a case-by-case basis.

@teleprint-me
Contributor Author

teleprint-me commented May 18, 2024

Hm.

>>> tokenizer.backend_tokenizer.normalizer
>>> type(tokenizer.backend_tokenizer.normalizer)
<class 'NoneType'>
>>> from tokenizers import normalizers
>>> from tokenizers.normalizers import NFD, StripAccents
>>> normalizer = normalizers.Sequence([NFD(), StripAccents()])
>>> tokenizer.backend_tokenizer.normalizer = normalizer
>>> tokenizer.backend_tokenizer.normalizer
<tokenizers.normalizers.Sequence object at 0x76840738e1f0>
>>> tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?")
'Hello how are u?'

Not sure how reliable this would be? Plus, this is only really needed for the conversion? The vocab is already pre-existing, extracted, and then written to the model file. So, what's the plan here?

I guess it depends on what we're looking to do.

  • Does the text need to be normalized, e.g. "cleaned"?
  • Does the text need to be pre-tokenized, e.g. "split on boundaries"?
  • Is the model's integrity intact, e.g. "hashsum"?
  • Determine the model's tokenizer type, e.g. "BPE".

These are just some off-the-cuff check boxes.
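As a rough illustration only (the function name and return shape are made up, and this assumes a tokenizer.json is present in the model directory), those checks could start from something like:

    import hashlib
    import json
    from pathlib import Path

    def inspect_tokenizer(model_dir: str) -> dict:
        path = Path(model_dir) / "tokenizer.json"
        data = json.loads(path.read_text(encoding="utf-8"))
        return {
            # "hashsum": integrity check of the tokenizer definition itself
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            # tokenizer type, e.g. "BPE", "WordPiece", "Unigram"
            "model_type": (data.get("model") or {}).get("type"),
            # may be null (as with llama-bpe) or a Sequence (as with llama-spm)
            "normalizer": data.get("normalizer"),
            # the field discussed above for identifying the pre-tokenizer
            "pre_tokenizer": data.get("pre_tokenizer"),
        }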

TBH, it's confusing because tokenizer.ggml.pre should be tokenizer.ggml.type instead, which would more clearly communicate what we're looking for.

e.g., What kind of tokenizer does the model depend on?

I'm just rolling with it right now.

The remaining metadata really just depends on the intent and purpose, most of which is already utilized.

@teleprint-me
Contributor Author

Superseded by PR #7379

teleprint-me deleted the add-stablelm-hash branch May 20, 2024 18:53
Labels
enhancement (New feature or request), review complexity: medium (generally requires more time to grok but manageable by beginner to medium expertise level)