
add chatglm3-6b model support [help wanted] #6999

Open

mnlife wants to merge 4 commits into master from chatglm3

Conversation
Conversation

@mnlife commented Apr 30, 2024

Text generation has been implemented.

The following known features have not been implemented yet (compared with the PyTorch version):

  • The model input is not wrapped with the prefix tokens {"[gMASK]", "sop", "<|user|>", "_", "<0x0A>"} and the suffix token {"<|assistant|>"}.
    • For example, when we input "hi", the tokenizer output should be {"[gMASK]", "sop", "<|user|>", "_", "<0x0A>", "hi", "<|assistant|>"}. What do we need to change in llama.cpp to implement this?
    • When I add 9a8db6b and run the command below, the changes do not take effect:
./build/bin/main -m ~/models/chatglm3-6b-Q4_K_M.gguf --verbose-prompt -p 你好
  • The inference results are incorrect with the CUDA build.
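
The prefix/suffix wrapping described above can be sketched in a few lines. This is a minimal illustration using token strings rather than real token IDs; `wrap_chatglm3_prompt` is a hypothetical helper for discussion, not an existing llama.cpp function:

```python
# Minimal sketch of the ChatGLM3 chat wrapping described above.
# `wrap_chatglm3_prompt` is a hypothetical helper; a real implementation in
# llama.cpp would operate on token IDs produced by the model's tokenizer.
def wrap_chatglm3_prompt(user_tokens):
    """Surround an already-tokenized user message with the special
    prefix/suffix tokens that the PyTorch version emits."""
    prefix = ["[gMASK]", "sop", "<|user|>", "_", "<0x0A>"]
    suffix = ["<|assistant|>"]
    return prefix + user_tokens + suffix

print(wrap_chatglm3_prompt(["hi"]))
# → ['[gMASK]', 'sop', '<|user|>', '_', '<0x0A>', 'hi', '<|assistant|>']
```

In llama.cpp this kind of wrapping is typically handled either by special-token handling during tokenization or by a chat template applied before tokenization, which is why the question above asks where the change belongs.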

Below are some links for the chatglm model:
Hugging Face model page for chatglm3-6b: https://huggingface.co/THUDM/chatglm3-6b
gguf model: https://modelscope.cn/api/v1/models/mnlife/chatglm3-6b-gguf/repo?Revision=master&FilePath=chatglm3-6b-Q4_K_M.gguf

github-actions bot commented Apr 30, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 544 iterations 🚀

Details:
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8601.62ms p(95)=21255.69ms fails=, finish reason: stop=497 truncated=47
  • Prompt processing (pp): avg=99.69tk/s p(95)=440.7tk/s
  • Token generation (tg): avg=36.76tk/s p(95)=48.73tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=chatglm3 commit=f4a6c2fe271b4937e118d66517bde1f5e706ba24

[Chart: llamacpp:prompt_tokens_seconds — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 544 iterations]
[Chart: llamacpp:predicted_tokens_seconds — same benchmark]
[Chart: llamacpp:kv_cache_usage_ratio — same benchmark]
[Chart: llamacpp:requests_processing — same benchmark]

@mofosyne added labels help wanted, enhancement, review complexity: medium on May 9, 2024
Review comments on convert-hf-to-gguf.py (resolved)
@mofosyne mofosyne self-assigned this May 10, 2024
@mofosyne mofosyne removed their assignment May 10, 2024
@mofosyne mofosyne marked this pull request as ready for review May 15, 2024 03:12
@mnlife force-pushed the chatglm3 branch 2 times, most recently from 8aee20e to cb324f4 on May 15, 2024 05:28
@mnlife force-pushed the chatglm3 branch 2 times, most recently from ed1d3ff to 9226518 on May 23, 2024 08:14
@github-actions bot added labels testing, python on May 23, 2024
Labels: enhancement, help wanted, python, review complexity: medium, testing

4 participants