server: avoid breaking KV cache when prompt >= n_ctx #6958

Open
wants to merge 6 commits into
base: master

@prfd prfd commented Apr 28, 2024

This is the simplest solution to #6855: split the truncate flag into shifted, set when the KV cache is shifted during inference, and truncated, set when truncation happens before inference. It also introduces an n_truncate flag to shift both the prompt and the cached tokens when the prompt > n_ctx. Needs testing.
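
For readers unfamiliar with the truncation step, here is a minimal sketch of truncating a prompt before inference. It is not the code in this PR. It assumes a scheme that keeps the first n_keep tokens and drops whole blocks of half the remaining context from the middle of the prompt. The names `truncate_result`, `truncate_prompt` and `n_dropped` are hypothetical, `n_keep` is an assumed parameter, and `token` stands in for `llama_token`.

```cpp
#include <cassert>
#include <vector>

using token = int; // stand-in for llama_token (assumption)

// Hypothetical result type, not taken from the PR.
struct truncate_result {
    std::vector<token> tokens;  // prompt after truncation
    bool truncated = false;     // true if tokens were dropped before inference
    int  n_dropped = 0;         // number of tokens dropped from the prompt
};

// Keep the first n_keep tokens, then drop whole blocks of n_left / 2 tokens
// after them until the prompt is shorter than n_ctx.
truncate_result truncate_prompt(const std::vector<token> & prompt, int n_ctx, int n_keep) {
    // The block size below must be at least 1.
    assert(n_keep >= 0 && n_ctx - n_keep >= 2);

    truncate_result res;
    const int n_prompt = static_cast<int>(prompt.size());
    if (n_prompt < n_ctx) {
        res.tokens = prompt; // fits, nothing to do
        return res;
    }

    const int n_left        = n_ctx - n_keep;
    const int n_block_size  = n_left / 2;
    const int erased_blocks = (n_prompt - n_keep - n_block_size) / n_block_size;
    const int n_dropped     = erased_blocks * n_block_size;

    // Head of the prompt, followed by everything after the erased blocks.
    res.tokens.assign(prompt.begin(), prompt.begin() + n_keep);
    res.tokens.insert(res.tokens.end(), prompt.begin() + n_keep + n_dropped, prompt.end());
    res.truncated = true;
    res.n_dropped = n_dropped;
    return res;
}
```

Under these assumptions, applying the same `n_dropped` offset to the cached tokens would keep the cache and the truncated prompt aligned, which appears to be the intent of shifting both.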

Contributor

github-actions bot commented Apr 28, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 551 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8461.28ms p(95)=19971.9ms fails=, finish reason: stop=487 truncated=64
  • Prompt processing (pp): avg=100.81tk/s p(95)=485.98tk/s
  • Token generation (tg): avg=32.43tk/s p(95)=48.32tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=93af09a030fb0c5a61cbe7f975edc7cc379fe126

prompt_tokens_seconds

[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations; y-axis llamacpp:prompt_tokens_seconds; x-axis timestamps 1714702187 --> 1714702809]
predicted_tokens_seconds

[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations; y-axis llamacpp:predicted_tokens_seconds; x-axis timestamps 1714702187 --> 1714702809]

Details

kv_cache_usage_ratio

[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations; y-axis llamacpp:kv_cache_usage_ratio; x-axis timestamps 1714702187 --> 1714702809]
requests_processing

[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations; y-axis llamacpp:requests_processing; x-axis timestamps 1714702187 --> 1714702809]

Collaborator

@JohannesGaessler JohannesGaessler left a comment

Based on static code analysis I think the code works. I have comparatively little experience with the server so it would be useful if someone else were to also check. Also it would be good to have an automated test for context shifting but I don't know how we could write one.

Slight nitpick: I personally prefer explicit bool <-> int conversions.

examples/server/server.cpp: three review threads (outdated, resolved)
Author

prfd commented Apr 28, 2024

> Based on static code analysis I think the code works. I have comparatively little experience with the server so it would be useful if someone else were to also check. Also it would be good to have an automated test for context shifting but I don't know how we could write one.
>
> Slight nitpick: I personally prefer explicit bool <-> int conversions.

Yeah, writing a test for shifting itself is a bit tricky, but not for the truncation feature.
Basically, it should work like this:

  1. Send a completion request with cache_prompt: true and make sure no shifting/truncation happens.
  2. Send a completion request with prompt > ctx.
  3. Check how many tokens were evaluated; it should equal the number of tokens added in step 2.
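
The check in step 3 can be modelled without a running server. The sketch below assumes that, with prompt caching enabled, only the part of the prompt that is not already a prefix of the cached tokens gets evaluated. The helper names `common_prefix_len` and `tokens_to_evaluate` are hypothetical, and `token` stands in for `llama_token`; this illustrates the expected count, it is not the server's implementation.

```cpp
#include <cstddef>
#include <vector>

using token = int; // stand-in for llama_token (assumption)

// Length of the prefix shared by the cached tokens and the new prompt.
std::size_t common_prefix_len(const std::vector<token> & cache,
                              const std::vector<token> & prompt) {
    std::size_t n = 0;
    while (n < cache.size() && n < prompt.size() && cache[n] == prompt[n]) {
        ++n;
    }
    return n;
}

// Number of prompt tokens that still have to be evaluated when the
// shared prefix is reused from the cache.
std::size_t tokens_to_evaluate(const std::vector<token> & cache,
                               const std::vector<token> & prompt) {
    return prompt.size() - common_prefix_len(cache, prompt);
}
```

For example, with cached tokens {1, 2, 3, 4} and a new prompt {1, 2, 3, 4, 5, 6}, `tokens_to_evaluate` returns 2, the number of tokens added. If truncation shifted the prompt but not the cache, the shared prefix would shrink and the count would be larger, which is the regression such a test would catch.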

@prfd prfd marked this pull request as ready for review May 3, 2024 16:57
@mofosyne mofosyne added enhancement New feature or request review complexity : medium Generally require more time to grok but manageable by beginner to medium expertise level labels May 9, 2024
@prfd prfd marked this pull request as draft May 10, 2024 14:48
@prfd prfd marked this pull request as ready for review May 12, 2024 14:59
3 participants