server: avoid breaking KV cache when prompt >= n_ctx #6958

prfd · 2024-04-28T03:46:16Z

This is the simplest solution to #6855. It splits the truncate flag in two: shifted, set when the KV cache is shifted during inference, and truncated, set when truncation happens before inference. It also introduces an n_truncate flag to shift both the prompt and the cached tokens when prompt > n_ctx. Needs testing.
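To illustrate why shifting both the prompt and the cached tokens matters, here is a minimal sketch on plain token lists. It is not the server code: `drop_after_keep`, `common_prefix_len`, and the values of `n_keep` and `n_drop` are assumptions made up for the example, and the real server would also have to remove and shift the corresponding KV cache entries.

```python
def common_prefix_len(a, b):
    """Length of the shared leading run of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def drop_after_keep(tokens, n_keep, n_drop):
    """Remove n_drop tokens starting right after the first n_keep tokens."""
    return tokens[:n_keep] + tokens[n_keep + n_drop:]


n_ctx, n_keep, n_drop = 8, 2, 4   # hypothetical sizes for the example
cached = list(range(7))           # tokens already in the cache: 0..6
prompt = list(range(10))          # new prompt extends them; 10 tokens >= n_ctx

# Truncating only the prompt: the cache no longer matches past n_keep.
truncated_prompt = drop_after_keep(prompt, n_keep, n_drop)
reuse_naive = common_prefix_len(cached, truncated_prompt)

# Dropping the same span from the cached tokens keeps the two aligned.
shifted_cache = drop_after_keep(cached, n_keep, n_drop)
reuse_aligned = common_prefix_len(shifted_cache, truncated_prompt)

print(reuse_naive, reuse_aligned)  # 2 3
```

With the aligned cache, only the three tokens that were newly appended to the prompt still need to be evaluated, instead of everything after the first `n_keep` tokens.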

github-actions · 2024-04-28T04:00:35Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 551 iterations 🚀

Concurrent users: 8, duration: 10m
HTTP request : avg=8461.28ms p(95)=19971.9ms fails=, finish reason: stop=487 truncated=64
Prompt processing (pp): avg=100.81tk/s p(95)=485.98tk/s
Token generation (tg): avg=32.43tk/s p(95)=48.32tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=93af09a030fb0c5a61cbe7f975edc7cc379fe126

[Chart: llamacpp:prompt_tokens_seconds over time (x-axis 1714702187 --> 1714702809); llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations]

[Chart: llamacpp:predicted_tokens_seconds over time (x-axis 1714702187 --> 1714702809); llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations]

[Chart: llamacpp:kv_cache_usage_ratio over time (x-axis 1714702187 --> 1714702809); llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations]

[Chart: llamacpp:requests_processing over time (x-axis 1714702187 --> 1714702809); llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations]

JohannesGaessler

Based on static code analysis I think the code works. I have comparatively little experience with the server so it would be useful if someone else were to also check. Also it would be good to have an automated test for context shifting but I don't know how we could write one.

Slight nitpick: I personally prefer explicit bool <-> int conversions.

examples/server/server.cpp

prfd · 2024-04-28T22:44:57Z

> Based on static code analysis I think the code works. I have comparatively little experience with the server so it would be useful if someone else were to also check. Also it would be good to have an automated test for context shifting but I don't know how we could write one.
>
> Slight nitpick: I personally prefer explicit bool <-> int conversions.

Yeah, writing a test for shifting itself is a bit tricky, but not for the truncation feature.
Basically, it should work like this:

1. Send a completion request with cache_prompt: true and make sure no shifting/truncation happens.
2. Send a completion request with prompt > ctx.
3. Check how many tokens were evaluated; it should equal the number of tokens added in step 2.
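The three steps can be walked through against a toy model. This is only a sketch: `ToySlot`, its truncation policy, and the result keys are hypothetical stand-ins rather than the server's implementation, generation is ignored, and step 2 is assumed to extend the prompt from step 1.

```python
from dataclasses import dataclass, field


def common_prefix_len(a, b):
    """Length of the shared leading run of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class ToySlot:
    """Toy stand-in for one server slot; it only tracks token ids."""
    n_ctx: int
    n_keep: int
    cache: list = field(default_factory=list)

    def complete(self, prompt):
        tokens = list(prompt)
        truncated = len(tokens) >= self.n_ctx
        if truncated:
            # Hypothetical policy: shrink the prompt to n_keep plus half of
            # the remaining context by cutting a span right after n_keep.
            n_left = self.n_ctx - self.n_keep
            n_drop = len(tokens) - (self.n_keep + n_left // 2)
            cut = slice(self.n_keep, self.n_keep + n_drop)
            del tokens[cut]
            # Cut the same span from the cached tokens so they stay aligned.
            del self.cache[cut]
        reused = common_prefix_len(self.cache, tokens)
        self.cache = tokens
        return {"tokens_evaluated": len(tokens) - reused, "truncated": truncated}


slot = ToySlot(n_ctx=16, n_keep=4)

# Step 1: prompt fits in the context, so nothing is truncated.
first = list(range(12))
step1 = slot.complete(first)

# Step 2: the same prompt plus 6 new tokens, which now exceeds n_ctx.
added = list(range(12, 18))
step2 = slot.complete(first + added)

# Step 3: only the tokens added in step 2 should have been evaluated.
print(step1, step2, step2["tokens_evaluated"] == len(added))
```

A real test would send the two requests to a running server and read the evaluated-token count from its responses; the toy model only shows which numbers the assertion in step 3 compares.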

prfd added 2 commits April 28, 2024 00:34

server: avoid breaking KV cache when prompt >= n_ctx

91d94ee

fix typo

0c115da

prfd mentioned this pull request Apr 28, 2024

server: avoid full prompt eval when 'prompt >= ctx' #6855

Open

JohannesGaessler reviewed Apr 28, 2024

3 review comments on examples/server/server.cpp (outdated, resolved)

prfd added 3 commits May 2, 2024 21:28

don't shift if there's no truncation

4a471b1

add test

a772cde

Merge branch 'master' into master

93af09a

prfd marked this pull request as ready for review May 3, 2024 16:57

mofosyne added the "enhancement" and "review complexity : medium" labels May 9, 2024

prfd marked this pull request as draft May 10, 2024 14:48

CRLF -> LF

aa6f4c2

prfd force-pushed the master branch from 339b2a5 to aa6f4c2 Compare May 10, 2024 15:06

prfd marked this pull request as ready for review May 12, 2024 14:59
