llama : disable pipeline parallelism with nkvo #7265

Merged: 1 commit, May 14, 2024

@slaren (Collaborator) commented May 13, 2024

Pipeline parallelism does not work when KV offload is disabled (nkvo), but enabling it still increases memory usage significantly.

Fixes #7217

Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 545 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8622.78ms p(95)=20906.05ms fails=, finish reason: stop=487 truncated=58
  • Prompt processing (pp): avg=102.77tk/s p(95)=468.37tk/s
  • Token generation (tg): avg=34.32tk/s p(95)=49.21tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sl/disable-pp-nkvo commit=94061d58e747f24e574d4de8cf8c0dcd1b89cc3e

[Chart: llamacpp:prompt_tokens_seconds over time (x-axis 1715636467 --> 1715637097); llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 545 iterations]
[Chart: llamacpp:predicted_tokens_seconds over time (x-axis 1715636467 --> 1715637097); llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 545 iterations]

[Chart: llamacpp:kv_cache_usage_ratio over time (x-axis 1715636467 --> 1715637097); llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 545 iterations]
[Chart: llamacpp:requests_processing over time (x-axis 1715636467 --> 1715637097); llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 545 iterations]

@mofosyne added the bugfix and Review Complexity : Low labels May 14, 2024
@mofosyne self-requested a review May 14, 2024 03:19
@mofosyne (Collaborator) left a comment

Observation: this PR adds a single extra check for params.offload_kqv to the pipeline_parallel condition.
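As a rough illustration of the kind of check described here, the sketch below gates a pipeline_parallel decision on offload_kqv. Only those two names come from this PR; the struct, its other fields, and the remaining conditions are hypothetical stand-ins, not the actual llama.cpp code.

```cpp
// Sketch only: a simplified stand-in for the decision described above.
// Only offload_kqv and pipeline_parallel are named in this PR; the struct
// and every other field are hypothetical.
struct sketch_params {
    int  n_devices;     // assumed: number of GPUs in use
    bool layers_split;  // assumed: model layers are split across those GPUs
    bool offload_kqv;   // false when KV offload is disabled (nkvo)
};

// Returns true only when the assumed multi-GPU conditions hold
// and KV offload is enabled.
inline bool pipeline_parallel(const sketch_params & params) {
    return params.n_devices > 1 &&
           params.layers_split &&
           params.offload_kqv;  // the kind of extra check this PR adds
}
```

With this shape, a dual-GPU configuration that disables KV offload never takes the pipeline-parallel path, so it would not pay the additional memory cost reported in the linked issue.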

@mofosyne added the merge ready label May 14, 2024
@mofosyne merged commit 5416002 into master May 14, 2024
66 checks passed
@mofosyne removed the merge ready label May 14, 2024
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 17, 2024
Labels: bugfix, Review Complexity : Low
Development

Successfully merging this pull request may close these issues.

NKVO argument leads to huge compute buffers in full Cublas offload on a heterogeneous dual GPU config.
3 participants