
ggml : add RPC backend #6829

Merged
merged 19 commits into from
May 14, 2024

Conversation

rgerganov
Collaborator

@rgerganov rgerganov commented Apr 22, 2024

This PR transfers the work started in ggml PR 761 here. It adds an RPC backend which proxies all backend operations to a remote server running a regular backend (CPU, CUDA, Metal, etc.). The general idea is to allow distributed LLM inference using multiple hosts with different kinds of hardware.
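Conceptually, each call into the ggml backend interface gets serialized and executed on the server. A hypothetical sketch of the shape of this (the names, command IDs, and helper below are stand-ins, not the actual ggml-rpc code):

```cpp
#include <cstddef>

struct ggml_tensor;  // from ggml.h

// Stand-ins for whatever command set the backend actually serializes.
enum rpc_cmd { RPC_SET_TENSOR, RPC_GET_TENSOR, RPC_GRAPH_COMPUTE };

static void rpc_send_cmd(rpc_cmd cmd, const void * data, size_t size) {
    // ... serialize {cmd, size, data}, write it to the server socket and
    // block for the acknowledgement (omitted in this sketch).
    (void) cmd; (void) data; (void) size;
}

// The RPC backend implements the regular backend interface, but forwards
// each operation to the remote server instead of touching local memory.
static void rpc_buffer_set_tensor(struct ggml_tensor * tensor, const void * data,
                                  size_t offset, size_t size) {
    (void) tensor; (void) offset;
    rpc_send_cmd(RPC_SET_TENSOR, data, size);
}
```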

This is a sample run which splits the layers of a 7B F16 model across two servers, allocating 7 GB on the first and 6.5 GB on the second:

```
...
llm_load_tensors: ggml ctx size =    0,44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   296,88 MiB
llm_load_tensors:        RPC buffer size =  7072,53 MiB
llm_load_tensors:        RPC buffer size =  6537,36 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        RPC KV buffer size =    34,00 MiB
llama_kv_cache_init:        RPC KV buffer size =    30,00 MiB
llama_new_context_with_model: KV self size  =   64,00 MiB, K (f16):   32,00 MiB, V (f16):   32,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,14 MiB
llama_new_context_with_model:       RPC0 compute buffer size =    73,00 MiB
llama_new_context_with_model:       RPC1 compute buffer size =    82,22 MiB
llama_new_context_with_model:        CPU compute buffer size =     9,01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 3
...
```

Current limitations:

  • Quantized models are not supported
  • Pipeline parallelism is not currently supported
  • Server endpoints are hardcoded in ggml-rpc.cpp

Building:

  1. Build the main example with `cmake -DLLAMA_RPC=ON ..`
  2. Build `rpc-server` in a separate dir, adding the flag for the corresponding backend, e.g. `cmake -DLLAMA_RPC=ON -DLLAMA_CUDA=ON ..`

@rgerganov rgerganov marked this pull request as draft April 22, 2024 14:53
@phymbert
Collaborator

@rgerganov Nice to meet you :D

@sorasoras

In theory, could this PR allow GPU inference across different APIs? I have a P40 and a 7900 XTX. Could they work together, each with its own API?

@rgerganov
Collaborator Author

In theory, could this PR allow GPU inference across different APIs?

Yes, you can use different backend implementations running on different machines. Build an rpc-server for each configuration and run them on the same local network. The main example should be configured with the IP:port of each rpc-server, and it should be able to offload model layers to them.
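At this stage the endpoints are hardcoded in ggml-rpc.cpp (see the limitations above), so a multi-server setup amounts to something like the following hypothetical sketch; the variable name and addresses are illustrative, not the actual code:

```cpp
// Hypothetical sketch of a hardcoded endpoint list in ggml-rpc.cpp;
// the real variable name and format in this PR may differ.
static const char * RPC_SERVERS[] = {
    "192.168.1.10:50052",   // host running a CUDA rpc-server
    "192.168.1.11:50052",   // host running a Metal rpc-server
};
```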

@ggerganov
Owner

It would be useful to add a CI workflow that builds the RPC backend. No need to run tests for now; just make sure the build succeeds.

@rgerganov
Collaborator Author

I tried to implement this without gRPC, using only the socket API: https://github.com/rgerganov/llama.cpp/tree/socket-rpc
Unfortunately, this implementation performs much worse than the gRPC one. When I run rpc-server on localhost, I get 25 t/s with gRPC and 15 t/s with my custom socket RPC, using the same model. I don't think my serialization is much worse than protobuf's, so I guess I am doing the networking part wrong.

I don't like adding gRPC as a build-time dependency, but it looks like this is not trivial to implement from scratch, even for simple synchronous APIs ...
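For reference, the synchronous protocol such an implementation needs can be as simple as length-prefixed messages over a blocking socket. A minimal sketch (illustrative only, not the actual ggml-rpc wire format; POSIX sockets, minimal error handling):

```cpp
#include <cstddef>
#include <cstdint>
#include <unistd.h>
#include <vector>

// Send the payload size first, then the payload; partial writes are
// retried so the frame always goes out whole.
static bool send_data(int sockfd, const void * data, size_t size) {
    uint64_t size64 = size;
    if (write(sockfd, &size64, sizeof(size64)) != (ssize_t) sizeof(size64)) return false;
    const uint8_t * buf = (const uint8_t *) data;
    size_t sent = 0;
    while (sent < size) {
        ssize_t n = write(sockfd, buf + sent, size - sent);
        if (n <= 0) return false;
        sent += (size_t) n;
    }
    return true;
}

// Read the size prefix, then loop until the full payload has arrived.
static bool recv_data(int sockfd, std::vector<uint8_t> & out) {
    uint64_t size64 = 0;
    if (read(sockfd, &size64, sizeof(size64)) != (ssize_t) sizeof(size64)) return false;
    out.resize(size64);
    size_t received = 0;
    while (received < size64) {
        ssize_t n = read(sockfd, out.data() + received, size64 - received);
        if (n <= 0) return false;
        received += (size_t) n;
    }
    return true;
}
```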

@ggerganov
Owner

Unfortunately, this implementation performs much worse than the gRPC one.

Long shot, but does it help if you disable Nagle's algorithm for the socket: https://stackoverflow.com/a/17843292/4039976

@rgerganov
Collaborator Author

Long shot, but does it help if you disable Nagle's algorithm for the socket

Spot on! Setting TCP_NODELAY is a game changer:

CUDA backend: 48 t/s
RPC backend with gRPC: 25 t/s
RPC backend with socket-rpc: 15 t/s
RPC backend with socket-rpc and setting TCP_NODELAY: 43 t/s

gRPC also sets this by default.
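For reference, a minimal sketch of the fix on a POSIX socket (error handling trimmed):

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm so small writes are sent immediately instead
// of being coalesced -- this matters a lot for chatty RPC round-trips.
static bool set_tcp_no_delay(int sockfd) {
    int flag = 1;
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) == 0;
}
```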

@rgerganov
Collaborator Author

I will continue working with my custom socket RPC in this PR. The previous gRPC implementation is still available at https://github.com/rgerganov/llama.cpp/tree/grpc

Contributor

github-actions bot commented Apr 29, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 539 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8690.33ms p(95)=20965.24ms fails=, finish reason: stop=485 truncated=54
  • Prompt processing (pp): avg=98.38tk/s p(95)=362.75tk/s
  • Token generation (tg): avg=45.6tk/s p(95)=45.72tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=rpc commit=1519cb4582db5966656b889dda419baead501c31

[Benchmark charts: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 539 iterations; series: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

@rgerganov rgerganov marked this pull request as ready for review April 30, 2024 10:25
@slaren
Collaborator

slaren commented Apr 30, 2024

llama_max_devices should be updated to return some value higher than 1 when building with RPC. We should probably remove this function or make it always return the same value, but for now, for consistency, it needs to return the maximum number of devices, since llama_model_params::tensor_split is documented to have size llama_max_devices.

@rgerganov
Collaborator Author

It returns GGML_RPC_MAX_SERVERS now, which is set to 16.
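To illustrate the contract (a sketch assuming the llama.h API in this PR; the split values are made up):

```cpp
#include <vector>
#include "llama.h"

// tensor_split must have llama_max_devices() entries -- with the RPC
// backend that is GGML_RPC_MAX_SERVERS (16) -- one proportion per device.
llama_model_params make_rpc_model_params() {
    static std::vector<float> split(llama_max_devices(), 0.0f);
    split[0] = 7.0f;   // illustrative share for the first RPC server
    split[1] = 6.5f;   // illustrative share for the second
    llama_model_params mparams = llama_model_default_params();
    mparams.tensor_split = split.data();
    return mparams;
}
```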

@rgerganov
Collaborator Author

Thanks for the reviews. I will continue working on this next week. I need to address a couple of TODOs, add Windows support, fix some resource leaks, and add a README.

@rgerganov rgerganov force-pushed the rpc branch 2 times, most recently from 97c64a5 to a5f81d3 Compare May 9, 2024 12:52
@mofosyne mofosyne added the Review Complexity : High (generally requires in-depth knowledge of LLMs or GPUs) label May 9, 2024
@ggerganov ggerganov merged commit 5e31828 into ggerganov:master May 14, 2024
64 of 66 checks passed
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 17, 2024
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
```cpp
}
printf("Accepted client connection, free_mem=%zu, total_mem=%zu\n", free_mem, total_mem);
rpc_serve_client(backend, client_socket->fd, free_mem, total_mem);
printf("Client connection closed\n");
```

@chraac chraac May 18, 2024


Hi @rgerganov, thank you for your valuable contribution to enhancing the distribution capability in llama.cpp. I have been using your implementation for several days and have noticed an issue: when the client closes the connection, the server does not free the memory it has allocated.

Upon investigating the source code, I discovered that instead of releasing the memory, we simply exit the inner loop and immediately wait for a new connection.

I am wondering if we should track the ALLOC_BUFFER and FREE_BUFFER commands, specifically by maintaining a list of allocated buffers. This would allow us to free any remaining buffers once the client closes the connection.
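One rough sketch of that idea, assuming the public ggml-backend API (the command handling is reduced to comments):

```cpp
#include <unordered_set>
#include "ggml-backend.h"

// Track every buffer allocated on behalf of a client:
//   ALLOC_BUFFER -> buffers.insert(buf);
//   FREE_BUFFER  -> buffers.erase(buf); ggml_backend_buffer_free(buf);
// On disconnect, release whatever the client left behind.
static void free_leftover_buffers(std::unordered_set<ggml_backend_buffer_t> & buffers) {
    for (ggml_backend_buffer_t buf : buffers) {
        ggml_backend_buffer_free(buf);
    }
    buffers.clear();
}
```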


Update: I created a patch for this on my fork; maybe you can have a look and give some advice: https://github.com/chraac/llama.cpp/tree/dev-fix-rpc-mem-alloc

Contributor


@chraac just a suggestion - I think it would be easier for the maintainers if you open a PR directly with the changes, so the discussion goes there.



@chraac just a suggestion - I think it would be easier for the maintainers if you open a PR directly with the changes, so the discussion goes there.

Yeah, thanks, I already have a PR open and some discussions going on there.

mudler added a commit to mudler/LocalAI that referenced this pull request May 18, 2024
As #2324 introduced distributed inferencing, thanks to @rgerganov's implementation in ggerganov/llama.cpp#6829 in upstream llama.cpp, it is now possible to distribute the workload to remote llama.cpp gRPC servers.

This changeset now uses mudler/edgevpn to establish a secure, distributed network between the nodes using a shared token. The token is generated automatically when starting the server with the `--p2p` flag, and can be used by starting the workers with `local-ai worker p2p-llama-cpp-rpc`, passing the token via environment variable (TOKEN) or with args (--token).

As per how mudler/edgevpn works, a network is established between the server and the workers with DHT and mDNS discovery protocols; the llama.cpp RPC server is automatically started and exposed to the underlying p2p network so the API server can connect to it.

When the HTTP server is started, it discovers the workers in the network and automatically creates local port-forwards to the services. llama.cpp is then configured to use these services.

This feature is behind the "p2p" GO_FLAGS

Signed-off-by: Ettore Di Giacinto <[email protected]>
@elix1er

elix1er commented May 20, 2024

dont stop this candy

mudler added a commit to mudler/LocalAI that referenced this pull request May 20, 2024
…erence (#2343)

* feat(llama.cpp): Enable decentralized, distributed inference

As #2324 introduced distributed inferencing, thanks to @rgerganov's implementation in ggerganov/llama.cpp#6829 in upstream llama.cpp, it is now possible to distribute the workload to remote llama.cpp gRPC servers.

This changeset now uses mudler/edgevpn to establish a secure, distributed network between the nodes using a shared token. The token is generated automatically when starting the server with the `--p2p` flag, and can be used by starting the workers with `local-ai worker p2p-llama-cpp-rpc`, passing the token via environment variable (TOKEN) or with args (--token).

As per how mudler/edgevpn works, a network is established between the server and the workers with DHT and mDNS discovery protocols; the llama.cpp RPC server is automatically started and exposed to the underlying p2p network so the API server can connect to it.

When the HTTP server is started, it discovers the workers in the network and automatically creates local port-forwards to the services. llama.cpp is then configured to use these services.

This feature is behind the "p2p" GO_FLAGS

Signed-off-by: Ettore Di Giacinto <[email protected]>

* go mod tidy

Signed-off-by: Ettore Di Giacinto <[email protected]>

* ci: add p2p tag

Signed-off-by: Ettore Di Giacinto <[email protected]>

* better message

Signed-off-by: Ettore Di Giacinto <[email protected]>

---------

Signed-off-by: Ettore Di Giacinto <[email protected]>
truecharts-admin added a commit to truecharts/charts that referenced this pull request May 25, 2024
…6.0 by renovate (#22420)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [docker.io/localai/localai](https://togithub.com/mudler/LocalAI) | minor | `v2.15.0-cublas-cuda11-ffmpeg-core` -> `v2.16.0-cublas-cuda11-ffmpeg-core` |
| [docker.io/localai/localai](https://togithub.com/mudler/LocalAI) | minor | `v2.15.0-cublas-cuda11-core` -> `v2.16.0-cublas-cuda11-core` |
| [docker.io/localai/localai](https://togithub.com/mudler/LocalAI) | minor | `v2.15.0-cublas-cuda12-ffmpeg-core` -> `v2.16.0-cublas-cuda12-ffmpeg-core` |
| [docker.io/localai/localai](https://togithub.com/mudler/LocalAI) | minor | `v2.15.0-cublas-cuda12-core` -> `v2.16.0-cublas-cuda12-core` |
| [docker.io/localai/localai](https://togithub.com/mudler/LocalAI) | minor | `v2.15.0-ffmpeg-core` -> `v2.16.0-ffmpeg-core` |
| [docker.io/localai/localai](https://togithub.com/mudler/LocalAI) | minor | `v2.15.0` -> `v2.16.0` |

---

> [!WARNING]
> Some dependencies could not be looked up. Check the Dependency Dashboard for more information.

---

### Release Notes

<details>
<summary>mudler/LocalAI (docker.io/localai/localai)</summary>

### [`v2.16.0`](https://togithub.com/mudler/LocalAI/releases/tag/v2.16.0)

[Compare Source](https://togithub.com/mudler/LocalAI/compare/v2.15.0...v2.16.0)

![local-ai-release-2.16](https://togithub.com/mudler/LocalAI/assets/2420543/bd3a6ace-8aec-4ac7-b457-b3e8cb5bb29e)

##### Welcome to LocalAI's latest update!

##### 🎉🎉🎉 woot woot! So excited to share this release, a lot of new
features are landing in LocalAI!!!!! 🎉🎉🎉


![](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExZ2cycjRqbXFld2toenpqcjcyN3E3eWw1NHI5cm12Njc3Y2lzZWtyZyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/AR92HqL0HcenC/giphy.gif)

##### 🌟  Introducing Distributed Llama.cpp Inferencing

Now it is possible to distribute the inferencing workload across different workers with llama.cpp models!

This feature has landed with mudler/LocalAI#2324 and is based on the upstream work of [@rgerganov](https://togithub.com/rgerganov) in ggerganov/llama.cpp#6829.

**How it works:** a front-end server (LocalAI) manages requests compatible with the OpenAI API, and workers (llama.cpp) are used to distribute the workload. This makes it possible to run larger models split across different nodes!

##### How to use it

To start workers to offload the computation, you can run:

    local-ai llamacpp-worker <listening_address> <listening_port>

However, you can also follow the llama.cpp README and build the rpc-server (https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md), which is still compatible with LocalAI.

When starting the LocalAI server, which is going to accept the API requests, you can set a list of worker addresses with `LLAMACPP_GRPC_SERVERS`:

```bash
LLAMACPP_GRPC_SERVERS="address1:port,address2:port" local-ai run
```

At this point the workload hitting the LocalAI server should be distributed across the nodes!

##### 🤖 Peer2Peer llama.cpp

LocalAI is the **first** free, open source AI project offering complete, decentralized, private, peer2peer LLM inferencing on top of the libp2p protocol. There is no "public swarm" to offload the computation to; rather, it empowers you to build your own cluster of local and remote machines to distribute LLM computation.


![](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExZTdrZW9rc3hrMWxoZTV1OGo0ajF3d2MwMHFmeXVoMThqOGg1eHR4ZCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/q0KrtRcr10Bhu/giphy.gif)

This feature leverages the llama.cpp workload distribution explained just above, together with features from one of my other projects, https://github.com/mudler/edgevpn.

LocalAI builds on top of the two, and allows you to create a private peer2peer network between nodes, without the need to centralize connections or manually configure IP addresses: it unlocks totally decentralized, private, peer-to-peer inferencing capabilities. It also works behind NAT-ted networks (using DHT and mDNS as discovery mechanisms).

**How it works:** A pre-shared token can be generated and shared between
workers and the server to form a private, decentralized, p2p network.

You can see the feature in action here:


![output](https://togithub.com/mudler/LocalAI/assets/2420543/8ca277cf-c208-4562-8929-808b2324b584)

##### How to use it

1.  Start the server with `--p2p`:

```bash
./local-ai run --p2p

1:02AM INF loading environment variables from file envFile=.env
1:02AM INF Setting logging to info
1:02AM INF P2P mode enabled
1:02AM INF No token provided, generating one
1:02AM INF Generated Token:
XXXXXXXXXXX
1:02AM INF Press a button to proceed
```

A token is displayed; copy it and press enter.

You can re-use the same token later by restarting the server with `--p2ptoken` (or `P2P_TOKEN`).

2. Start the workers. Now you can copy the local-ai binary to other hosts and run as many workers as you want with that token:

```bash
TOKEN=XXX ./local-ai p2p-llama-cpp-rpc

1:06AM INF loading environment variables from file envFile=.env
1:06AM INF Setting logging to info
{"level":"INFO","time":"2024-05-19T01:06:01.794+0200","caller":"config/config.go:288","message":"connmanager disabled\n"}
{"level":"INFO","time":"2024-05-19T01:06:01.794+0200","caller":"config/config.go:295","message":" go-libp2p resource manager protection enabled"}
{"level":"INFO","time":"2024-05-19T01:06:01.794+0200","caller":"config/config.go:409","message":"max connections: 100\n"}
1:06AM INF Starting llama-cpp-rpc-server on '127.0.0.1:34371'
{"level":"INFO","time":"2024-05-19T01:06:01.794+0200","caller":"node/node.go:118","message":" Starting EdgeVPN network"}
create_backend: using CPU backend
Starting RPC server on 127.0.0.1:34371, backend memory: 31913 MB
2024/05/19 01:06:01 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details.
{"level":"INFO","time":"2024-05-19T01:06:01.805+0200","caller":"node/node.go:172","message":" Node ID: 12D3KooWJ7WQAbCWKfJgjw2oMMGGss9diw3Sov5hVWi8t4DMgx92"}
{"level":"INFO","time":"2024-05-19T01:06:01.806+0200","caller":"node/node.go:173","message":" Node Addresses: [/ip4/127.0.0.1/tcp/44931 /ip4/127.0.0.1/udp/33251/quic-v1/webtransport/certhash/uEiAWAhZ-W9yx2ZHnKQm3BE_ft5jjoc468z5-Rgr9XdfjeQ/certhash/uEiB8Uwn0M2TQBELaV2m4lqypIAY2S-2ZMf7lt_N5LS6ojw /ip4/127.0.0.1/udp/35660/quic-v1 /ip4/192.168.68.110/tcp/44931 /ip4/192.168.68.110/udp/33251/quic-v1/webtransport/certhash/uEiAWAhZ-W9yx2ZHnKQm3BE_ft5jjoc468z5-Rgr9XdfjeQ/certhash/uEiB8Uwn0M2TQBELaV2m4lqypIAY2S-2ZMf7lt_N5LS6ojw /ip4/192.168.68.110/udp/35660/quic-v1 /ip6/::1/tcp/41289 /ip6/::1/udp/33160/quic-v1/webtransport/certhash/uEiAWAhZ-W9yx2ZHnKQm3BE_ft5jjoc468z5-Rgr9XdfjeQ/certhash/uEiB8Uwn0M2TQBELaV2m4lqypIAY2S-2ZMf7lt_N5LS6ojw /ip6/::1/udp/35701/quic-v1]"}
{"level":"INFO","time":"2024-05-19T01:06:01.806+0200","caller":"discovery/dht.go:104","message":" Bootstrapping DHT"}
```

(Note you can also supply the token via args.)

At this point, you should see messages in the server logs stating that new workers have been found.

3. Now you can start doing inference as usual on the server (the node used in step 1).

Interested in trying it out? As we are still updating the documentation, you can read the full instructions in mudler/LocalAI#2343.

##### 📜 Advanced Function calling support with Mixed JSON Grammars

LocalAI gets better at function calling with mixed grammars!

With this release, LocalAI introduces a transformative capability: support for mixed JSON BNF grammars. It allows specifying a grammar for the LLM that can output structured JSON and free text.

**How to use it:**

To enable mixed grammars, you can set `function.mixed_mode = true` in the `YAML` configuration file, for example:

```yaml
function:
  # disable injecting the "answer" tool
  disable_no_action: true
  grammar:
    # This allows the grammar to also return messages
    mixed_mode: true
```

This feature significantly enhances LocalAI's ability to interpret and manipulate JSON data coming from the LLM through a more flexible and powerful grammar system. Users can now combine multiple grammar types within a single JSON structure, allowing for dynamic parsing and validation scenarios.

Grammars can also be turned off entirely, leaving the user to determine how the data from the LLM is parsed and correctly interpreted by LocalAI while remaining compliant with the OpenAI REST spec.

For example, to interpret Hermes results, one can just annotate regexes
in `function.json_regex_match` to extract the LLM response:

```yaml
function:
  grammar:
    disable: true
  # disable injecting the "answer" tool
  disable_no_action: true
  return_name_in_function_response: true
  json_regex_match:
  - "(?s)<tool_call>(.*?)</tool_call>"
  - "(?s)<tool_call>(.*?)"
  replace_llm_results:
  # Drop the scratchpad content from responses
  - key: "(?s)<scratchpad>.*</scratchpad>"
    value: ""
  replace_function_results:
  # Replace everything that is not JSON array or object, just in case.
  - key: '(?s)^[^{\[]*'
    value: ""
  - key: '(?s)[^}\]]*$'
    value: ""
  # Drop the scratchpad content from responses
  - key: "(?s)<scratchpad>.*</scratchpad>"
    value: ""
```

Note that regexes can still be used when mixed grammars are enabled.

This is especially important for models which do not support grammars, such as transformers or OpenVINO models, which can now also support function calling. As we update the docs, further documentation can be found in the PRs listed in the changelog below.

##### 🚀 New Model Additions and Updates


![local-ai-yi-updates](https://togithub.com/mudler/LocalAI/assets/2420543/5d646703-0c64-4299-b551-a39074f63c2d)

Our model gallery continues to grow with exciting new additions like Aya-35b, Mistral-0.3, and Hermes-Theta, plus updates to existing models, ensuring they remain at the cutting edge.

This release brings major enhancements to tool calling support. Besides working on making our default models in AIO images more performant, you can now try an enhanced out-of-the-box experience with function calling in the Hermes model family ([Hermes-2-Pro-Mistral](https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF) and [Hermes-2-Theta-Llama-3](https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF)).

##### Our LocalAI function model!


![local-ai-functioncall-model](https://togithub.com/mudler/LocalAI/assets/2420543/b2955459-49b6-4a57-96e8-242966ccef12)

I have fine-tuned a function call model specifically to leverage the grammar support of LocalAI to the fullest; you can find it in the model gallery already and on [huggingface](https://huggingface.co/mudler/LocalAI-Llama3-8b-Function-Call-v0.2).

##### 🔄 Single Binary Release: Simplified Deployment and Management

In our continuous effort to streamline the user experience and deployment process, LocalAI v2.16.0 proudly introduces a single binary release. This enhancement, thanks to [@sozercan](https://togithub.com/sozercan)'s contributions, consolidates all variants (CUDA and non-CUDA releases) and dependencies into one compact executable file.

This change simplifies the installation and update processes, reduces compatibility issues, and speeds up setup for new users and existing deployments, as binary releases are now more portable than ever!

##### 🔧 Bug Fixes and Improvements

A host of bug fixes have been implemented to ensure smoother operation
and integration. Key fixes include enhancements to the Intel build
process, stability adjustments for setuptools in Python backends, and
critical updates ensuring the successful build of p2p configurations.

##### Migrating Python Backends: From Conda to UV

LocalAI has migrated its Python backends from Conda to UV. This transition, thanks to [@cryptk](https://togithub.com/cryptk)'s contributions, enhances the efficiency and scalability of our backend operations. Users will experience faster setup times and reduced complexity, streamlining the development process and making it easier to manage dependencies across different environments.

##### 📣 Let's Make Some Noise!

A gigantic THANK YOU to everyone who’s contributed—your feedback, bug
squashing, and feature suggestions are what make LocalAI shine. To all
our heroes out there supporting other users and sharing their expertise,
you’re the real MVPs!

Remember, LocalAI thrives on community support - not big corporate bucks. If you love what we're building, show some love! A shoutout on social (@LocalAI_OSS and @mudler_it on twitter/X), joining our sponsors, or simply starring us on GitHub makes all the difference.

Also, if you haven't yet joined our Discord, come on over! Here's the
link: https://discord.gg/uJAeKSAGDy

Thanks a ton, and.. enjoy this release!

##### What's Changed

##### Bug fixes 🐛

- build: do not specify a BUILD_ID by default by @mudler in mudler/LocalAI#2284
- fix: add missing openvino/optimum/etc libraries for Intel, fixes #2289 by @cryptk in mudler/LocalAI#2292
- add setuptools for openvino by @fakezeta in mudler/LocalAI#2301
- fix: add setuptools to all requirements-intel.txt files for python backends by @cryptk in mudler/LocalAI#2333
- ci: correctly build p2p in GO_TAGS by @mudler in mudler/LocalAI#2369
- ci: generate specific image for intel builds by @mudler in mudler/LocalAI#2374
- fix: stablediffusion binary by @sozercan in mudler/LocalAI#2385

##### Exciting New Features 🎉

- feat: migrate python backends from conda to uv by @cryptk in mudler/LocalAI#2215
- feat: create bash library to handle install/run/test of python backends by @cryptk in mudler/LocalAI#2286
- feat(grammar): support models with specific construct by @mudler in mudler/LocalAI#2291
- feat(ui): display number of available models for installation by @mudler in mudler/LocalAI#2298
- feat: auto select llama-cpp cpu variant by @sozercan in mudler/LocalAI#2305
- feat(llama.cpp): add `flash_attention` and `no_kv_offloading` by @mudler in mudler/LocalAI#2310
- feat(functions): support models with no grammar and no regex by @mudler in mudler/LocalAI#2315
- feat(functions): allow to set JSON matcher by @mudler in mudler/LocalAI#2319
- feat: auto select llama-cpp cuda runtime by @sozercan in mudler/LocalAI#2306
- feat(llama.cpp): add distributed llama.cpp inferencing by @mudler in mudler/LocalAI#2324
- feat(functions): mixed JSON BNF grammars by @mudler in mudler/LocalAI#2328
- feat(functions): simplify parsing, read functions as list by @mudler in mudler/LocalAI#2340
- feat(functions): Enable true regex replacement for the regexReplacement option by @lenaxia in mudler/LocalAI#2341
- feat(backends): add openvoice backend by @mudler in mudler/LocalAI#2334
- feat(webui): statically embed js/css assets by @mudler in mudler/LocalAI#2348
- feat(functions): allow to use JSONRegexMatch unconditionally by @mudler in mudler/LocalAI#2349
- feat(functions): don't use yaml.MapSlice by @mudler in mudler/LocalAI#2354
- build: add sha by @mudler in mudler/LocalAI#2356
- feat(llama.cpp): Totally decentralized, private, distributed, p2p inference by @mudler in mudler/LocalAI#2343
- feat(functions): relax mixedgrammars by @mudler in mudler/LocalAI#2365
- models(gallery): add mistral-0.3 and command-r, update functions by @mudler in mudler/LocalAI#2388

##### 🧠 Models

- models(gallery): add aloe by @mudler in mudler/LocalAI#2283
- models(gallery): add Llama-3-8B-Instruct-abliterated by @mudler in mudler/LocalAI#2288
- models(gallery): add l3-chaoticsoliloquy-v1.5-4x8b by @mudler in mudler/LocalAI#2295
- models(gallery): add jsl-medllama-3-8b-v2.0 by @mudler in mudler/LocalAI#2296
- models(gallery): add llama-3-refueled by @mudler in mudler/LocalAI#2297
- models(gallery): add aura-llama-Abliterated by @mudler in mudler/LocalAI#2309
- models(gallery): add Bunny-llama by @mudler in mudler/LocalAI#2311
- models(gallery): add lumimaidv2 by @mudler in mudler/LocalAI#2312
- models(gallery): add orthocopter by @mudler in mudler/LocalAI#2313
- fix(gallery): correct llama3-8b-instruct model file by @tannisroot in mudler/LocalAI#2330
- models(gallery): add hermes-2-theta-llama-3-8b by @mudler in mudler/LocalAI#2331
- models(gallery): add yi 6/9b, sqlcoder, sfr-iterative-dpo by @mudler in mudler/LocalAI#2335
- models(gallery): add anita by @mudler in mudler/LocalAI#2344
- models(gallery): add master-yi by @mudler in mudler/LocalAI#2345
- models(gallery): update poppy porpoise mmproj by @mudler in mudler/LocalAI#2346
- models(gallery): add LocalAI-Llama3-8b-Function-Call-v0.2-GGUF by @mudler in mudler/LocalAI#2355
- models(gallery): add stheno by @mudler in mudler/LocalAI#2358
- fix(gallery): checksum Meta-Llama-3-70B-Instruct.Q4_K_M.gguf (#2364) by @Nold360 in mudler/LocalAI#2366
- models(gallery): add phi-3-medium-4k-instruct by @mudler in mudler/LocalAI#2367
- models(gallery): add hercules and helpingAI by @mudler in mudler/LocalAI#2376
- ci(checksum_checker): do get sha from hf API when available by @mudler in mudler/LocalAI#2380
- models(gallery): ⬆️ update checksum by @localai-bot in mudler/LocalAI#2383
- models(gallery): ⬆️ update checksum by @localai-bot in mudler/LocalAI#2386
- models(gallery): add aya-35b by @mudler in mudler/LocalAI#2391

##### 📖 Documentation and examples

- docs: Update semantic-todo/README.md by @eltociear in mudler/LocalAI#2294
- Add Home Assistant Integration by @valentinfrlch in mudler/LocalAI#2387
- Add warning for running the binary on MacOS by @mauromorales in mudler/LocalAI#2389

##### 👒 Dependencies

- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2281
- ⬆️ Update docs version mudler/LocalAI by @localai-bot in mudler/LocalAI#2280
- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2285
- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2290
- feat(swagger): update swagger by @localai-bot in mudler/LocalAI#2302
- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2303
- ⬆️ Update ggerganov/whisper.cpp by @localai-bot in mudler/LocalAI#2317
- ⬆️ Update ggerganov/whisper.cpp by @localai-bot in mudler/LocalAI#2326
- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2316
- ⬆️ Update ggerganov/whisper.cpp by @localai-bot in mudler/LocalAI#2329
- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2337
- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2339
- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2342
- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2351
- ⬆️ Update ggerganov/whisper.cpp by @localai-bot in mudler/LocalAI#2352
- dependencies(grpcio): bump to fix CI issues by @mudler in mudler/LocalAI#2362
- deps(llama.cpp): update and adapt API changes by @mudler in mudler/LocalAI#2381
- ⬆️ Update ggerganov/whisper.cpp by @localai-bot in mudler/LocalAI#2361
- ⬆️ Update go-skynet/go-bert.cpp by @localai-bot in mudler/LocalAI#1225
- ⬆️ Update ggerganov/llama.cpp by @localai-bot in mudler/LocalAI#2360

##### Other Changes

- refactor: Minor improvements to BackendConfigLoader by @dave-gray101 in mudler/LocalAI#2353

##### New Contributors

- @tannisroot made their first contribution in mudler/LocalAI#2330
- @lenaxia made their first contribution in mudler/LocalAI#2341
- @valentinfrlch made their first contribution in mudler/LocalAI#2387

**Full Changelog**: mudler/LocalAI@v2.15.0...v2.16.0

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about these
updates again.

---

- [ ] If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate
Bot](https://togithub.com/renovatebot/renovate).
