Third-party applications are overwhelmingly slow for subsequent prompt evaluation compared to examples/main and examples/server #7185

Open

khimaros opened this issue May 9, 2024 · 2 comments
Labels: enhancement (New feature or request)

Comments

@khimaros (Contributor) commented May 9, 2024

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Motivation

Third-party applications are overwhelmingly slow at subsequent prompt evaluation. Where a subsequent prompt in the examples/server web interface can be evaluated in seconds, longer chats in these applications can take several minutes just to begin generating additional text.

I believe there are two separate issues:

  • users of the OpenAI-compatible endpoint in examples/server are not taking advantage of the prompt cache
  • users of the llama-cpp-python high-level API (including the server it ships with) are not taking advantage of the prompt cache

Description

N.B. it is possible that this is only a documentation issue.

Request: provide a well-lit path for consumers of the llama.cpp API and the OpenAI-compatible examples/server endpoint to avoid reprocessing the full chat history on each subsequent prompt evaluation.
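
As an illustration of what such a path could look like from a client's side, here is a minimal sketch that forwards the server's cache_prompt slot parameter through the openai Python client's extra_body; the base_url, api_key, and model name are placeholders, and whether a given client even exposes something like extra_body is exactly the discoverability problem described here.

# Hedged sketch: forward llama.cpp's cache_prompt extension through the
# standard openai client. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-needed")

history = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=history,
    extra_body={"cache_prompt": True},  # llama.cpp server slot parameter
)
print(resp.choices[0].message.content)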

I suspect there is a usability or discoverability issue with the llama.cpp APIs that is leading to inefficient use of llama.cpp. I've tested many llama.cpp-based apps on Linux and Android (many listed in the README) and all of them struggle with this problem.

  • llama-cpp-python[server]
  • oobabooga/text-generation-webui
  • KoboldCpp
  • Mobile-Artificial-Intelligence/maid (using the examples/server API)
  • ztjhz/BetterChatGPT (using the examples/server API)

In the case of text-generation-webui and KoboldCpp, I tested both the built-in (llama-cpp-python based) inference and their use as API clients for the examples/server endpoint. Both suffer from this problem.

examples/main and examples/server are the only two pieces of software I've tested that handle this well, which makes these two simple examples the most performant way to interact with LLMs.

The high-level llama-cpp-python API seems to perpetuate this mistake, which has follow-on effects for downstream consumers such as oobabooga's text-generation-webui: abetlen/llama-cpp-python#181 (don't be fooled by the closed status; the issue persists).
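
For context, my understanding is that the high-level llama-cpp-python API does ship an opt-in prompt cache, but consumers have to discover and enable it themselves. A rough sketch (treat the class and method names as assumptions about the installed version, and the model path as a placeholder):

# Sketch only: opting in to llama-cpp-python's prompt cache.
# Assumption: Llama.set_cache() and LlamaCache exist in the installed version.
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/model.gguf", n_ctx=4096)  # placeholder path
llm.set_cache(LlamaCache())  # without this, each call re-evaluates the full prompt

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
)
print(out["choices"][0]["message"]["content"])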

khimaros added the enhancement label May 9, 2024

@khimaros (Contributor, Author) commented May 9, 2024

Digging a bit deeper into the reason for the speed of the examples/server frontend: it looks like this frontend uses the /completion API rather than the /v1/chat/completions API used by other OpenAI-compatible frontends. I suspect that the use of cache_prompt: true in its requests is also significant.
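
For comparison, a minimal sketch of that kind of /completion request with prompt caching enabled (the prompt text, server address, and parameter values are illustrative; the web UI sends a fuller set of sampling parameters):

# Rough sketch of a /completion-style request with prompt caching enabled.
import requests

payload = {
    "prompt": "You are a helpful AI assistant.\nUSER: What is 2 + 2?\nASSISTANT:",
    "n_predict": 128,
    "cache_prompt": True,  # reuse the KV cache for the shared prompt prefix
    "temperature": 0.8,
    "seed": 23,
}
r = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=600)
print(r.json()["content"])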

I tested with some manual curl requests, and it seems the cache_prompt parameter is also accepted on the /v1/chat/completions endpoint. I'll change the default slot value in examples/server/server.cpp and test a few clients to see if this helps with performance.

@khimaros (Contributor, Author) commented May 9, 2024

I'm testing with the following patch:

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index ff0814b2..0464280e 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -98,7 +98,7 @@ struct server_task_multi {
 
 struct slot_params {
     bool stream       = true;
-    bool cache_prompt = false; // remember the prompt to avoid reprocessing all prompt
+    bool cache_prompt = true; // remember the prompt to avoid reprocessing all prompt
 
     uint32_t seed      = -1; // RNG seed
     int32_t  n_keep    =  0; // number of tokens to keep from initial prompt
@@ -834,7 +834,7 @@ struct server_context {
         }
 
         slot.params.stream             = json_value(data, "stream",            false);
-        slot.params.cache_prompt       = json_value(data, "cache_prompt",      false);
+        slot.params.cache_prompt       = json_value(data, "cache_prompt",      true);
         slot.params.n_predict          = json_value(data, "n_predict",         default_params.n_predict);
         slot.sparams.top_k             = json_value(data, "top_k",             default_sparams.top_k);
         slot.sparams.top_p             = json_value(data, "top_p",             default_sparams.top_p);

This didn't help any of the clients that I tested.

Moving on to some manual testing with curl/hurl.

I'm sending what should be a purely additive sequence of requests (using a static seed), which seems like it should pull from the cache.

The first request:

POST http://127.0.0.1:8080/v1/chat/completions
{
        "model": "gpt-3.5-turbo",
        "cache_prompt": true,
        "messages": [
                {"role": "system", "content": "You are a helpful AI assistant."},
                {"role": "user", "content": "What is 2 + 2?"}
        ],
        "temperature": 0.8,
        "seed": 23,
        "top_p": 1.0,
        "min_p": 0.05,
        "stream": false,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "n": 1,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "mirostat_mode": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1
}

Response body:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The answer is 4! This is a very basic addition problem. Are you looking for help with any other simple math questions? \n",
        "role": "assistant"
      }
    }
  ],
  "created": 1715288809,
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 28,
    "prompt_tokens": 25,
    "total_tokens": 53
  },
  "id": "chatcmpl-Ikv3dt0Z3FerIQdxR0Kl99RaG2cVqCG5"
}

And the server log for that request:

{"tid":"140586246731584","timestamp":1715288785,"level":"INFO","function":"launch_slot_with_task","line":1036,"msg":"slot is processing task","id_slot":0,"id_task":255}
{"tid":"140586246731584","timestamp":1715288785,"level":"INFO","function":"update_slots","line":2043,"msg":"we have to evaluate at least 1 token to generate logits","id_slot":0,"id_task":255}
{"tid":"140586246731584","timestamp":1715288785,"level":"INFO","function":"update_slots","line":2087,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":255,"p0":24}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"print_timings","line":313,"msg":"prompt eval time     =     806.35 ms /     1 tokens (  806.35 ms per token,     1.24 tokens per second)","id_slot":0,"id_task":255,"t_prompt_processing":806.349,"n_prompt_tokens_processed":1,"t_token":806.349,"n_tokens_second":1.2401577976781766}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"print_timings","line":329,"msg":"generation eval time =   23563.84 ms /    28 runs   (  841.57 ms per token,     1.19 tokens per second)","id_slot":0,"id_task":255,"t_token_generation":23563.84,"n_decoded":28,"t_token":841.5657142857143,"n_tokens_second":1.1882613360131455}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"print_timings","line":340,"msg":"          total time =   24370.19 ms","id_slot":0,"id_task":255,"t_prompt_processing":806.349,"t_token_generation":23563.84,"t_total":24370.189}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"update_slots","line":1781,"msg":"slot released","id_slot":0,"id_task":255,"n_ctx":4096,"n_past":52,"n_system_tokens":0,"n_cache_tokens":52,"truncated":false}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"update_slots","line":1807,"msg":"all slots are idle"}
{"tid":"140551452063424","timestamp":1715288809,"level":"INFO","function":"log_server_request","line":2862,"msg":"request","remote_addr":"127.0.0.1","remote_port":33458,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

The subsequent request:

POST http://127.0.0.1:8080/v1/chat/completions
{
        "model": "gpt-3.5-turbo",
        "cache_prompt": true,
        "messages": [
                {"role": "system", "content": "You are a helpful AI assistant."},
                {"role": "user", "content": "What is 2 + 2?"},
                {"role": "assistant", "content": "The answer is 4! This is a very basic addition problem. Are you looking for help with any other simple math questions? \n"},
                {"role": "user", "content": "What is 4 + 4?"}
        ],
        "temperature": 0.8,
        "seed": 23,
        "top_p": 1.0,
        "min_p": 0.05,
        "stream": false,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "n": 1,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "mirostat_mode": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1
}

Response body:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The answer to this one is also 4! It's another straightforward addition problem: the two fours add up to make eight, and then half of eight is four. Easy peasy!",
        "role": "assistant"
      }
    }
  ],
  "created": 1715288852,
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 39,
    "prompt_tokens": 66,
    "total_tokens": 105
  },
  "id": "chatcmpl-2Thkbuwu0V4w4TgfqI14kEjnaKQeC129"
}

And the server log:

{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"launch_slot_with_task","line":1036,"msg":"slot is processing task","id_slot":0,"id_task":284}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"update_slots","line":2087,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":284,"p0":51}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"print_timings","line":313,"msg":"prompt eval time     =    7871.37 ms /    15 tokens (  524.76 ms per token,     1.91 tokens per second)","id_slot":0,"id_task":284,"t_prompt_processing":7871.371,"n_prompt_tokens_processed":15,"t_token":524.7580666666667,"n_tokens_second":1.9056400721043387}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"print_timings","line":329,"msg":"generation eval time =   34750.44 ms /    39 runs   (  891.04 ms per token,     1.12 tokens per second)","id_slot":0,"id_task":284,"t_token_generation":34750.439,"n_decoded":39,"t_token":891.0368974358973,"n_tokens_second":1.122287980304364}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"print_timings","line":340,"msg":"          total time =   42621.81 ms","id_slot":0,"id_task":284,"t_prompt_processing":7871.371,"t_token_generation":34750.439,"t_total":42621.81}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"update_slots","line":1781,"msg":"slot released","id_slot":0,"id_task":284,"n_ctx":4096,"n_past":104,"n_system_tokens":0,"n_cache_tokens":104,"truncated":false}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"update_slots","line":1807,"msg":"all slots are idle"}
{"tid":"140551452063424","timestamp":1715288852,"level":"INFO","function":"log_server_request","line":2862,"msg":"request","remote_addr":"127.0.0.1","remote_port":33458,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

Looking at the server log for the second request: kv cache rm starts at p0=51 and 15 prompt tokens are reported as processed, yet prompt evaluation of this subsequent request still takes nearly 8 seconds, much longer than I would expect for just the incremental prompt.
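
For reference, a quick way to pull these prompt-eval numbers out of the server's JSON log lines (field names are taken verbatim from the log excerpts above; the log file path is a placeholder and one JSON object per line is assumed):

# Summarize prompt-eval timings from the server's JSON logs shown above.
import json

with open("server.log") as f:  # placeholder path; non-JSON lines are skipped
    for line in f:
        try:
            rec = json.loads(line)
        except ValueError:
            continue
        if "n_prompt_tokens_processed" in rec:
            n = rec["n_prompt_tokens_processed"]
            ms = rec["t_prompt_processing"]
            print(f"task {rec['id_task']}: {n} prompt tokens in {ms:.0f} ms "
                  f"({ms / max(n, 1):.0f} ms/token)")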
