Third-party applications are overwhelmingly slow for subsequent prompt evaluation compared to examples/main and examples/server #7185

Open

khimaros opened this issue May 9, 2024 · 2 comments
Labels: enhancement (New feature or request)

Comments

@khimaros (Contributor) commented May 9, 2024

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Motivation

Third-party applications are overwhelmingly slow at subsequent prompt evaluation. Where a subsequent prompt in the examples/server web interface can be evaluated in seconds, longer chats in these applications can take several minutes just to begin generating additional text.

I believe there are two separate issues:

  • users of the OpenAI-compatible endpoint in examples/server are not taking advantage of the prompt cache
  • users of the llama-cpp-python high-level API (including the server it ships with) are not taking advantage of the prompt cache

Description

N.B. it is possible that this is only a documentation issue.

Request: provide a well-lit path for consumers of the llama.cpp API and the OpenAI-compatible examples/server endpoint to avoid reprocessing the full chat history on each subsequent prompt evaluation.
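
As an illustration of what such a path could look like from a client's side, here is a minimal sketch that forwards the server's cache_prompt slot parameter through the openai Python client's extra_body; the base_url, api_key, and model name are placeholders, and whether a given client even exposes something like extra_body is exactly the discoverability problem described here.

# Hedged sketch: forward llama.cpp's cache_prompt extension through the
# standard openai client. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-needed")

history = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=history,
    extra_body={"cache_prompt": True},  # llama.cpp server slot parameter
)
print(resp.choices[0].message.content)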

I suspect there is a usability or discoverability issue with the llama.cpp APIs that is leading to inefficient use of llama.cpp. I've tested many llama.cpp-based apps on Linux and Android (many listed in the README) and all of them struggle with this problem.

  • llama-cpp-python[server]
  • oobabooga/text-generation-webui
  • KoboldCpp
  • Mobile-Artificial-Intelligence/maid (using the examples/server API)
  • ztjhz/BetterChatGPT (using the examples/server API)

In the case of text-generation-webui and KoboldCpp, I tested both the built-in (llama-cpp-python based) inference and their use as API clients for the examples/server endpoint. Both suffer from this problem.

examples/main and examples/server are the only two pieces of software I've tested that handle this well, which makes these two simple examples the most performant way to interact with LLMs.

The high-level llama-cpp-python API seems to perpetuate this mistake, which has follow-on effects for downstream consumers such as oobabooga's text-generation-webui: abetlen/llama-cpp-python#181 (don't be fooled by the closed status; the issue persists).
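
For context, my understanding is that the high-level llama-cpp-python API does ship an opt-in prompt cache, but consumers have to discover and enable it themselves. A rough sketch (treat the class and method names as assumptions about the installed version, and the model path as a placeholder):

# Sketch only: opting in to llama-cpp-python's prompt cache.
# Assumption: Llama.set_cache() and LlamaCache exist in the installed version.
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/model.gguf", n_ctx=4096)  # placeholder path
llm.set_cache(LlamaCache())  # without this, each call re-evaluates the full prompt

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
)
print(out["choices"][0]["message"]["content"])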

khimaros added the enhancement label May 9, 2024

@khimaros (Contributor, Author) commented May 9, 2024

Digging a bit deeper into the reason for the speed of the examples/server frontend: it looks like this frontend uses the /completion API rather than the /v1/chat/completions API used by other OpenAI-compatible frontends. I suspect that the use of cache_prompt: true in its requests is also significant.
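
For comparison, a minimal sketch of that kind of /completion request with prompt caching enabled (the prompt text, server address, and parameter values are illustrative; the web UI sends a fuller set of sampling parameters):

# Rough sketch of a /completion-style request with prompt caching enabled.
import requests

payload = {
    "prompt": "You are a helpful AI assistant.\nUSER: What is 2 + 2?\nASSISTANT:",
    "n_predict": 128,
    "cache_prompt": True,  # reuse the KV cache for the shared prompt prefix
    "temperature": 0.8,
    "seed": 23,
}
r = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=600)
print(r.json()["content"])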

I tested with some manual curl requests, and it seems the cache_prompt parameter is also accepted on the /v1/chat/completions endpoint. I'll change the default slot value in examples/server/server.cpp and test a few clients to see if this helps with performance.

@khimaros (Contributor, Author) commented May 9, 2024

I'm testing with the following patch:

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index ff0814b2..0464280e 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -98,7 +98,7 @@ struct server_task_multi {
 
 struct slot_params {
     bool stream       = true;
-    bool cache_prompt = false; // remember the prompt to avoid reprocessing all prompt
+    bool cache_prompt = true; // remember the prompt to avoid reprocessing all prompt
 
     uint32_t seed      = -1; // RNG seed
     int32_t  n_keep    =  0; // number of tokens to keep from initial prompt
@@ -834,7 +834,7 @@ struct server_context {
         }
 
         slot.params.stream             = json_value(data, "stream",            false);
-        slot.params.cache_prompt       = json_value(data, "cache_prompt",      false);
+        slot.params.cache_prompt       = json_value(data, "cache_prompt",      true);
         slot.params.n_predict          = json_value(data, "n_predict",         default_params.n_predict);
         slot.sparams.top_k             = json_value(data, "top_k",             default_sparams.top_k);
         slot.sparams.top_p             = json_value(data, "top_p",             default_sparams.top_p);

This didn't help any of the clients that I tested.

Moving on to some manual testing with curl/hurl.

I'm sending what should be a purely additive sequence of requests (using a static seed), which seems like it should pull from the cache.

The first request:

POST http://127.0.0.1:8080/v1/chat/completions
{
        "model": "gpt-3.5-turbo",
        "cache_prompt": true,
        "messages": [
                {"role": "system", "content": "You are a helpful AI assistant."},
                {"role": "user", "content": "What is 2 + 2?"}
        ],
        "temperature": 0.8,
        "seed": 23,
        "top_p": 1.0,
        "min_p": 0.05,
        "stream": false,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "n": 1,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "mirostat_mode": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1
}

Response body:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The answer is 4! This is a very basic addition problem. Are you looking for help with any other simple math questions? \n",
        "role": "assistant"
      }
    }
  ],
  "created": 1715288809,
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 28,
    "prompt_tokens": 25,
    "total_tokens": 53
  },
  "id": "chatcmpl-Ikv3dt0Z3FerIQdxR0Kl99RaG2cVqCG5"
}

And the server log for that request:

{"tid":"140586246731584","timestamp":1715288785,"level":"INFO","function":"launch_slot_with_task","line":1036,"msg":"slot is processing task","id_slot":0,"id_task":255}
{"tid":"140586246731584","timestamp":1715288785,"level":"INFO","function":"update_slots","line":2043,"msg":"we have to evaluate at least 1 token to generate logits","id_slot":0,"id_task":255}
{"tid":"140586246731584","timestamp":1715288785,"level":"INFO","function":"update_slots","line":2087,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":255,"p0":24}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"print_timings","line":313,"msg":"prompt eval time     =     806.35 ms /     1 tokens (  806.35 ms per token,     1.24 tokens per second)","id_slot":0,"id_task":255,"t_prompt_processing":806.349,"n_prompt_tokens_processed":1,"t_token":806.349,"n_tokens_second":1.2401577976781766}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"print_timings","line":329,"msg":"generation eval time =   23563.84 ms /    28 runs   (  841.57 ms per token,     1.19 tokens per second)","id_slot":0,"id_task":255,"t_token_generation":23563.84,"n_decoded":28,"t_token":841.5657142857143,"n_tokens_second":1.1882613360131455}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"print_timings","line":340,"msg":"          total time =   24370.19 ms","id_slot":0,"id_task":255,"t_prompt_processing":806.349,"t_token_generation":23563.84,"t_total":24370.189}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"update_slots","line":1781,"msg":"slot released","id_slot":0,"id_task":255,"n_ctx":4096,"n_past":52,"n_system_tokens":0,"n_cache_tokens":52,"truncated":false}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"update_slots","line":1807,"msg":"all slots are idle"}
{"tid":"140551452063424","timestamp":1715288809,"level":"INFO","function":"log_server_request","line":2862,"msg":"request","remote_addr":"127.0.0.1","remote_port":33458,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

The subsequent request:

POST http://127.0.0.1:8080/v1/chat/completions
{
        "model": "gpt-3.5-turbo",
        "cache_prompt": true,
        "messages": [
                {"role": "system", "content": "You are a helpful AI assistant."},
                {"role": "user", "content": "What is 2 + 2?"},
                {"role": "assistant", "content": "The answer is 4! This is a very basic addition problem. Are you looking for help with any other simple math questions? \n"},
                {"role": "user", "content": "What is 4 + 4?"}
        ],
        "temperature": 0.8,
        "seed": 23,
        "top_p": 1.0,
        "min_p": 0.05,
        "stream": false,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "n": 1,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "mirostat_mode": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1
}

Response body:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The answer to this one is also 4! It's another straightforward addition problem: the two fours add up to make eight, and then half of eight is four. Easy peasy!",
        "role": "assistant"
      }
    }
  ],
  "created": 1715288852,
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 39,
    "prompt_tokens": 66,
    "total_tokens": 105
  },
  "id": "chatcmpl-2Thkbuwu0V4w4TgfqI14kEjnaKQeC129"
}

And the server log:

{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"launch_slot_with_task","line":1036,"msg":"slot is processing task","id_slot":0,"id_task":284}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"update_slots","line":2087,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":284,"p0":51}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"print_timings","line":313,"msg":"prompt eval time     =    7871.37 ms /    15 tokens (  524.76 ms per token,     1.91 tokens per second)","id_slot":0,"id_task":284,"t_prompt_processing":7871.371,"n_prompt_tokens_processed":15,"t_token":524.7580666666667,"n_tokens_second":1.9056400721043387}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"print_timings","line":329,"msg":"generation eval time =   34750.44 ms /    39 runs   (  891.04 ms per token,     1.12 tokens per second)","id_slot":0,"id_task":284,"t_token_generation":34750.439,"n_decoded":39,"t_token":891.0368974358973,"n_tokens_second":1.122287980304364}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"print_timings","line":340,"msg":"          total time =   42621.81 ms","id_slot":0,"id_task":284,"t_prompt_processing":7871.371,"t_token_generation":34750.439,"t_total":42621.81}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"update_slots","line":1781,"msg":"slot released","id_slot":0,"id_task":284,"n_ctx":4096,"n_past":104,"n_system_tokens":0,"n_cache_tokens":104,"truncated":false}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"update_slots","line":1807,"msg":"all slots are idle"}
{"tid":"140551452063424","timestamp":1715288852,"level":"INFO","function":"log_server_request","line":2862,"msg":"request","remote_addr":"127.0.0.1","remote_port":33458,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

Looking at the server log for the second request: kv cache rm starts at p0=51 and 15 prompt tokens are reported as processed, yet prompt evaluation of this subsequent request still takes nearly 8 seconds, much longer than I would expect for just the incremental prompt.
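
For reference, a quick way to pull these prompt-eval numbers out of the server's JSON log lines (field names are taken verbatim from the log excerpts above; the log file path is a placeholder and one JSON object per line is assumed):

# Summarize prompt-eval timings from the server's JSON logs shown above.
import json

with open("server.log") as f:  # placeholder path; non-JSON lines are skipped
    for line in f:
        try:
            rec = json.loads(line)
        except ValueError:
            continue
        if "n_prompt_tokens_processed" in rec:
            n = rec["n_prompt_tokens_processed"]
            ms = rec["t_prompt_processing"]
            print(f"task {rec['id_task']}: {n} prompt tokens in {ms:.0f} ms "
                  f"({ms / max(n, 1):.0f} ms/token)")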
