Align Ollama DEFAULT_CONTEXT_WINDOW to match the Ollama CLI default: 2048 #13139

Conversation

@komal-SkyNET commented Apr 28, 2024

Description

The Ollama integration in llama-index sets DEFAULT_CONTEXT_WINDOW to 3900 (n_ctx = 3904 in the server logs), which is higher than the Ollama CLI default of 2048 (n_ctx = 2048). This causes garbled output even for simple queries against the llama3:instruct model, and it deters new users/developers from getting llama-index up and running quickly with Ollama (llama3). In contrast, langchain works out of the box because its defaults align with the Ollama CLI, so performance and consistency are retained without tweaking. In other words, running a query through the interactive CLI (ollama run) and running it through llama-index with defaults should behave identically.

Fixes #13106

Fixes the timeout, junk output, and "ggml_metal_graph_compute: command buffer 3 failed with status 5" errors caused by the mismatch in the default context window between the Ollama CLI and the llama-index integration.
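
Until the defaults are aligned, a minimal workaround sketch for affected users (assuming the integration's context_window parameter, which is passed through to the server as num_ctx) is to pin the context window explicitly so it matches the CLI:

from llama_index.llms.ollama import Ollama

# Workaround sketch: pass context_window explicitly so the request's num_ctx
# matches the Ollama CLI default instead of the larger library default.
llm = Ollama(
    model="llama3:instruct",
    request_timeout=120,
    context_window=2048,  # Ollama CLI default
)
print(llm.complete("Why is the sky blue?"))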

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No - It's not a new integration

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Just a change in default value.
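
For reference, the change itself is essentially a one-line default update along these lines (a sketch only; the exact file and surrounding code in the Ollama integration are assumed):

# Proposed default in the llama-index Ollama integration (location assumed)
DEFAULT_CONTEXT_WINDOW = 2048  # align with the Ollama CLI/server default num_ctx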

To reproduce:

Hardware/bootstrap logs:

llama_new_context_with_model: n_ctx      = 3904
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  5726.63 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   488.00 MiB, ( 5056.88 /  5461.34)
llama_kv_cache_init:      Metal KV buffer size =   488.00 MiB
llama_new_context_with_model: KV self size  =  488.00 MiB, K (f16):  244.00 MiB, V (f16):  244.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   283.64 MiB, ( 5340.52 /  5461.34)
llama_new_context_with_model:      Metal compute buffer size =   283.63 MiB
llama_new_context_with_model:        CPU compute buffer size =    15.63 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

Command:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3:instruct",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 3904
  }
}'

On Llama-index:

from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3", request_timeout=120)
llm.complete("Why is the sky blue?")

Output:

{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.109309Z","response":"3","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.192524Z","response":"D","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.261377Z","response":"8","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.330906Z","response":";","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.399573Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.479188Z","response":"#","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.542696Z","response":"/","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.616321Z","response":"D","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.689998Z","response":")","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.759423Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.82956Z","response":"8","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.895623Z","response":"5","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.959387Z","response":":","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.029689Z","response":".","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.095651Z","response":"5","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.168368Z","response":"7","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.241777Z","response":"0","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.323894Z","response":"6","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.397235Z","response":"/","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.467832Z","response":"2","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.565109Z","response":"7","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.654553Z","response":"6","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.727295Z","response":"G","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.793783Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.864981Z","response":"D","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.95098Z","response":"G","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.024172Z","response":"'","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.103524Z","response":"2","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.187653Z","response":":","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.253375Z","response":"3","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.337022Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.399163Z","response":".","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.467277Z","response":"\"","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.537366Z","response":"4","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.597877Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.671598Z","response":"G","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.734082Z","response":"%","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.798676Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.864071Z","response":"\u003e","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.931211Z","response":"(","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.993299Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.054537Z","response":".","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.119623Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.186948Z","response":"C","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.264504Z","response":"C","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.330198Z","response":"F","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.395072Z","response":"=","done":false}

Logs from Ollama server:

llama_new_context_with_model: n_ctx      = 3904
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  5726.63 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   488.00 MiB, ( 5056.88 /  5461.34)
llama_kv_cache_init:      Metal KV buffer size =   488.00 MiB
llama_new_context_with_model: KV self size  =  488.00 MiB, K (f16):  244.00 MiB, V (f16):  244.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   283.64 MiB, ( 5340.52 /  5461.34)
llama_new_context_with_model:      Metal compute buffer size =   283.63 MiB
llama_new_context_with_model:        CPU compute buffer size =    15.63 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
ggml_metal_graph_compute: command buffer 3 failed with status 5
{"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"initialize","level":"INFO","line":460,"msg":"new slot","n_ctx_slot":3904,"slot_id":0,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"0x1df38d000","timestamp":1714271140}
{"function":"validate_model_chat_template","level":"ERR","line":437,"msg":"The chat template comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses","tid":"0x1df38d000","timestamp":1714271140}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"7","port":"65236","tid":"0x1df38d000","timestamp":1714271140}
{"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"0x1df38d000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":0,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":1,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":2,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65241,"status":200,"tid":"0x17dd13000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65237,"status":200,"tid":"0x17dae3000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":3,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65239,"status":200,"tid":"0x17dbfb000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":4,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65238,"status":200,"tid":"0x17db6f000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65240,"status":200,"tid":"0x17dc87000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":5,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65273,"status":200,"tid":"0x17dd9f000","timestamp":1714271140}
time=2024-04-28T12:25:40.713+10:00 level=DEBUG source=server.go:431 msg="llama runner started in 7.403734 seconds"
time=2024-04-28T12:25:40.718+10:00 level=DEBUG source=routes.go:259 msg="generate handler" prompt="Why is the sky blue?"
time=2024-04-28T12:25:40.719+10:00 level=DEBUG source=routes.go:260 msg="generate handler" template="{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>\n\n{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>\n\n{{ .Response }}<|eot_id|>"
time=2024-04-28T12:25:40.719+10:00 level=DEBUG source=routes.go:261 msg="generate handler" system=""
time=2024-04-28T12:25:40.723+10:00 level=DEBUG source=routes.go:292 msg="generate handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhy is the sky blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":6,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65273,"status":200,"tid":"0x17dd9f000","timestamp":1714271140}
{"function":"launch_slot_with_data","level":"INFO","line":833,"msg":"slot is processing task","slot_id":0,"task_id":7,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1816,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":16,"slot_id":0,"task_id":7,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"update_slots","level":"INFO","line":1840,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":7,"tid":"0x1df38d000","timestamp":1714271140}
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot (bot) added the size:XS label (This PR changes 0-9 lines, ignoring generated files) on Apr 28, 2024
@logan-markewich (Collaborator)

2048 is extremely small for most RAG use cases.

Personally I've never had issues using Ollama 😅 so I'm super confused why this specific setting would cause issues. Llama3, for example, has an ~8k context window?

@komal-SkyNET (Author) commented Apr 28, 2024

@logan-markewich I'd assume we'd want to set the defaults to a common-denominator value, something that works for everyone out of the box without tweaking. In its current state the library fails silently with a timeout, without any trace of the underlying issue, on a machine like an Apple M1 with 8GB of memory, which I think is a reasonable baseline to target. Furthermore, the CLI ollama run llama3 starts with a context window of 2048, so the same query will succeed in the CLI but fail through llama-index.
For context, this refers to llama3:instruct (a quantized model, ~4.7GB).

@komal-SkyNET (Author)

@logan-markewich Not to mention that langchain with llama3 works out of the box with its default settings, while llama-index with llama3 doesn't (on any machine comparable to an 8GB Apple M1).
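
For comparison, a minimal langchain-side sketch of the equivalent call; the import path and API shown here are assumptions based on langchain-community as of early 2024:

from langchain_community.llms import Ollama  # import path assumed

# No context-window override: the server's default num_ctx (2048) is used.
llm = Ollama(model="llama3")
print(llm.invoke("Why is the sky blue?"))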

@yisding (Collaborator) commented Apr 30, 2024

I agree with Logan here that 2K is too small for many RAG applications. In fact, we should be going higher: 8K for Llama 3 and 64K for Mixtral 8x22b.

That said, I hear @komal-SkyNET about the difficulties when running on machines with 8GB of system RAM, so let's reach out to Ollama to see if they can give us back some kind of error in that scenario. If not, we can do a quick-and-dirty hack using psutil. Actually, doing some kind of psutil check might not be a bad idea regardless, to prevent us from locking up users' computers like the first time I tried using our Ollama integration (and that was with 16GB of RAM!)
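
A minimal sketch of the kind of psutil check described above; the constant values, names, and the 8GB threshold are illustrative assumptions rather than anything proposed in this PR:

import psutil

LIBRARY_DEFAULT_CONTEXT_WINDOW = 3900   # assumed current llama-index default
CLI_DEFAULT_CONTEXT_WINDOW = 2048       # Ollama CLI default

def pick_default_context_window() -> int:
    # Fall back to the smaller, CLI-aligned context window on low-memory machines
    # to avoid the junk output / Metal "status 5" failures reported above.
    total_gb = psutil.virtual_memory().total / (1024 ** 3)
    if total_gb <= 8:  # threshold is an assumption
        return CLI_DEFAULT_CONTEXT_WINDOW
    return LIBRARY_DEFAULT_CONTEXT_WINDOW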

@jmorganca

Hi folks, I work on Ollama - sorry you hit this issue! A fix is on the way and will be in the next release: ollama/ollama#4068

@logan-markewich (Collaborator)

In light of the above, going to close this.
