Align Ollama DEFAULT_CONTEXT_WINDOW to match the Ollama CLI default: 2048 #13139

Conversation

@komal-SkyNET commented Apr 28, 2024

Description

The Ollama integration in llama-index sets DEFAULT_CONTEXT_WINDOW to 3900 (n_ctx = 3904 in the server logs), which is higher than the Ollama CLI default of 2048 (n_ctx = 2048). This causes garbled output even for simple queries against the llama3:instruct model, and it deters new users/developers from getting llama-index up and running quickly with Ollama (llama3). In contrast, langchain works out of the box because its defaults align with the Ollama CLI, so performance and consistency are retained without tweaking. In other words, running a query through the interactive CLI (ollama run) and running it through llama-index with defaults should behave identically.

Fixes #13106

Fixes the timeout, junk output, and "ggml_metal_graph_compute: command buffer 3 failed with status 5" errors caused by the mismatch in the default context window between the Ollama CLI and the llama-index integration.
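
Until the defaults are aligned, a minimal workaround sketch for affected users (assuming the integration's context_window parameter, which is passed through to the server as num_ctx) is to pin the context window explicitly so it matches the CLI:

from llama_index.llms.ollama import Ollama

# Workaround sketch: pass context_window explicitly so the request's num_ctx
# matches the Ollama CLI default instead of the larger library default.
llm = Ollama(
    model="llama3:instruct",
    request_timeout=120,
    context_window=2048,  # Ollama CLI default
)
print(llm.complete("Why is the sky blue?"))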

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No - It's not a new integration

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Just a change in default value.
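
For reference, the change itself is essentially a one-line default update along these lines (a sketch only; the exact file and surrounding code in the Ollama integration are assumed):

# Proposed default in the llama-index Ollama integration (location assumed)
DEFAULT_CONTEXT_WINDOW = 2048  # align with the Ollama CLI/server default num_ctx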

To reproduce:

Hardware/bootstrap logs:

llama_new_context_with_model: n_ctx      = 3904
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  5726.63 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   488.00 MiB, ( 5056.88 /  5461.34)
llama_kv_cache_init:      Metal KV buffer size =   488.00 MiB
llama_new_context_with_model: KV self size  =  488.00 MiB, K (f16):  244.00 MiB, V (f16):  244.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   283.64 MiB, ( 5340.52 /  5461.34)
llama_new_context_with_model:      Metal compute buffer size =   283.63 MiB
llama_new_context_with_model:        CPU compute buffer size =    15.63 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

Command:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3:instruct",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 3904
  }
}'

On Llama-index:

from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3", request_timeout=120)
llm.complete("Why is the sky blue?")

Output:

{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.109309Z","response":"3","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.192524Z","response":"D","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.261377Z","response":"8","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.330906Z","response":";","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.399573Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.479188Z","response":"#","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.542696Z","response":"/","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.616321Z","response":"D","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.689998Z","response":")","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.759423Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.82956Z","response":"8","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.895623Z","response":"5","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.959387Z","response":":","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.029689Z","response":".","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.095651Z","response":"5","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.168368Z","response":"7","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.241777Z","response":"0","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.323894Z","response":"6","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.397235Z","response":"/","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.467832Z","response":"2","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.565109Z","response":"7","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.654553Z","response":"6","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.727295Z","response":"G","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.793783Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.864981Z","response":"D","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.95098Z","response":"G","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.024172Z","response":"'","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.103524Z","response":"2","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.187653Z","response":":","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.253375Z","response":"3","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.337022Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.399163Z","response":".","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.467277Z","response":"\"","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.537366Z","response":"4","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.597877Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.671598Z","response":"G","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.734082Z","response":"%","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.798676Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.864071Z","response":"\u003e","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.931211Z","response":"(","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.993299Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.054537Z","response":".","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.119623Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.186948Z","response":"C","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.264504Z","response":"C","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.330198Z","response":"F","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.395072Z","response":"=","done":false}

Logs from Ollama server:

llama_new_context_with_model: n_ctx      = 3904
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  5726.63 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   488.00 MiB, ( 5056.88 /  5461.34)
llama_kv_cache_init:      Metal KV buffer size =   488.00 MiB
llama_new_context_with_model: KV self size  =  488.00 MiB, K (f16):  244.00 MiB, V (f16):  244.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   283.64 MiB, ( 5340.52 /  5461.34)
llama_new_context_with_model:      Metal compute buffer size =   283.63 MiB
llama_new_context_with_model:        CPU compute buffer size =    15.63 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
ggml_metal_graph_compute: command buffer 3 failed with status 5
{"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"initialize","level":"INFO","line":460,"msg":"new slot","n_ctx_slot":3904,"slot_id":0,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"0x1df38d000","timestamp":1714271140}
{"function":"validate_model_chat_template","level":"ERR","line":437,"msg":"The chat template comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses","tid":"0x1df38d000","timestamp":1714271140}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"7","port":"65236","tid":"0x1df38d000","timestamp":1714271140}
{"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"0x1df38d000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":0,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":1,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":2,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65241,"status":200,"tid":"0x17dd13000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65237,"status":200,"tid":"0x17dae3000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":3,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65239,"status":200,"tid":"0x17dbfb000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":4,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65238,"status":200,"tid":"0x17db6f000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65240,"status":200,"tid":"0x17dc87000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":5,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65273,"status":200,"tid":"0x17dd9f000","timestamp":1714271140}
time=2024-04-28T12:25:40.713+10:00 level=DEBUG source=server.go:431 msg="llama runner started in 7.403734 seconds"
time=2024-04-28T12:25:40.718+10:00 level=DEBUG source=routes.go:259 msg="generate handler" prompt="Why is the sky blue?"
time=2024-04-28T12:25:40.719+10:00 level=DEBUG source=routes.go:260 msg="generate handler" template="{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>\n\n{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>\n\n{{ .Response }}<|eot_id|>"
time=2024-04-28T12:25:40.719+10:00 level=DEBUG source=routes.go:261 msg="generate handler" system=""
time=2024-04-28T12:25:40.723+10:00 level=DEBUG source=routes.go:292 msg="generate handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhy is the sky blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":6,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65273,"status":200,"tid":"0x17dd9f000","timestamp":1714271140}
{"function":"launch_slot_with_data","level":"INFO","line":833,"msg":"slot is processing task","slot_id":0,"task_id":7,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1816,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":16,"slot_id":0,"task_id":7,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"update_slots","level":"INFO","line":1840,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":7,"tid":"0x1df38d000","timestamp":1714271140}
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot (bot) added the size:XS label (This PR changes 0-9 lines, ignoring generated files) on Apr 28, 2024
@logan-markewich (Collaborator)

2048 is extremely small for most RAG use cases.

Personally I've never had issues using Ollama 😅 so I'm super confused why this specific setting would cause issues. Llama3, for example, has an ~8k context window?

@komal-SkyNET (Author) commented Apr 28, 2024

@logan-markewich I'd assume we'd want to set the defaults to a common-denominator value, something that works for everyone out of the box without tweaking. In its current state the library fails silently with a timeout, without any trace of the underlying issue, on a machine like an Apple M1 with 8GB of memory, which I think is a reasonable baseline to target. Furthermore, the CLI ollama run llama3 starts with a context window of 2048, so the same query will succeed in the CLI but fail through llama-index.
For context, this refers to llama3:instruct (a quantized model, ~4.7GB).

@komal-SkyNET (Author)

@logan-markewich Not to mention that langchain with llama3 works out of the box with its default settings, while llama-index with llama3 doesn't (on any machine comparable to an 8GB Apple M1).
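
For comparison, a minimal langchain-side sketch of the equivalent call; the import path and API shown here are assumptions based on langchain-community as of early 2024:

from langchain_community.llms import Ollama  # import path assumed

# No context-window override: the server's default num_ctx (2048) is used.
llm = Ollama(model="llama3")
print(llm.invoke("Why is the sky blue?"))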

@yisding (Collaborator) commented Apr 30, 2024

I agree with Logan here that 2K is too small for many RAG applications. In fact, we should be going higher: 8K for Llama 3 and 64K for Mixtral 8x22b.

That said, I hear @komal-SkyNET about the difficulties when running on machines with 8GB of system RAM, so let's reach out to Ollama to see if they can give us back some kind of error in that scenario. If not, we can do a quick-and-dirty hack using psutil. Actually, doing some kind of psutil check might not be a bad idea regardless, to prevent us from locking up users' computers like the first time I tried using our Ollama integration (and that was with 16GB of RAM!)
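
A minimal sketch of the kind of psutil check described above; the constant values, names, and the 8GB threshold are illustrative assumptions rather than anything proposed in this PR:

import psutil

LIBRARY_DEFAULT_CONTEXT_WINDOW = 3900   # assumed current llama-index default
CLI_DEFAULT_CONTEXT_WINDOW = 2048       # Ollama CLI default

def pick_default_context_window() -> int:
    # Fall back to the smaller, CLI-aligned context window on low-memory machines
    # to avoid the junk output / Metal "status 5" failures reported above.
    total_gb = psutil.virtual_memory().total / (1024 ** 3)
    if total_gb <= 8:  # threshold is an assumption
        return CLI_DEFAULT_CONTEXT_WINDOW
    return LIBRARY_DEFAULT_CONTEXT_WINDOW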

@jmorganca

Hi folks, I work on Ollama - sorry you hit this issue! A fix is on the way and will be in the next release: ollama/ollama#4068

@logan-markewich (Collaborator)

In light of the above, going to close this.
