From bits of information I've gathered, it seems that if a model is trained with an 8K context and you double it to 16K by scaling RoPE, it's completely fine, and it only becomes a problem beyond 2x. Is that true? Does it work for mainstream models like Mistral and Llama? Any notable exceptions? I'm asking because this is complicated to test by myself: "needle in a haystack" is a simple test but not considered sufficient, and the other tests I can run are entirely subjective. If you have objective information about this, that would be great, but I'm also not against hearing your subjective evaluations. If the majority of people report it's fine, I'll be inclined to accept that. The next question is how I would do it. Is it sufficient to set
-
2x works OK with llama-3 8B, effectively extending the context window to 16K tokens. I've tried this out of curiosity. How much the quality of responses suffers is for you to decide (or use standard benchmarks for comparison). You need to adjust the RoPE parameters, though (base and scale).
This is how I used it:
server.exe -m Meta-Llama-3-8B-Instruct_Q8_0.gguf --ctx-size 16384 --n-gpu-layers 33 -t 1 --port 8081 --override-kv tokenizer.ggml.pre=str:llama3 --chat-template llama3 --flash-attn --rope-freq-base 500000 --rope-freq-scale 0.5
--rope-freq-scale 0.5 is what actually "doubles" the context (more precisely, it compresses the 16K positions into the 8K range the model was trained on).
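For intuition, here is a minimal sketch of what that linear scaling does (plain NumPy, not llama.cpp's actual code; head_dim=128 and base=500000 are Llama-3-ish values used purely for illustration):

import numpy as np

def rope_angles(position, head_dim=128, base=500000.0, scale=1.0):
    # Per-dimension inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    # Linear scaling: the position is multiplied by scale before rotation,
    # so scale=0.5 makes position 16383 look like position ~8191 to the model
    return (position * scale) * inv_freq

# With scale=0.5, the angles at position 16383 match those at position 8191.5
assert np.allclose(rope_angles(16383, scale=0.5), rope_angles(8191.5, scale=1.0))

Positions beyond the trained window are interpolated into the range the model has already seen rather than extrapolated past it, which is roughly why 2x works reasonably even without fine-tuning.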
--rope-freq-base is specific to each model and must be looked up…
You might want to fine-tune the model on long contexts to get better results, though. Hope this helps.