From bits of information I've gathered, it seems that if a model is trained with an 8K context and you double it to 16K by scaling RoPE, it's completely fine, and it only becomes a problem beyond 2x. Is that true? Does it work for mainstream models like Mistral and Llama? Any notable exceptions? I'm asking because this is complicated to test by myself: "needle in a haystack" is a simple test but not considered sufficient, and the other tests I can run are entirely subjective. If you have objective information about this, that would be great, but I'm also not against hearing your subjective evaluations. If the majority of people report it's fine, I'll be inclined to accept that. The next question is how I would do it. Is it sufficient to set
-
2x works OK with llama-3 8B, effectively extending the context window to 16K tokens. I've tried this out of curiosity. How much the quality of responses suffers is for you to decide (or use standard benchmarks for comparison). You need to adjust the RoPE parameters, though (base and scale).
This is how I used it:
server.exe -m Meta-Llama-3-8B-Instruct_Q8_0.gguf --ctx-size 16384 --n-gpu-layers 33 -t 1 --port 8081 --override-kv tokenizer.ggml.pre=str:llama3 --chat-template llama3 --flash-attn --rope-freq-base 500000 --rope-freq-scale 0.5
--rope-freq-scale 0.5 is what actually "doubles" the context (more precisely, it compresses the 16K positions into the 8K range the model was trained on).
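For intuition, here is a minimal sketch of what that linear scaling does (plain NumPy, not llama.cpp's actual code; head_dim=128 and base=500000 are Llama-3-ish values used purely for illustration):

import numpy as np

def rope_angles(position, head_dim=128, base=500000.0, scale=1.0):
    # Per-dimension inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    # Linear scaling: the position is multiplied by scale before rotation,
    # so scale=0.5 makes position 16383 look like position ~8191 to the model
    return (position * scale) * inv_freq

# With scale=0.5, the angles at position 16383 match those at position 8191.5
assert np.allclose(rope_angles(16383, scale=0.5), rope_angles(8191.5, scale=1.0))

Positions beyond the trained window are interpolated into the range the model has already seen rather than extrapolated past it, which is roughly why 2x works reasonably even without fine-tuning.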
--rope-freq-base is specific to each model and must be looked up…
You might want to fine-tune the model on long contexts to get better results, though. Hope this helps.