
Is it true that the context window can be safely doubled without repercussions? #7206

Answered by skoulik
jarcen asked this question in Q&A

2x works OK with Llama 3 8B, effectively extending the context window to 16K tokens. I tried this out of curiosity. How much the quality of responses suffers is for you to decide, or use standard benchmarking for comparison (see the perplexity sketch after the command below). You need to adjust the RoPE parameters, though (base and scale).
This is how I used it:

server.exe -m Meta-Llama-3-8B-Instruct_Q8_0.gguf --ctx-size 16384 --n-gpu-layers 33 -t 1 --port 8081 --override-kv tokenizer.ggml.pre=str:llama3 --chat-template llama3 --flash-attn --rope-freq-base 500000 --rope-freq-scale 0.5
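To put a number on the quality hit instead of eyeballing it, llama.cpp's bundled perplexity tool accepts the same RoPE flags. This is a hedged sketch rather than a command verified on this exact build: the binary is perplexity.exe in older releases and llama-perplexity in newer ones, and wiki.test.raw stands in for whatever evaluation text you use:

perplexity.exe -m Meta-Llama-3-8B-Instruct_Q8_0.gguf -f wiki.test.raw --ctx-size 16384 --n-gpu-layers 33 --rope-freq-base 500000 --rope-freq-scale 0.5

Comparing the result against a baseline run at --ctx-size 8192 with the model's stock RoPE settings shows how much the scaling actually costs.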

--rope-freq-scale 0.5 is what actually "doubles" the context (more precisely, it compresses 16K positions into the 8K range the model was trained on)
--rope-freq-base is specific to each model and must be looked up…
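For intuition about why scaling the frequency "compresses" positions, here is a minimal NumPy sketch of linear RoPE scaling. It is illustrative only, not llama.cpp's actual kernel, and it assumes Llama 3 8B's head dimension of 128 plus the 500000 base from the command above:

import numpy as np

def rope_angles(pos, head_dim=128, freq_base=500000.0, freq_scale=1.0):
    # Per-pair inverse frequencies: base^(-2i/d) for i in [0, d/2)
    inv_freq = freq_base ** (-np.arange(0, head_dim, 2) / head_dim)
    # Linear scaling multiplies the position index before the rotation,
    # squeezing a longer sequence into the trained position range
    return (pos * freq_scale) * inv_freq

# Position 16384 at scale 0.5 gets the same rotation angles as
# position 8192 at scale 1.0: the compression described above
assert np.allclose(rope_angles(16384, freq_scale=0.5), rope_angles(8192))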
