Question regarding the nearly double GPU memory consumption. #241

Answered by zhuohan123
Zhuqln asked this question in Q&A
Thank you for bringing this up. Yes, the extra memory usage comes from the KV cache. vLLM pre-allocates and reserves the maximum possible amount of GPU memory for KV cache blocks up front; the KV cache generated during inference is then written into these reserved blocks. You can limit the total GPU memory usage by setting the parameter gpu_memory_utilization.
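A minimal sketch of capping the reserved memory with the offline `LLM` API (the model name and 0.5 fraction here are only illustrative examples; this requires a GPU with vLLM installed):

```python
from vllm import LLM, SamplingParams

# gpu_memory_utilization is the fraction of total GPU memory vLLM is
# allowed to use, including the pre-allocated KV cache blocks.
# Lowering it shrinks the reserved KV cache region (at the cost of
# fitting fewer concurrent sequences).
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```

Note that the memory reported by `nvidia-smi` will still sit near the configured fraction, because the reservation is intentional: pre-allocated blocks are what the paged KV cache hands out during inference.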

Replies: 4 comments 4 replies

Comment options

You must be logged in to vote
0 replies
Comment options

You must be logged in to vote
4 replies
@Zhuqln
Comment options

@zhuohan123
Comment options

@Zhuqln
Comment options

@humza-sami
Comment options

Answer selected by zhuohan123
Category
Q&A
5 participants
Converted from issue
This discussion was converted from issue #235 on June 25, 2023 16:50.