
feat: Utilize free_gpu_memory_fraction to control max VRAM consumption #25

Open
hiro-v opened this issue Mar 16, 2024 · 1 comment

hiro-v commented Mar 16, 2024

Currently, tensorrt_llm tries to allocate as much VRAM as possible, split across three portions: https://nvidia.github.io/TensorRT-LLM/memory.html

Please add free_gpu_memory_fraction (https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L169-L172) as a nitro parameter so that we can control it. Of course the machine still has to have enough VRAM to load the weights, but we can reduce the VRAM reserved for the other portions.
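To illustrate the semantics of this parameter: in the linked code, free_gpu_memory_fraction bounds how much of the VRAM left over after weight loading the KV cache may claim. The helper below is a hypothetical, self-contained sketch of that budgeting arithmetic (the function name and signature are illustrative, not part of the TensorRT-LLM API):

```python
def kv_cache_budget_bytes(total_vram_bytes: int,
                          used_bytes: int,
                          free_gpu_memory_fraction: float = 0.9) -> int:
    """Hypothetical sketch of how free_gpu_memory_fraction caps KV-cache VRAM.

    After the engine weights and activations occupy `used_bytes`, the KV
    cache may take at most `free_gpu_memory_fraction` of whatever remains.
    """
    if not 0.0 < free_gpu_memory_fraction <= 1.0:
        raise ValueError("free_gpu_memory_fraction must be in (0, 1]")
    free_bytes = total_vram_bytes - used_bytes
    if free_bytes <= 0:
        # Not enough VRAM even for the weights: nothing left for the cache.
        return 0
    return int(free_bytes * free_gpu_memory_fraction)


GIB = 2 ** 30
# Example: a 24 GiB GPU with 14 GiB taken by weights, fraction lowered to 0.5
# leaves a 5 GiB KV-cache budget instead of the ~9 GiB a 0.9 default would grab.
budget = kv_cache_budget_bytes(24 * GIB, 14 * GIB, free_gpu_memory_fraction=0.5)
print(budget // GIB)  # 5
```

Lowering the fraction is exactly the knob the issue asks to expose: it trades maximum batch/context capacity for headroom on VRAM-constrained machines.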

This would let more people with GPU VRAM constraints use tensorrt_llm.

@hiro-v hiro-v added the type: feature request A new feature label Mar 16, 2024

tikikun commented Mar 21, 2024

Relevant document (screenshot attached)

@tikikun tikikun self-assigned this Mar 21, 2024
@tikikun tikikun changed the title feature request: Add free_gpu_memory_fraction to control max VRAM consumption feat: Utilize free_gpu_memory_fraction to control max VRAM consumption Mar 21, 2024