
[Usage]: How to offload some layers to CPU? #3931

Open
cheney369 opened this issue Apr 9, 2024 · 5 comments
Labels: usage (How to use vllm)

Comments

@cheney369

Your current environment

None

How would you like to use vllm

I want to load qwen2-14B-chat with vLLM, but I only have one RTX 4090 (24 GB).
Can vLLM offload some layers to the CPU and keep the others on the GPU?
As far as I know, Transformers with Accelerate and llama.cpp can do this, but I want to use vLLM's multi-LoRA switching feature.
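For reference, partial offload with the two alternatives mentioned above looks roughly like this. This is a minimal sketch assuming Accelerate's `device_map`/`max_memory` arguments and llama-cpp-python's `n_gpu_layers` parameter; the model id and GGUF filename are placeholders, and this is not something vLLM itself provides here.

```python
# Minimal sketch (not a vLLM feature): partial CPU offload with Transformers + Accelerate.
# Layers that do not fit in the GPU memory budget are placed in CPU RAM automatically.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-14B-Chat"  # assumed model id, used only for illustration

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # let Accelerate split layers across devices
    max_memory={0: "20GiB", "cpu": "48GiB"},  # cap GPU 0 usage, spill remaining layers to CPU
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# llama.cpp equivalent via llama-cpp-python: keep only the first N layers on the GPU.
# from llama_cpp import Llama
# llm = Llama(model_path="qwen1_5-14b-chat-q4_k_m.gguf", n_gpu_layers=30)
```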

@cheney369 cheney369 added the usage How to use vllm label Apr 9, 2024
@eigen2017
Contributor

#3563

@eigen2017
Contributor

#627

@eigen2017
Contributor

bd-iaas-us#1

@eigen2017
Contributor

bd-iaas-us#3

@eigen2017
Contributor

It's not a good idea to use CPU memory, since vLLM is built for inference acceleration.
A trade-off is to trim some weights so the model fits into limited HBM; MoE models, for example, can drop some experts.
See huggingface/transformers#30552.
