
[Usage]: How to offload some layers to CPU? #3931

Open
cheney369 opened this issue Apr 9, 2024 · 5 comments
Labels: usage (How to use vllm)

Comments

@cheney369

Your current environment

None

How would you like to use vllm

I want to load qwen2-14B-chat with vLLM, but I only have one RTX 4090 (24 GB).
Can vLLM offload some layers to the CPU and keep the others on the GPU?
As far as I know, Transformers with Accelerate and llama.cpp can do this, but I want to use vLLM's multi-LoRA switching feature.
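For reference, partial offload with the two alternatives mentioned above looks roughly like this. This is a minimal sketch assuming Accelerate's `device_map`/`max_memory` arguments and llama-cpp-python's `n_gpu_layers` parameter; the model id and GGUF filename are placeholders, and this is not something vLLM itself provides here.

```python
# Minimal sketch (not a vLLM feature): partial CPU offload with Transformers + Accelerate.
# Layers that do not fit in the GPU memory budget are placed in CPU RAM automatically.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-14B-Chat"  # assumed model id, used only for illustration

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # let Accelerate split layers across devices
    max_memory={0: "20GiB", "cpu": "48GiB"},  # cap GPU 0 usage, spill remaining layers to CPU
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# llama.cpp equivalent via llama-cpp-python: keep only the first N layers on the GPU.
# from llama_cpp import Llama
# llm = Llama(model_path="qwen1_5-14b-chat-q4_k_m.gguf", n_gpu_layers=30)
```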

@cheney369 cheney369 added the usage How to use vllm label Apr 9, 2024
@eigen2017
Contributor

#3563

@eigen2017
Contributor

#627

@eigen2017
Contributor

bd-iaas-us#1

@eigen2017
Contributor

bd-iaas-us#3

@eigen2017
Contributor

It's not a good idea to use CPU memory, since vLLM is built for inference acceleration.
A trade-off is to trim some weights so the model fits into limited HBM; MoE models, for example, can drop some experts.
See huggingface/transformers#30552.
