Can vLLM serve clients using multiple model instances? #239

aoyulong · 2023-06-21T07:24:05Z

Based on the examples, vLLM can launch a server with a single model instance. Can vLLM serve clients using multiple model instances? With multiple model instances, the server would dispatch requests to different instances to reduce the overhead.

zhuohan123 · 2023-06-21T10:47:16Z

Maintainer

Right now vLLM is a serving engine for a single model. You can start multiple vLLM server replicas and put a custom load balancer (e.g., an nginx load balancer) in front of them. Also feel free to check out FastChat and other multi-model frontends (e.g., aviary). vLLM can act as a model worker for these libraries to support multi-replica serving.
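
To illustrate the replica idea above, here is a minimal sketch of a client-side round-robin dispatcher written with the Python standard library only. It is not part of vLLM and is not how nginx works internally; it just shows the dispatch pattern. The class name `RoundRobinBalancer`, the `post_json` helper, and the two `localhost` addresses are hypothetical and assume two server replicas were started separately on ports 8000 and 8001.

```python
import itertools
import json
import threading
import urllib.request


class RoundRobinBalancer:
    """Cycle through a fixed list of replica base URLs, one per request."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("at least one replica URL is required")
        self._cycle = itertools.cycle(list(replicas))
        self._lock = threading.Lock()

    def next_replica(self):
        # The lock keeps the rotation consistent when called from several threads.
        with self._lock:
            return next(self._cycle)

    def post_json(self, path, payload, timeout=60):
        """Send a JSON POST to the next replica and return the decoded JSON reply."""
        url = self.next_replica().rstrip("/") + path
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))


# Hypothetical replica addresses (assumption): two servers on different ports.
balancer = RoundRobinBalancer(["http://localhost:8000", "http://localhost:8001"])
```

Each call to `balancer.post_json(path, payload)` goes to the next replica in turn, where `path` is whatever endpoint your replicas expose. A real deployment would more likely use nginx or a similar proxy, which also handles health checks and retries that this sketch leaves out.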

hughesadam87 · 2024-05-06T14:00:12Z

Is this still the case? If so, why does the API support a model parameter if the intent is not to host multiple models?
