ScaleLLM vs vLLM in performance #144

Open
WangErXiao opened this issue Apr 27, 2024 · 14 comments

@WangErXiao

Is there any performance comparison data between ScaleLLM and vLLM?

@zhyncs

zhyncs commented Apr 27, 2024

Hi @guocuimi Thanks for your outstanding work. In addition to a performance comparison with vLLM, please consider also adding TensorRT-LLM, LMDeploy, RTP-LLM, and TGI if possible. And maybe we could use vLLM's benchmark serving script. Thanks.

@guocuimi
Collaborator

guocuimi commented Apr 27, 2024

Thank you for your interest in ScaleLLM. Yeah, it is indeed on our roadmap. We do have some internal numbers, but they are not ready to share yet. As part of our upcoming plans, we will do a comprehensive comparison (in a separate repo) in the coming weeks, after finishing the Python wrapper. Stay tuned!

Meanwhile, feel free to conduct your own benchmarks for your specific scenarios using the vLLM benchmark serving script. Thanks.

@zhyncs

zhyncs commented May 10, 2024

Hi @guocuimi Could you use GitHub Actions to release the Python package? Consider supporting CUDA 11.8 and CUDA 12.2, which would make things more convenient for users. At the same time, we could easily compare performance with other frameworks through the OpenAI-compatible server.

@guocuimi
Collaborator

Thanks for your advice. Yeah, that's our plan. I am working on setting up the wheel build for each release; for now, I am trying to reduce the wheel size first. It should be ready this week. Stay tuned!

@guocuimi
Collaborator

Hi @zhyncs A quick update for you: Python is supported in the latest release.
You can install ScaleLLM with pip: `pip install scalellm`, and start the REST API server with `python3 -m scalellm.serve.api_server`.
Please let me know if you have any questions. Thanks.
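
For anyone trying it out, here is a minimal sketch of querying the server once it is running. The port 8080, the `/v1/completions` endpoint, and the model name are assumptions inferred from the commands later in this thread, not confirmed API details:

```python
import requests

# Minimal sketch: query ScaleLLM's OpenAI-compatible completions endpoint.
# Assumptions: the server listens on port 8080 (as in the benchmark command
# later in this thread) and serves /v1/completions; the model name is
# illustrative.
resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "Llama-2-13b-chat-hf",
        "prompt": "Hello, my name is",
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json())
```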

@zhyncs

zhyncs commented May 20, 2024

> Hi @zhyncs A quick update for you: Python is supported in the latest release. You can install ScaleLLM with pip: `pip install scalellm`, and start the REST API server with `python3 -m scalellm.serve.api_server`. Please let me know if you have any questions. Thanks.

Cool! I will verify it ASAP, thanks.

@zhyncs

zhyncs commented May 20, 2024

Hi @guocuimi The package you are currently building in GitHub Actions depends on GLIBC_2.27, which is unfriendly to CentOS 7, still widely used in industry, and forces those users to compile manually.
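
For reference, a quick way to check which glibc a target host ships, as a minimal sketch using only the standard library (CentOS 7 ships glibc 2.17, well below the 2.27 the wheel requires):

```python
import platform

# Report the C library the interpreter is linked against.
# On CentOS 7 this prints ('glibc', '2.17'), so a wheel built against
# GLIBC_2.27 symbols will fail to load there.
print(platform.libc_ver())
```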

@guocuimi
Collaborator

Thanks for letting me know. Let me try to downgrade the toolchain to GCC 10 and republish new packages using manylinux2014 (CentOS 7 based).

@zhyncs

zhyncs commented Jun 11, 2024

> republish new packages

The latest release works: https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.1.3
Could you also update the docs at https://github.com/vectorch-ai/ScaleLLM/tree/main/docs/source, for example with how to set up an OpenAI-compatible server? Thanks.

@zhyncs

zhyncs commented Jun 11, 2024

If the interface is compatible, then we can directly use vLLM's benchmark script: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py

The recent BentoML blog post https://www.bentoml.com/blog/benchmarking-llm-inference-backends can also serve as a reference.

@guocuimi
Collaborator

guocuimi commented Jun 11, 2024 via email

@zhyncs

zhyncs commented Jun 11, 2024

> we can use it directly

I gave it a quick try, and there seems to be a problem at the moment.

https://github.com/vllm-project/vllm/blob/351d5e7b8253d754b2a951152cd48927c4c1629d/benchmarks/backend_request_func.py#L261-L262

```bash
python3 -m scalellm.serve.api_server --model /workdir/Llama-2-13b-chat-hf

python3 benchmark_serving.py --port 8080 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model Llama-2-13b-chat-hf --tokenizer /workdir/Llama-2-13b-chat-hf --num-prompts 1 --request-rate 1
```

@zhyncs

zhyncs commented Jun 11, 2024

> in the coming weeks

Looking forward to your results.

@guocuimi
Collaborator

Thanks, I have never tried that benchmark script. I will try it after wrapping up the current feature-parity work for logprobs and best_of.
