
[RFC]: Support specifying quant_config details in the LLM or Server entrypoints #4743

Open
mgoin opened this issue May 10, 2024 · 1 comment


mgoin commented May 10, 2024

🚀 The feature, motivation and pitch

Background:

With the recent support for deepspeedfp quantization introduced in #4652 and #4690, a new issue has emerged from the nature of the runtime quantization implementation. It allows users to load an unquantized model and set the quantization argument to reduce the memory footprint required to load the model. The challenge is that the deepspeedfp implementation has a num_bits parameter that supports quantizing the weights down to either 8 or 6 bits, with a default of 8.

Problem Statement:

Currently, if a user applies quantization="deepspeedfp", vLLM can only quantize the model with num_bits=8, since that is the default value. The only way to change this is to provide a quant_config.json file that explicitly defines the desired num_bits, which prevents users from customizing the quantization settings without editing a configuration file.
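
For illustration, a rough sketch of the current behavior (the model name is a placeholder, and the exact quant_config.json schema depends on the deepspeedfp config implementation):

```python
from vllm import LLM

# Runtime quantization can be requested here, but there is no way to pass
# num_bits through this call, so deepspeedfp falls back to its default of 8 bits.
llm = LLM(model="meta-llama/Llama-2-7b-hf", quantization="deepspeedfp")

# Today's workaround: ship a quant_config.json alongside the model weights that
# sets num_bits explicitly (e.g. a file containing {"num_bits": 6}).
```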

Proposed Solution:

To address this, we propose adding a new argument quant_kwargs: Union[str, Dict] to the common LLM() and OpenAI server interfaces in vLLM. This argument would accept either a dictionary of keyword arguments or a string that can be parsed into one. The purpose of quant_kwargs is to allow users to override the default values, or the values loaded from the config file, of the quantization configuration.

With this argument, users can specify custom quantization settings directly through the API, without modifying a quant_config.json file, making it easier to experiment with different quantization settings for their specific requirements.
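
A minimal sketch of what this could look like; the argument name, accepted forms, and server flag are all part of this proposal rather than existing API, and the model name is a placeholder:

```python
from vllm import LLM

# Proposed: override quantization-config fields directly at load time.
# A dict (or an equivalent JSON string) would be merged into the quant config.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="deepspeedfp",
    quant_kwargs={"num_bits": 6},  # override the deepspeedfp default of 8
)
```

On the server side, the same could be exposed as a CLI flag (for example --quant-kwargs '{"num_bits": 6}') that is parsed into a dict before being applied to the quantization config.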

Alternatives

No response

Additional context

No response

@robertgshaw2-neuralmagic
Collaborator

Can you tag with RFC?

@mgoin mgoin changed the title [Feature]: Support specifying quant_config details in the LLM or Server entrypoints [RFC]: Support specifying quant_config details in the LLM or Server entrypoints May 10, 2024
@zhuohan123 zhuohan123 added the RFC label May 10, 2024