Use SGMV for prefill, BGMV for decode #464
Merged
Closes #333.
There were broadly two main issues affecting LoRAX throughput for single-adapter performance vs vLLM:

1. The SGMV kernel is well suited to prefill, but adds overhead during decode, where each request contributes only a single token per step.
2. In CUDA graph mode, replayed graphs included computation for LoRA layers the active adapter does not use.
In this PR, we make BGMV the default during decode and apply it in CUDA graph mode. We retain SGMV for prefill (non-CUDA graph) and add just-in-time tracing of the specific LoRA layers in use, so that replay skips computation for unused LoRA layers. All in all, we are now a good bit ahead of vLLM performance on single-LoRA inference, and do even better still at multi-LoRA scale. We further find that using Medusa gives an additional performance boost that makes LoRA inference faster than base model performance (no adapter).
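To illustrate why the split helps, here is a toy PyTorch sketch of the two operators' semantics; this is not the actual LoRAX CUDA kernels, and the reference functions, tensor layout, and `meta` dict are hypothetical:

```python
import torch

def sgmv_ref(x, lora_a, lora_b, segments):
    # SGMV: tokens are grouped into contiguous segments that share one
    # adapter, so each segment is a dense matmul. This amortizes well in
    # prefill, where a single request contributes many tokens.
    # x: [num_tokens, hidden]; lora_a: [num_adapters, hidden, rank];
    # lora_b: [num_adapters, rank, hidden]; segments: [(start, end, adapter_id)]
    out = torch.zeros_like(x)
    for start, end, aid in segments:
        out[start:end] = (x[start:end] @ lora_a[aid]) @ lora_b[aid]
    return out

def bgmv_ref(x, lora_a, lora_b, adapter_idx):
    # BGMV: each token gathers its own adapter, with no segmentation.
    # This matches decode, where each request contributes exactly one
    # token and segments would mostly have length 1.
    a = lora_a[adapter_idx]                   # [num_tokens, hidden, rank]
    b = lora_b[adapter_idx]                   # [num_tokens, rank, hidden]
    xa = torch.einsum("th,thr->tr", x, a)     # per-token x @ A
    return torch.einsum("tr,trh->th", xa, b)  # per-token (x @ A) @ B

def lora_delta(x, lora_a, lora_b, meta, prefill: bool):
    # Dispatch: SGMV for prefill, BGMV for decode (and CUDA graph mode).
    if prefill:
        return sgmv_ref(x, lora_a, lora_b, meta["segments"])
    return bgmv_ref(x, lora_a, lora_b, meta["adapter_idx"])
```

And a minimal sketch of the just-in-time tracing idea, capturing one CUDA graph per set of active LoRA layers on first use. `torch.cuda.CUDAGraph` and `torch.cuda.graph` are the real PyTorch APIs; `forward_fn` and the cache itself are hypothetical, and real capture code would also warm up on a side stream before capturing:

```python
class LoraGraphCache:
    """Lazily capture one CUDA graph per set of active LoRA layers."""

    def __init__(self, forward_fn, static_input):
        self.forward_fn = forward_fn      # one decode step, graph-safe
        self.static_input = static_input  # fixed buffer reused across replays
        self.graphs = {}                  # frozenset(layer ids) -> (graph, out)

    def run(self, input_ids, active_layers):
        key = frozenset(active_layers)
        self.static_input.copy_(input_ids)  # graphs replay fixed addresses
        if key not in self.graphs:
            # Just-in-time trace: only the LoRA layers this adapter actually
            # touches get captured, so replay does no work for unused layers.
            graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(graph):
                static_out = self.forward_fn(self.static_input, key)
            self.graphs[key] = (graph, static_out)
        graph, static_out = self.graphs[key]
        graph.replay()             # rerun the traced kernels on the buffer
        return static_out.clone()  # copy out before the next replay overwrites
```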
Single-adapter throughput:

- vllm + compile (baseline): 61 tokens/s
- lorax (baseline, sgmv only): 52 tokens/s
- lorax + bgmv: 59 tokens/s
- lorax + bgmv + compile: 65 tokens/s
- lorax + bgmv + compile + medusa: 73 tokens/s