Use SGMV for prefill, BGMV for decode #464
Merged
Closes #333.
There were broadly two main issues affecting LoRAX throughput for single-adapter performance vs vLLM:

1. The SGMV kernel is well suited to prefill, but adds overhead during decode, where each request contributes only a single token per step.
2. In CUDA graph mode, replayed graphs included computation for LoRA layers the active adapter does not use.
In this PR, we make BGMV the default during decode and apply it in CUDA graph mode. We retain SGMV for prefill (non-CUDA graph) and add just-in-time tracing of the specific LoRA layers in use, so that replay skips computation for unused LoRA layers. All in all, we are now a good bit ahead of vLLM performance on single-LoRA inference, and do even better still at multi-LoRA scale. We further find that using Medusa gives an additional performance boost that makes LoRA inference faster than base model performance (no adapter).
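To illustrate why the split helps, here is a toy PyTorch sketch of the two operators' semantics; this is not the actual LoRAX CUDA kernels, and the reference functions, tensor layout, and `meta` dict are hypothetical:

```python
import torch

def sgmv_ref(x, lora_a, lora_b, segments):
    # SGMV: tokens are grouped into contiguous segments that share one
    # adapter, so each segment is a dense matmul. This amortizes well in
    # prefill, where a single request contributes many tokens.
    # x: [num_tokens, hidden]; lora_a: [num_adapters, hidden, rank];
    # lora_b: [num_adapters, rank, hidden]; segments: [(start, end, adapter_id)]
    out = torch.zeros_like(x)
    for start, end, aid in segments:
        out[start:end] = (x[start:end] @ lora_a[aid]) @ lora_b[aid]
    return out

def bgmv_ref(x, lora_a, lora_b, adapter_idx):
    # BGMV: each token gathers its own adapter, with no segmentation.
    # This matches decode, where each request contributes exactly one
    # token and segments would mostly have length 1.
    a = lora_a[adapter_idx]                   # [num_tokens, hidden, rank]
    b = lora_b[adapter_idx]                   # [num_tokens, rank, hidden]
    xa = torch.einsum("th,thr->tr", x, a)     # per-token x @ A
    return torch.einsum("tr,trh->th", xa, b)  # per-token (x @ A) @ B

def lora_delta(x, lora_a, lora_b, meta, prefill: bool):
    # Dispatch: SGMV for prefill, BGMV for decode (and CUDA graph mode).
    if prefill:
        return sgmv_ref(x, lora_a, lora_b, meta["segments"])
    return bgmv_ref(x, lora_a, lora_b, meta["adapter_idx"])
```

And a minimal sketch of the just-in-time tracing idea, capturing one CUDA graph per set of active LoRA layers on first use. `torch.cuda.CUDAGraph` and `torch.cuda.graph` are the real PyTorch APIs; `forward_fn` and the cache itself are hypothetical, and real capture code would also warm up on a side stream before capturing:

```python
class LoraGraphCache:
    """Lazily capture one CUDA graph per set of active LoRA layers."""

    def __init__(self, forward_fn, static_input):
        self.forward_fn = forward_fn      # one decode step, graph-safe
        self.static_input = static_input  # fixed buffer reused across replays
        self.graphs = {}                  # frozenset(layer ids) -> (graph, out)

    def run(self, input_ids, active_layers):
        key = frozenset(active_layers)
        self.static_input.copy_(input_ids)  # graphs replay fixed addresses
        if key not in self.graphs:
            # Just-in-time trace: only the LoRA layers this adapter actually
            # touches get captured, so replay does no work for unused layers.
            graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(graph):
                static_out = self.forward_fn(self.static_input, key)
            self.graphs[key] = (graph, static_out)
        graph, static_out = self.graphs[key]
        graph.replay()             # rerun the traced kernels on the buffer
        return static_out.clone()  # copy out before the next replay overwrites
```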
Single-adapter throughput:

- vllm + compile (baseline): 61 tokens/s
- lorax (baseline, sgmv only): 52 tokens/s
- lorax + bgmv: 59 tokens/s
- lorax + bgmv + compile: 65 tokens/s
- lorax + bgmv + compile + medusa: 73 tokens/s