🚀 The feature, motivation and pitch
Thanks to the Full TP LoRA PR, we can now use S-LoRA's TP policy. However, this TP strategy provides some performance gains for long prefills, while almost certainly adding latency in the decode stage because of the new communication operation.
Here are some profiling results. I tested Qwen1.5 14B with one LoRA model.
In this case, I tested the end-to-end performance of a long prefill plus a short decode.
Typically we will have a long prefill, but we will not always get a short decode.
So, is it possible to apply Full TP during prefill and Partial TP during decode?
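To make the request concrete, here is a minimal sketch of what a per-phase selection could look like. Everything in it is hypothetical and for illustration only: `LoraTPMode`, `Batch`, `choose_lora_tp_mode`, and the `prefill_token_threshold` value are made-up names and numbers, not existing vLLM APIs, and the threshold would have to come from profiling.

```python
from dataclasses import dataclass
from enum import Enum


class LoraTPMode(Enum):
    """Hypothetical labels for the two LoRA tensor-parallel strategies."""
    FULL = "full"        # S-LoRA style policy; adds a communication op
    PARTIAL = "partial"  # the other policy, without that extra op


@dataclass
class Batch:
    """Hypothetical summary of one scheduler step."""
    num_prefill_tokens: int
    num_decode_tokens: int


def choose_lora_tp_mode(batch: Batch,
                        prefill_token_threshold: int = 1024) -> LoraTPMode:
    """Pick a LoRA TP strategy per step (assumed heuristic, not measured).

    Long prefills are where Full TP is reported to help, so use it only
    when the step carries at least `prefill_token_threshold` prefill tokens.
    Decode-only or short-prefill steps fall back to Partial TP to avoid
    the extra communication latency.
    """
    if batch.num_prefill_tokens >= prefill_token_threshold:
        return LoraTPMode.FULL
    return LoraTPMode.PARTIAL


if __name__ == "__main__":
    print(choose_lora_tp_mode(Batch(num_prefill_tokens=4096, num_decode_tokens=0)))
    print(choose_lora_tp_mode(Batch(num_prefill_tokens=0, num_decode_tokens=8)))
```

Switching per step would presumably also require both weight layouts to be available (or cheap to convert between), which this sketch does not address.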
Alternatives
No response
Additional context
No response