[Feature]: Is it possible to dynamically adjust the LoRA TP policy for different situations? #4704

yyccli · 2024-05-09T07:19:45Z

🚀 The feature, motivation and pitch

Thanks to the Full TP LoRA PR, we can now use S-LoRA's TP policy. However, this TP strategy gives some performance gain for long prefills, while almost certainly adding latency in the decode stage because of the extra communication operation it introduces.
Here are some profiling results. I tested Qwen1.5 14B with one LoRA model.

In this case, I tested the end-to-end performance of a long prefill followed by a short decode.

In practice we usually have a long prefill, but we do not always get a short decode.
So... is it possible to apply Full TP during prefill and Partial TP during decode?
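
To make the request concrete, here is a minimal sketch of what per-phase selection could look like. Everything in it is hypothetical: `LoRATPPolicy`, `BatchInfo`, `choose_lora_tp_policy`, and the `prefill_token_threshold` value are illustrative names and numbers, not existing vLLM APIs, and the sketch ignores the question of whether both weight layouts can be kept resident at once.

```python
from dataclasses import dataclass
from enum import Enum


class LoRATPPolicy(Enum):
    """Hypothetical labels for the two LoRA tensor-parallel strategies."""
    FULL = "full"        # S-LoRA style: extra communication, helps long prefills
    PARTIAL = "partial"  # no extra communication, lower decode latency


@dataclass
class BatchInfo:
    """Hypothetical description of the batch about to be executed."""
    is_prefill: bool
    num_tokens: int


def choose_lora_tp_policy(batch: BatchInfo,
                          prefill_token_threshold: int = 1024) -> LoRATPPolicy:
    """Pick a LoRA TP policy for one step.

    Assumption: Full TP only pays off when a prefill batch is long enough
    to amortize the added communication. The threshold is a placeholder
    that would have to come from profiling.
    """
    if batch.is_prefill and batch.num_tokens >= prefill_token_threshold:
        return LoRATPPolicy.FULL
    # Decode steps and short prefills avoid the extra communication.
    return LoRATPPolicy.PARTIAL


if __name__ == "__main__":
    print(choose_lora_tp_policy(BatchInfo(is_prefill=True, num_tokens=4096)))
    print(choose_lora_tp_policy(BatchInfo(is_prefill=False, num_tokens=8)))
```

The idea is only that the choice is made per step from the batch type and size, instead of being fixed once at startup.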

Alternatives

No response

Additional context

No response

yyccli added the feature request label May 9, 2024
