Improve the latency of load_batched_adapter_weights
#433
Labels: enhancement (New feature or request)
Comments

thincal changed the title to "Improve the latency of load_batched_adapter_weights" on Apr 22, 2024.
Thanks for working on this @thincal! We could probably work around this by keeping weights in the safetensors file rather than loading to CPU as an intermediate step.
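A minimal sketch of that workaround, assuming the adapter is stored as a safetensors file and PyTorch tensors are wanted; the file path, function name, and device string are placeholders, not actual lorax code. `safe_open` can materialize each tensor directly on the target device, skipping the intermediate CPU copy:

```python
import torch
from safetensors import safe_open

def load_adapter_weights_direct(path: str, device: str = "cuda:0") -> dict[str, torch.Tensor]:
    """Load adapter tensors straight onto `device` from a safetensors file."""
    weights = {}
    with safe_open(path, framework="pt", device=device) as f:
        for key in f.keys():
            # Each tensor is allocated directly on `device`; no CPU staging copy.
            weights[key] = f.get_tensor(key)
    return weights
```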
Feature request
Currently, every LoRA layer's weights are moved from CPU to the target device of the base model, which adds roughly 20 ms per layer and 500 ms to over 1 s of loading latency overall.
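For illustration only, a hypothetical sketch of the pattern described above (the function and variable names are made up; this is not the actual `load_batched_adapter_weights` implementation): each layer's tensor starts on CPU and pays its own host-to-device transfer.

```python
import torch

def move_adapter_weights(cpu_state_dict: dict[str, torch.Tensor], device: str) -> dict[str, torch.Tensor]:
    # One synchronous host-to-device copy per LoRA layer; at ~20 ms each,
    # a few dozen layers add up to the 500 ms - 1+ s reported above.
    return {name: tensor.to(device) for name, tensor in cpu_state_dict.items()}
```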
Motivation
Improve the adapter loading performance.
Your contribution
Yes, I will prepare a PR for review.