
Improve the latency of load_batched_adapter_weights #433

Open
thincal opened this issue Apr 22, 2024 · 1 comment · May be fixed by #434
Labels
enhancement New feature or request

Comments

@thincal (Contributor) commented Apr 22, 2024

Feature request

Currently, every LoRA layer is moved from CPU to the target device of the base model, which adds roughly 20 ms per layer and results in about 500 ms to over 1 s of extra latency overall.

1. First, the adapter weights are loaded into CPU memory:

```python
def load_module_map(
    ...
    for filename in adapter_filenames:
        adapter_weights.update(load_file(filename))
    ...
```

2. Then each tensor is moved to the GPU device inside load_batched_adapter_weights (see the sketch after this list):

```python
lora_a = lora_a.to(base_device, self.dtype)
lora_b = lora_b.to(base_device, self.dtype)
```
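Each of those per-layer .to() calls issues a separate host-to-device copy. A minimal sketch of one way to avoid that, assuming the safetensors load_file(filename, device=...) API; the function name and arguments below are illustrative, not the actual lorax code:

```python
# Sketch only: assumes safetensors' load_file can target a device directly.
# `adapter_filenames`, `base_device`, and `dtype` mirror the snippets above;
# everything else here is hypothetical.
from safetensors.torch import load_file

def load_adapter_weights_to_device(adapter_filenames, base_device, dtype):
    adapter_weights = {}
    for filename in adapter_filenames:
        # Load tensors straight onto the target device instead of CPU first,
        # avoiding the extra host-to-device copy per LoRA layer later on.
        shard = load_file(filename, device=str(base_device))
        adapter_weights.update({k: v.to(dtype) for k, v in shard.items()})
    return adapter_weights
```

With the shards already on base_device, the later lora_a.to(base_device, self.dtype) calls should reduce to cheap device-local casts.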

Motivation

Improve the adapter loading performance.

Your contribution

Yes, I will prepare a PR for review.

@tgaddair (Contributor) commented
Thanks for working on this @thincal! We could probably work around this by keeping weights in the safetensors file rather than loading to CPU as an intermediate step.
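For reference, a rough sketch of what that could look like with safetensors' lazy safe_open API; the helper name and call site are hypothetical, not part of the existing code:

```python
# Sketch of the lazy-loading idea: keep the weights in the safetensors file
# and read each tensor on demand, materializing it directly on the target
# device rather than going through a CPU copy.
from safetensors import safe_open

def get_lora_tensor(filename, tensor_name, base_device):
    # safe_open memory-maps the file, so only the requested tensor is read.
    with safe_open(filename, framework="pt", device=str(base_device)) as f:
        return f.get_tensor(tensor_name)
```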

@tgaddair added the enhancement (New feature or request) label Apr 22, 2024