[Draft] Qualcomm AI Engine Direct - [WIP] llama2... #3656

Draft · wants to merge 1 commit into main
Conversation

@chiwwang (Collaborator) commented May 17, 2024

examples/qualcomm/llama2/llama.py can be used like:

```
python examples/qualcomm/llama2/llama.py -a llama_only_quant \
  -b build_android \
  -m SM8650 \
  --ptq 16a4w \
  --tokenizer_model tokenizer.model \
  --checkpoint stories110M.pt \
  --params params.json \
  --tokenizer_bin tokenizer.bin \
  --prompt Once
```

Note that we don't have a runner for the unsplit llama2.

It's still FAR AWAY from a workable statically quantized llama2-7b.
Stories-110M might work on 16a4w HTP, but please note that calibration() has not been done well yet.
Below is a reference command, but it can change at any time:

```
python examples/qualcomm/llama2/composite_llama.py \
  -a storiesllama_16a4w \
  -b build_android \
  -s <device_id> \
  -H <host_connecting_device> \
  -m SM8650 \
  --ptq 16a4w \
  --tokenizer_model tokenizer.model \
  --checkpoint stories110M.pt \
  --params params.json \
  --tokenizer_bin tokenizer.bin \
  --prompt Once \
  --temperature 0
```

What we did to optimize performance on HTP is listed below:

  1. The single multi-head attention is transformed into multiple single-head attentions (see the sketch after this list).
  2. KV-cache is changed to graph I/O; the update is performed in qnn_llama_runner.cpp on CPU (see the sketch after this list).
  3. llama2 is partitioned into 6 .pte files in examples/qualcomm/llama2/composite_llama.py.
  4. Embedding is quantized. This might need further investigation, e.g., whether we can move it out of the model and run it on CPU.
  5. u16 and u8 mixed-precision quantization is supported.
  6. KV-cache is left in quantized format in graph I/O.
  7. RMSNorm is tweaked a bit to reduce its quantization sensitivity.
  8. The HTP Spill-Fill buffer feature is used among the .pte files.
  9. All Linear layers are converted to Conv2d (see the sketch after this list).
  10. quant_min and quant_max in Observers are set properly, giving offset=128 in symmetric quantization (see the sketch after this list).
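
For item 1, a minimal sketch of the transformation, assuming nothing about the actual module names in this PR (SingleHeadAttention and MultiAsSingleHeads are ours): each head gets its own small q/k/v projections, and concatenating the per-head outputs reproduces the usual multi-head result while giving HTP many small ops instead of one large batched one.

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    # One head with its own small projections (hypothetical sketch).
    def __init__(self, dim: int, head_dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, head_dim, bias=False)
        self.wk = nn.Linear(dim, head_dim, bias=False)
        self.wv = nn.Linear(dim, head_dim, bias=False)
        self.scale = head_dim ** -0.5

    def forward(self, x):
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

class MultiAsSingleHeads(nn.Module):
    # n_heads independent single-head attentions whose concatenated outputs
    # match one multi-head attention (masking/cache omitted for brevity).
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        head_dim = dim // n_heads
        self.heads = nn.ModuleList(
            [SingleHeadAttention(dim, head_dim) for _ in range(n_heads)]
        )
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.wo(torch.cat([h(x) for h in self.heads], dim=-1))
```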
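For item 2, a rough Python sketch of the dataflow only; the real update lives in qnn_llama_runner.cpp, and the names, shapes, and `graph` callable here are illustrative assumptions, not this PR's API:

```python
def decode_step(graph, token, k_cache, v_cache, pos):
    # The compiled graph is pure: caches enter as inputs, and the new k/v
    # slices leave as outputs. Nothing inside the graph mutates state.
    logits, k_new, v_new = graph(token, k_cache, v_cache, pos)
    # The runner writes the new slice back on CPU between invocations
    # (cache layout assumed to be [batch, heads, seq, head_dim]).
    k_cache[:, :, pos, :] = k_new
    v_cache[:, :, pos, :] = v_new
    return logits
```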
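For item 9, a minimal sketch of the Linear-to-Conv2d rewrite (the helper name is ours): a 1x1 convolution over a [N, C, 1, 1] layout computes the same affine map as a Linear layer, and HTP tends to prefer conv kernels.

```python
import torch
import torch.nn as nn

def linear_to_conv2d(linear: nn.Linear) -> nn.Conv2d:
    # A 1x1 Conv2d applied to (N, in_features, 1, 1) is the same matmul
    # plus bias as the original Linear layer.
    conv = nn.Conv2d(linear.in_features, linear.out_features,
                     kernel_size=1, bias=linear.bias is not None)
    conv.weight.data = linear.weight.data.view(*linear.weight.shape, 1, 1).clone()
    if linear.bias is not None:
        conv.bias.data = linear.bias.data.clone()
    return conv

# Quick equivalence check:
lin = nn.Linear(8, 16)
conv = linear_to_conv2d(lin)
x = torch.randn(4, 8)
assert torch.allclose(lin(x), conv(x.view(4, 8, 1, 1)).view(4, 16), atol=1e-5)
```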
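For item 10, a sketch of what bounding the observers means; the PR's exact observer/quantizer config may differ. Trimming the signed symmetric range to [-127, 127] gives the same number of codes on each side of zero, and on an unsigned backend that grid maps to codes 1..255 with the zero point pinned at exactly 128, i.e. offset=128:

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

# Symmetric 8-bit quantization with the extra negative code dropped:
# [-127, 127] instead of the default [-128, 127], so the grid is exactly
# symmetric around zero (offset 128 in unsigned representation).
observer = MinMaxObserver(
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
    quant_min=-127,
    quant_max=127,
)
observer(torch.randn(1024))   # record min/max from calibration data
scale, zero_point = observer.calculate_qparams()
print(scale, zero_point)
```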


pytorch-bot bot commented May 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3656

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit aaada7f with merge base 4008600, 3 new jobs have failed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label May 17, 2024