Unable to perform Whisper GPU Int8 conversion #869
The general workflow to run an optimized model on an ARM device (like Android) is:
1. Run the Olive optimization workflow on a host machine (e.g., the whisper example) to produce an optimized ONNX model (see the sketch below).
2. Deploy the output model on the device with ONNX Runtime.
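For step 1, here is a minimal sketch of invoking an Olive workflow from Python; the config name whisper_gpu_int8.json is an assumption based on the whisper example, and olive_run is Olive's workflow entry point:

```python
# Minimal sketch: run an Olive optimization workflow from Python.
# "whisper_gpu_int8.json" is an assumed config name from the whisper example.
from olive.workflows import run as olive_run

olive_run("whisper_gpu_int8.json")
```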
I suppose the onnxruntime QNN execution provider should be the right runtime for your case. We are also working with the QNN EP team to support that, but as of now there is no standard example to show.
What device are you using to run the GPU optimization? I.e., in step 1, where do you run the whisper example? An ARM or x86_64 device?
@cfasana is the error you are reporting happening when using the optimized model for inference on your Android device, or when running the Olive workflow? If it's the former, as @trajepl says, the CUDA EP that the GPU workflows optimize for is not supported on Android, so it is probably falling back to the CPU EP, which doesn't support the masked attention operator. There is currently no example for optimizing this model for an Android GPU. If it is the latter, please share your package versions and the logs from the run.
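One quick way to collect those package versions is a short Python check; this is a minimal sketch using the PyPI distribution names olive-ai and onnxruntime-gpu:

```python
# Minimal sketch: print the installed versions of the relevant packages.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("olive-ai", "onnxruntime-gpu"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```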
@trajepl I am running all the optimizations in WSL2 Ubuntu 20.04 and then, once I have the models, I use them on the Android device. @jambayk the error occurs when running the Olive workflow, more precisely when using the following command. Here is the result of
Here is the full output I get when executing
Finally, here is the content of the log file:
It seems it cannot find this library. I had a look at my CUDA installation: my CUDA version is 12.* and I can find the library libcublasLt.so.12. Thus, should I install an older version of CUDA?
I use CUDA 12 as well. Also, here is my libcublasLt.so. Have you tried putting the CUDA lib path under LD_LIBRARY_PATH, or creating a symbolic link? Also, please run the following code to check your onnxruntime-gpu installation and ensure the CUDA EP is in your list:

    import onnxruntime as ort
    print(ort.get_available_providers())
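A quick way to confirm whether the loader can actually find the cuBLAS library is to load it directly; this is a minimal diagnostic sketch, assuming the library name libcublasLt.so.12 mentioned above:

```python
# Minimal diagnostic sketch: try to load libcublasLt directly.
# An OSError here usually means the CUDA lib directory is not on
# LD_LIBRARY_PATH for this process.
import ctypes

try:
    ctypes.CDLL("libcublasLt.so.12")
    print("libcublasLt.so.12 loaded OK")
except OSError as err:
    print("failed to load libcublasLt.so.12:", err)
```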
I've asked a similar question (#578), and it looks like the Android mobile GPU (NNAPI) is not supported now; only the CPU execution provider is available.
Can any ONNX model optimized by Olive be deployed in an Android app?
@yurii-k-ring I had already heard about that, but thanks for confirming it. Anyway, I would still like to be able to build the model for the GPU configuration. @FepeIMT yes, you can optimize the ONNX model and then use ONNX Runtime to deploy it on an Android device (https://onnxruntime.ai/docs/tutorials/mobile/).
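For reference, the same InferenceSession API is what the ORT mobile bindings expose on Android (via Java/Kotlin or C). This is a minimal Python sketch with a placeholder model path, just to illustrate loading an Olive-optimized model:

```python
# Minimal sketch: load an Olive-optimized model with ONNX Runtime.
# "model.onnx" is a placeholder for the Olive output model; on Android
# the equivalent calls go through the ORT Java/Kotlin bindings.
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)
```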
Hi, thank you!
It's roughly 700 ms per inference on a Pixel Fold, which I'd argue has a 2022-class Android processor. Both tiny and base run well. This is my Flutter library, FONNX, that supports it on all platforms, so you can run the example app to get an idea of whether it's a good fit before committing to integrating it on your own: https://github.com/Telosnex/fonnx
I am using Olive to optimize and quantize the Whisper model since I have to run it on an Android device with constrained resources.
I was able to successfully convert the model to run on the CPU, both for the FP32 and INT8 precisions.
Now, I would like to understand whether it is also possible to exploit the GPU of the Android device to boost the performance. However, when I try to optimize the model, I get an error.
I installed onnxruntime-gpu and followed the steps described in Olive/examples/whisper/README.md. The error that arises is the following:
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for DecoderMaskedMultiHeadAttention(1) node with name 'Attention_0'
Is there a way to fix it?
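Per the diagnosis above, a likely cause is that the session falls back to the CPU EP, which has no kernel for this contrib op. This is a minimal sketch for checking which providers the session actually ends up with; "model.onnx" is a placeholder for the optimized model path:

```python
# Minimal sketch: check which execution providers the session actually uses.
# Per the discussion above, DecoderMaskedMultiHeadAttention has no CPU kernel,
# so the CUDA EP must be available and take the node.
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # CUDAExecutionProvider should be listed first
```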