Optimized model slower than original one on CUDAExecutionProvider #787
What happened?

Hello,

I've been experimenting with some Olive passes on a custom model containing a transformer and some extra layers. Using the passes seems to slow down both throughput and latency. I've tried OrtTransformersOptimization and OnnxQuantization, and both had the same effect. Have you encountered something like this in your experimentation? Maybe there are some obvious checks? Thanks.

Version?
Commit 3c5588d

Comments
Can you share your pass config and logs? Are you using Olive's built-in metrics?
Hey, thanks for getting back to me. Here are the pass configs:

```python
self._engine.register(OrtTransformersOptimization, config={
    "model_type": "t5",  # maybe I could play with this parameter
    "use_gpu": True,
    "only_onnxruntime": False,  # I have tried True too
    "float16": True,
    "use_gqa": False,
})
self._engine.register(OnnxQuantization, config={
    "weight_type": "QUInt8",
    "user_script": "olive.py",
    "dataloader_func": "create_dataloader",
    "dataloader_func_kwargs": {
        "num_features": self._num_features,
        "num_targets": self._num_targets,
    },
})
```

I use the official … I'm attaching the logs as well.

Edit: also attaching the model config:

```python
OliveModelConfig.parse_obj(
    {
        "type": "ONNXModel",
        "config": {
            "model_path": model_path.parent,
            "onnx_file_name": model_path.name,
            "inference_settings": _get_session_options(),
            "use_ort_extensions": True,  # ?
            "model_attributes": {"num_key_value_heads": 4},  # impact?
        },
    }
)
```
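(For context, here is a minimal sketch of what `create_dataloader` in `olive.py` could look like, assuming Olive's legacy user-script convention where the dataloader function receives `data_dir` and `batch_size` plus the configured kwargs and returns a torch DataLoader of (inputs, label) pairs. The dataset, the `"input"` name, and the random data below are purely illustrative, not taken from the issue:)

```python
# Hypothetical create_dataloader sketch for olive.py, assuming Olive's legacy
# user-script convention: dataloader_func(data_dir, batch_size, **kwargs).
# Random data is illustrative only; real calibration data should match the
# model's actual input distribution and input names.
import torch
from torch.utils.data import DataLoader, Dataset

class RandomCalibrationDataset(Dataset):
    def __init__(self, num_features, num_targets, size=128):
        self.inputs = torch.randn(size, num_features)
        self.targets = torch.randn(size, num_targets)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # The dict key must match the ONNX model's input name ("input" is assumed).
        return {"input": self.inputs[idx]}, self.targets[idx]

def create_dataloader(data_dir, batch_size, num_features=1, num_targets=1, **kwargs):
    return DataLoader(
        RandomCalibrationDataset(num_features, num_targets),
        batch_size=batch_size,
    )
```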
I've also tried running without the accuracy metrics, and with different batch sizes, to no avail.
I've also tried a simpler architecture (fully connected layers only), with and without quantization, and with an OrtPerfTuning pass at the end :\
Also, all of this is happening on GPU with CUDAExecutionProvider. If I optimize on CPUExecutionProvider, I do get a 2x speed-up, but I'd like to optimize my model for GPU inference.
Thanks for the configs, I'll take a look. In the meantime, can you provide the onnxruntime-gpu package version you are using for your GPU run?
Great, thanks!
It seems you're running the quantized model on GPU, right? INT8 is not well supported on GPU, so it's expected that the CPU run performs better. FP16 would be a better fit for GPU.
BTW, since you seem to be optimizing T5 models, here is a quick demo for mT5; there might be something you can leverage.
Good to know. I guess my model was already FP16, so I shouldn't see much speed-up on that side.
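(For reference, one way to confirm a model really is FP16 is to inspect the initializer types with the onnx package; a quick sketch, where "model.onnx" is a placeholder path:)

```python
# Quick check of an ONNX model's weight precision ("model.onnx" is a placeholder).
import onnx
from onnx import TensorProto

model = onnx.load("model.onnx")
dtypes = {TensorProto.DataType.Name(init.data_type) for init in model.graph.initializer}
print(dtypes)  # e.g. {"FLOAT16"} for a fully converted model, {"FLOAT"} for FP32
```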
Also, for FP16, onnxruntime supports I/O binding; you can try enabling it, which binds the data to the CUDA device before running and might improve performance. https://github.com/microsoft/Olive/blob/main/olive/evaluator/olive_evaluator.py#L409
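(For illustration, a minimal I/O-binding sketch with the onnxruntime Python API; "model.onnx", "input", "output", and the input shape/dtype are placeholders, not taken from the issue:)

```python
# Minimal I/O binding sketch for CUDA inference. Model path, tensor names,
# and input shape/dtype are placeholders and must match your actual model.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

x = np.random.rand(1, 16).astype(np.float16)
# Copy the input to the CUDA device once, up front.
x_dev = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("input", x_dev)
binding.bind_output("output", "cuda")  # keep the output on device too

sess.run_with_iobinding(binding)
result = binding.copy_outputs_to_cpu()[0]  # fetch to host only when needed
```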
Thanks for the suggestions. Unfortunately, they didn't help. I've tried converting to float16 and turning on I/O binding, but my optimized models are still slower than the original ones (again, only on GPU).
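(For anyone reproducing this comparison, a simple like-for-like latency check with warm-up runs looks roughly like the sketch below; both model paths, the input name, and the shape are placeholders:)

```python
# Rough latency comparison sketch; paths and input name are placeholders.
# Warm-up runs matter on CUDA because the first calls include setup cost.
import time
import numpy as np
import onnxruntime as ort

def bench(path, feed, n=100, warmup=10):
    sess = ort.InferenceSession(path, providers=["CUDAExecutionProvider"])
    for _ in range(warmup):
        sess.run(None, feed)
    start = time.perf_counter()
    for _ in range(n):
        sess.run(None, feed)
    return (time.perf_counter() - start) / n  # mean seconds per run

# If the optimized model was converted to FP16 without keeping IO types,
# cast the feed to float16 for that model accordingly.
feed = {"input": np.random.rand(1, 16).astype(np.float32)}
print("original :", bench("original.onnx", feed))
print("optimized:", bench("optimized.onnx", feed))
```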