
Optimized model slower than original one on CUDAExecutionProvider #787

nicolas-mng opened this issue Dec 6, 2023 · 12 comments

@nicolas-mng

What happened?

Hello,
I've been experimenting with some Olive passes on a custom model containing a transformer and some extra layers. Using the passes seems to slow down both the throughput and the latency. I've tried OrtTransformersOptimization and OnnxQuantization, and they both had the same effect. Have you encountered something like this in your experimentation? Maybe there are some obvious checks I'm missing?
Thanks

Version?

Commit 3c5588d

nicolas-mng added the bug label on Dec 6, 2023
@xiaoyu-work
Contributor

Can you share your pass config and logs? Are you using Olive's built-in metrics?

@nicolas-mng
Author

nicolas-mng commented Dec 11, 2023

Hey, thanks for getting back to me.
These are my pass configs:

self._engine.register(OrtTransformersOptimization, config={
    "model_type": "t5",  # maybe I could play with this parameter
    "use_gpu": True,
    "only_onnxruntime": False,  # I have tried True too
    "float16": True,
    "use_gqa": False,
})
self._engine.register(OnnxQuantization, config={
    "weight_type": "QUInt8",
    "user_script": "olive.py",
    "dataloader_func": "create_dataloader",
    "dataloader_func_kwargs": {
        "num_features": self._num_features,
        "num_targets": self._num_targets,
    },
})

I use the official Throughput metric (priority 2) and a custom metric for accuracy computation (priority 1). Could it be that Olive is weighting the accuracy metric too heavily, which impacts the throughput negatively?

I'm attaching the logs as well.
footprints.json
input_model_metrics.json
run_history_gpu-cuda.txt


Edit: also attaching the OliveModelConfig:

OliveModelConfig.parse_obj(
    {
        "type": "ONNXModel",
        "config": {
            "model_path": model_path.parent,
            "onnx_file_name": model_path.name,
            "inference_settings": _get_session_options(),
            "use_ort_extensions": True,  # ?
            "model_attributes": {"num_key_value_heads": 4},  # impact?
        },
    }
)

@nicolas-mng
Author

I've also tried running without the accuracy metrics, and with different values of batch size to no avail.

@nicolas-mng
Author

I've also tried a simpler architecture (fully connected layers only), with and without quantization, and adding an OrtPerfTuning pass at the end :\

@nicolas-mng
Author

nicolas-mng commented Dec 11, 2023

Also, all of this is happening on GPU with CUDAExecutionProvider. If I optimize on CPUExecutionProvider, I do get a 2x speed-up, but I'd like to optimize my model for GPU inference.
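A common cause of an optimized FP16 model being slower on CUDAExecutionProvider is nodes silently falling back to CPUExecutionProvider. A minimal sketch of how to check this with a standalone onnxruntime session (outside Olive; the model path is a placeholder):

# Sketch: verify which execution providers the session actually uses.
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.log_severity_level = 0  # verbose: logs which nodes land on which provider

session = ort.InferenceSession(
    "optimized_model.onnx",  # placeholder path
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# If CUDAExecutionProvider is missing here, the GPU provider was not picked up at all.
print(session.get_providers())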

nicolas-mng changed the title from "Optimized model slower than original one" to "Optimized model slower than original one on CUDAExecutionProvider" on Dec 11, 2023
@xiaoyu-work
Contributor

Thanks for the configs. I'll take a look at them. In the meantime, can you provide the onnxruntime-gpu package version you are using for your GPU run?

@nicolas-mng
Author

Great, thanks!
onnxruntime-gpu 1.16.3

@trajepl
Contributor

trajepl commented Dec 13, 2023

Also all of this is happening on GPU with CUDAExecutionProvider. If I optimize on CPUExecutionProvider, I do get a 2x speed-up but I'd like to optimize my model for GPU inference.

It seems you are running the quantized model on GPU, right? INT8 is not supported very well on GPU, so it is expected that the CPU run shows better performance.

FP16 would be better for GPU.
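As a sketch based on the pass config posted above, an FP16-only flow for the CUDA run would keep the transformers optimization and drop the quantization pass entirely:

# Sketch reusing the earlier config: keep only OrtTransformersOptimization
# with float16=True for the CUDA EP run, and skip OnnxQuantization (QUInt8),
# since INT8 mainly helps on CPU.
self._engine.register(OrtTransformersOptimization, config={
    "model_type": "t5",
    "use_gpu": True,
    "only_onnxruntime": False,
    "float16": True,
    "use_gqa": False,
})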

@trajepl
Contributor

trajepl commented Dec 13, 2023

BTW, it seems you are optimizing T5 models; here is a quick demo for mt5. There might be something that can be leveraged:
https://github.com/microsoft/Olive/tree/jiapli/mt5_optimum/examples/mt5

@nicolas-mng
Author

Good to know. I guess my model was already FP16, so I shouldn't see much speed-up on that side.
What is the impact of model_type on OrtTransformersOptimization? I am training a custom transformer model for a non-LLM purpose which has an encoder and a decoder, so I thought T5 was the closest, but maybe I should try different values for this parameter? The reason I'm asking is that I am also observing slowdowns if I just use OrtTransformersOptimization with no quantization.

@trajepl
Contributor

trajepl commented Dec 13, 2023

Here are the available model_type values. I am not sure whether it totally fits your case, but for an encoder-decoder model, T5 may be a good choice. Basically, the model type is used to find the proper model architecture so that ORT can apply specific optimization techniques:
https://github.com/microsoft/onnxruntime/blob/1ad6eb135959028bcc0346206c6a8b5cf17d16ee/onnxruntime/python/tools/transformers/optimizer.py#L45
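For context, model_type roughly selects which fusion rules onnxruntime's transformers optimizer applies. A rough sketch of calling that tooling directly, outside Olive, with placeholder paths:

# Rough sketch of what the pass drives under the hood (paths are placeholders).
from onnxruntime.transformers import optimizer

opt_model = optimizer.optimize_model(
    "model.onnx",     # placeholder input path
    model_type="t5",  # selects the T5 fusion rules
    use_gpu=True,
)
opt_model.convert_float_to_float16()  # FP16 for the CUDA EP
opt_model.save_model_to_file("model_opt.onnx")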

Also, for FP16, onnxruntime supports IO binding. You can try enabling it, which will bind the data to the CUDA device before running; it might give a performance improvement. https://github.com/microsoft/Olive/blob/main/olive/evaluator/olive_evaluator.py#L409
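A standalone sketch of IO binding with onnxruntime on CUDA (the model path, input/output names, shape, and dtype are placeholders):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model_opt.onnx",
                               providers=["CUDAExecutionProvider"])

# Copy the input to the GPU once, instead of on every run() call.
x = np.random.rand(1, 128).astype(np.float16)
x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

io_binding = session.io_binding()
io_binding.bind_ortvalue_input("input", x_ortvalue)  # placeholder input name
io_binding.bind_output("output", "cuda")             # keep the output on the GPU

session.run_with_iobinding(io_binding)
result = io_binding.copy_outputs_to_cpu()[0]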

@nicolas-mng
Author

Thanks for the suggestions. Unfortunately, they didn't help: I've tried converting to float16 and turning on IO binding, but my optimized models are still slower than the original ones (again, only on GPU).
And yes, looking at the other models, T5 is the closest to what I am working with.
