
Optimized model slower than original one on CUDAExecutionProvider #787

nicolas-mng opened this issue Dec 6, 2023 · 12 comments

@nicolas-mng

What happened?

Hello,
I've been experimenting with some Olive passes on a custom model containing a transformer and some extra layers. Using the passes seems to slow down both the throughput and the latency. I've tried OrtTransformersOptimization and OnnxQuantization, and they both had the same effect. Have you encountered something like this in your experimentation? Maybe there are some obvious checks I'm missing?
Thanks

Version?

Commit 3c5588d

nicolas-mng added the bug label on Dec 6, 2023
@xiaoyu-work
Contributor

Can you share your pass config and logs? Are you using Olive's built-in metrics?

@nicolas-mng
Author

nicolas-mng commented Dec 11, 2023

Hey, thanks for getting back to me.
These are my pass configs:

self._engine.register(OrtTransformersOptimization, config={
    "model_type": "t5",  # maybe I could play with this parameter
    "use_gpu": True,
    "only_onnxruntime": False,  # I have tried True too
    "float16": True,
    "use_gqa": False,
})
self._engine.register(OnnxQuantization, config={
    "weight_type": "QUInt8",
    "user_script": "olive.py",
    "dataloader_func": "create_dataloader",
    "dataloader_func_kwargs": {
        "num_features": self._num_features,
        "num_targets": self._num_targets,
    },
})

I use the official Throughput metric (priority 2) and a custom metric for accuracy computation (priority 1). Could it be that Olive is weighting the accuracy metric too heavily, which impacts the throughput negatively?

I'm attaching the logs as well.
footprints.json
input_model_metrics.json
run_history_gpu-cuda.txt


Edit: also attaching the OliveModelConfig:

OliveModelConfig.parse_obj(
    {
        "type": "ONNXModel",
        "config": {
            "model_path": model_path.parent,
            "onnx_file_name": model_path.name,
            "inference_settings": _get_session_options(),
            "use_ort_extensions": True,  # ?
            "model_attributes": {"num_key_value_heads": 4},  # impact?
        },
    }
)

@nicolas-mng
Author

I've also tried running without the accuracy metrics, and with different values of batch size to no avail.

@nicolas-mng
Author

I've also tried a simpler architecture (fully connected layers only), with and without quantization, and adding an OrtPerfTuning pass at the end :\

@nicolas-mng
Author

nicolas-mng commented Dec 11, 2023

Also, all of this is happening on GPU with CUDAExecutionProvider. If I optimize on CPUExecutionProvider, I do get a 2x speed-up, but I'd like to optimize my model for GPU inference.
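A common cause of an optimized FP16 model being slower on CUDAExecutionProvider is nodes silently falling back to CPUExecutionProvider. A minimal sketch of how to check this with a standalone onnxruntime session (outside Olive; the model path is a placeholder):

# Sketch: verify which execution providers the session actually uses.
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.log_severity_level = 0  # verbose: logs which nodes land on which provider

session = ort.InferenceSession(
    "optimized_model.onnx",  # placeholder path
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# If CUDAExecutionProvider is missing here, the GPU provider was not picked up at all.
print(session.get_providers())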

nicolas-mng changed the title from "Optimized model slower than original one" to "Optimized model slower than original one on CUDAExecutionProvider" on Dec 11, 2023
@xiaoyu-work
Contributor

Thanks for the configs. I'll take a look at them. In the meantime, can you provide the onnxruntime-gpu package version you are using for your GPU run?

@nicolas-mng
Author

Great, thanks!
onnxruntime-gpu 1.16.3

@trajepl
Contributor

trajepl commented Dec 13, 2023

Also all of this is happening on GPU with CUDAExecutionProvider. If I optimize on CPUExecutionProvider, I do get a 2x speed-up but I'd like to optimize my model for GPU inference.

It seems you are running the quantized model on GPU, right? INT8 is not supported very well on GPU, so it is expected that the CPU run shows better performance.

FP16 would be better for GPU.
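As a sketch based on the pass config posted above, an FP16-only flow for the CUDA run would keep the transformers optimization and drop the quantization pass entirely:

# Sketch reusing the earlier config: keep only OrtTransformersOptimization
# with float16=True for the CUDA EP run, and skip OnnxQuantization (QUInt8),
# since INT8 mainly helps on CPU.
self._engine.register(OrtTransformersOptimization, config={
    "model_type": "t5",
    "use_gpu": True,
    "only_onnxruntime": False,
    "float16": True,
    "use_gqa": False,
})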

@trajepl
Contributor

trajepl commented Dec 13, 2023

BTW, it seems you are optimizing T5 models; here is a quick demo for mt5. There might be something that can be leveraged:
https://github.com/microsoft/Olive/tree/jiapli/mt5_optimum/examples/mt5

@nicolas-mng
Author

Good to know. I guess my model was already FP16, so I shouldn't see much speed-up on that side.
What is the impact of model_type on OrtTransformersOptimization? I am training a custom transformer model for a non-LLM purpose which has an encoder and a decoder, so I thought T5 was the closest, but maybe I should try different values for this parameter? The reason I'm asking is that I am also observing slowdowns if I just use OrtTransformersOptimization with no quantization.

@trajepl
Contributor

trajepl commented Dec 13, 2023

Here are the available model_type values. I am not sure whether it totally fits your case, but for an encoder-decoder model, T5 may be a good choice. Basically, the model type is used to find the proper model architecture so that ORT can apply specific optimization techniques:
https://github.com/microsoft/onnxruntime/blob/1ad6eb135959028bcc0346206c6a8b5cf17d16ee/onnxruntime/python/tools/transformers/optimizer.py#L45
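For context, model_type roughly selects which fusion rules onnxruntime's transformers optimizer applies. A rough sketch of calling that tooling directly, outside Olive, with placeholder paths:

# Rough sketch of what the pass drives under the hood (paths are placeholders).
from onnxruntime.transformers import optimizer

opt_model = optimizer.optimize_model(
    "model.onnx",     # placeholder input path
    model_type="t5",  # selects the T5 fusion rules
    use_gpu=True,
)
opt_model.convert_float_to_float16()  # FP16 for the CUDA EP
opt_model.save_model_to_file("model_opt.onnx")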

Also, for FP16, onnxruntime supports IO binding. You can try enabling it, which will bind the data to the CUDA device before running; it might give a performance improvement. https://github.com/microsoft/Olive/blob/main/olive/evaluator/olive_evaluator.py#L409
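A standalone sketch of IO binding with onnxruntime on CUDA (the model path, input/output names, shape, and dtype are placeholders):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model_opt.onnx",
                               providers=["CUDAExecutionProvider"])

# Copy the input to the GPU once, instead of on every run() call.
x = np.random.rand(1, 128).astype(np.float16)
x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

io_binding = session.io_binding()
io_binding.bind_ortvalue_input("input", x_ortvalue)  # placeholder input name
io_binding.bind_output("output", "cuda")             # keep the output on the GPU

session.run_with_iobinding(io_binding)
result = io_binding.copy_outputs_to_cpu()[0]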

@nicolas-mng
Author

Thanks for the suggestions. Unfortunately, they didn't help: I've tried converting to float16 and turning on IO binding, but my optimized models are still slower than the original ones (again, only on GPU).
And yes, looking at the other models, T5 is the closest to what I am working with.
