Benchmarking inference of TensorRT 8.6.3 using trtexec on GPU RTX 4090 #3857
Labels
triaged
Issue has been triaged by maintainers
Comments
(Re .1 & .2) We didn't fix the GPU clock. We only have one other process running on the GPU:
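To make the numbers more stable we are considering locking the clocks before benchmarking. This is only a minimal sketch, assuming the driver exposes clock locking on the RTX 4090; the frequency value is a placeholder and should come from what nvidia-smi reports for the card:
nvidia-smi -q -d CLOCK              # query current/supported clocks first
sudo nvidia-smi -pm 1               # enable persistence mode
sudo nvidia-smi -lgc <freq_mhz>     # lock the graphics clock to a fixed frequency (placeholder value)
# ... run the trtexec benchmarks ...
sudo nvidia-smi -rgc                # restore default clock behaviour afterwards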
I saw your issue "About TensorRT Latency Measure" and thought you might have some insight on this?
Description
We're benchmarking our mixed-precision models using:
trtexec --loadEngine=model.engine --useCudaGraph --iterations=100 --avgRuns=100
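As a sanity check we are also considering a more controlled invocation. This is only a sketch, assuming the TensorRT 8.6 trtexec flags behave as documented: --useSpinWait and a longer --warmUp are our guesses for reducing run-to-run jitter, and --noDataTransfers excludes host/device copies from the measured latency.
# sketch of a more controlled run; adjust the engine path and warm-up as needed
trtexec --loadEngine=model.engine --useCudaGraph --useSpinWait --noDataTransfers --warmUp=1000 --iterations=100 --avgRuns=100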
We compared two models: a baseline entirely in FP16, and a variant where we reduced the precision of the first "stage" of the model while keeping the rest in FP16. However, we don't know whether we can trust the results.
For the baseline model (layer info in inspect.txt), the performance summary was:
We then compared it against the mixed-precision model (layer info in inspect.txt); its performance summary was:
While the results are promising, we are puzzled that quantizing a single stage can lead to such a large latency improvement. Can we trust these results?
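To see where the improvement actually comes from, we could also do a per-layer profiling run on both engines. Again just a sketch (profile.json is a placeholder output name); --separateProfileRun should keep the profiling overhead out of the end-to-end latency numbers.
# per-layer timing for one engine; repeat for the baseline and the mixed-precision build
trtexec --loadEngine=model.engine --dumpProfile --separateProfileRun --exportProfile=profile.json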
Environment
TensorRT Version: 8.6.3
Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:24.02-py3
Relevant Files
Model link:
Both models included:
https://drive.google.com/drive/folders/1MJAP7NDO7zzRJlUJFexpTcxKVWT9tnuP?usp=sharing