TensorRT mixed precision or INT8 conversion: mixed precision has almost the same size and speed as INT8 but better precision, and the converted model gives good detection results with mixed precision. #10046
base: main
Conversation
@glenn-jocher this is my last pull request.
The image size needs to be kept the same when exporting to engine and when running inference with the engine.
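For instance, a minimal sketch of keeping imgsz consistent between export and inference (model name and flags are illustrative):

from ultralytics import YOLO

imgsz = 640
model = YOLO("yolov8n.pt")
engine_path = model.export(format="engine", imgsz=imgsz)  # export at a fixed image size
model = YOLO(engine_path, task="detect")
model.predict("https://ultralytics.com/images/bus.jpg", imgsz=imgsz)  # infer at the same size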
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10046      +/-   ##
==========================================
- Coverage   78.83%   75.67%   -3.17%
==========================================
  Files         121      122       +1
  Lines       15351    15404      +53
==========================================
- Hits        12102    11657     -445
- Misses       3249     3747     +498
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Hey there! 👋 Thanks for sharing your insights and improvements on mixed precision and INT8 conversion. It's great to see your continuous effort to enhance inference efficiency while keeping the process streamlined. I absolutely agree that leveraging a larger batch size for calibration can be crucial for achieving accurate INT8 quantization, as recommended by the NVIDIA documentation. Your approach to setting calibration batch sizes dynamically versus the ONNX export settings seems like a practical solution. Finally, ensuring consistent image sizes during export and inference is key to maintaining performance and avoiding any unexpected behavior. Your contributions are highly valued, and your last pull request seems to encapsulate these thoughtful changes well. Keep up the fantastic work! If there's anything more we can assist with or discuss further, feel free to reach out. Happy coding! 😊
@ZouJiu1 currently this is not working with either argument. Also, please don't close and re-open a PR when you make changes; the entire reason to use a single PR is so that the changes can be reviewed and tracked in one place.
@Burhan-Q ok, I will commit to this PR directly. The calib_batch batch size will affect the calibration. The calibration data is set by the "source" argument, so I think we need to add an example to the documentation, e.g. downloading the VOC2007 dataset and unzipping it automatically. calib_batch currently leads to an AttributeError, which needs to be fixed. Now it is time to sleep. 23:02, see you tomorrow.
@Burhan-Q, I added the calib_batch argument to default.yaml, and added INT8 and mixed-precision TensorRT examples and the argument to some documents. There is no AttributeError with calib_batch anymore, and the usage and explanation have been added to the documentation.
@ZouJiu1 I see your additions and will have to provide my feedback tomorrow. I have done some testing and I think that there is a lot of work still needed to clean up and optimize this addition. As it is now, this is a lot more code than is needed and is likely easily broken.
@ZouJiu1 I think that the "mixed precision" option is not necessary per the NVIDIA TensorRT documentation:
While running calibration (implicit quantization), TensorRT will select the optimal data type for each layer. I have confirmed that there is a mix of layer precisions in the engines exported with int8=True.
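One way to see the per-layer choices is TensorRT's engine inspector; a hedged sketch, assuming a plain serialized engine (Ultralytics-exported .engine files may prepend a metadata header, so the path here is illustrative) built with detailed profiling verbosity:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open("yolov8n.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Per-layer precision is only reported if the engine was built with
# config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
inspector = engine.create_engine_inspector()
for i in range(engine.num_layers):
    print(inspector.get_layer_information(i, trt.LayerInformationFormat.ONELINE))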
@Burhan-Q I understand that explicit quantization has many Q/DQ nodes, like PyTorch QAT or pytorch_quantization QAT; it generates many Quantization/Dequantization nodes. I also agree that the TensorRT conversion will select the computational precision based on performance considerations (sampleINT8API, lastest_sampleINT8API, explicit-implicit-quantization). But using mixed precision, you can determine the precision of each layer. If the converted engine's precision, recall, or mAP is not good, or even very bad, then we can use mixed precision to get higher precision, recall, or mAP. I didn't test the converted engine files' precision, recall, or mAP, so I cannot tell you how much mixed precision helps to increase them, but I think mixed precision is better than INT8.

When I set the ONNX input and output to half and used it to convert an INT8 engine, the converted engine file has half input and half output. All the models (yolov8x_int8.engine, yolov8n_int8.engine, yolov8l_int8.engine, and so on) have no inference result on bus.jpg. After I pushed several pull requests, I made some modifications to my code and found the reason why the INT8 engine has no detection result: the input and output dtype=half (FP16). The input and output dtype have an important and significant impact on the result. When the input and output dtype are both half (FP16), mixed precision gives a good result, but INT8 gives no result. So I think mixed precision is better. Because setting the precision forces TensorRT to choose the implementations which run at this precision, you can determine each layer's precision yourself instead of letting TensorRT choose.
The mixed precision example in the TensorRT GitHub, release/10.0/samples/python/efficientdet/build_engine.py#L188-L218:

def set_mixed_precision(self):
    """
    Experimental precision mode.
    Enable mixed-precision mode. When set, the layers defined here will be forced to FP16 to maximize
    INT8 inference accuracy, while having minimal impact on latency.
    """
    self.config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
    self.config.set_flag(trt.BuilderFlag.DIRECT_IO)
    self.config.set_flag(trt.BuilderFlag.REJECT_EMPTY_ALGORITHMS)

    # All convolution operations in the first four blocks of the graph are pinned to FP16.
    # These layers have been manually chosen as they give a good middle-point between int8 and fp16
    # accuracy in COCO, while maintaining almost the same latency as a normal int8 engine.
    # To experiment with other datasets, or a different balance between accuracy/latency, you may
    # add or remove blocks.
    for i in range(self.network.num_layers):
        layer = self.network.get_layer(i)
        if layer.type == trt.LayerType.CONVOLUTION and any([
            # AutoML Layer Names:
            "/stem/" in layer.name,
            "/blocks_0/" in layer.name,
            "/blocks_1/" in layer.name,
            "/blocks_2/" in layer.name,
            # TFOD Layer Names:
            "/stem_conv2d/" in layer.name,
            "/stack_0/block_0/" in layer.name,
            "/stack_1/block_0/" in layer.name,
            "/stack_1/block_1/" in layer.name,
        ]):
            self.network.get_layer(i).precision = trt.DataType.HALF
            log.info("Mixed-Precision Layer {} set to HALF STRICT data type".format(layer.name))
@ZouJiu1 I'm not seeing INT8 input and output layers when exporting the quantized models; they're FLOAT (FP32).
The default is that they will be FP32 and not INT8, even when exporting to INT8 quantized models. To be clear, I'm not questioning your tests or results, I believe you. The issue is that including a "mixed precision" option is going to have limited benefit for the majority of users and would primarily become more of an issue for maintenance.
I tested with exporting as follows:

from ultralytics import YOLO, ASSETS
model = YOLO("yolov8x.pt")
im = ASSETS / "bus.jpg"
out = model.export(format="engine", data="coco.yaml", int8=True, batch=8, dynamic=True, workspace=2)
model = YOLO(out, task="detect")
result = model.predict(im)
>>> Loading yolov8x.engine for TensorRT inference...
>>> [04/17/2024-09:28:08] [TRT] [I] Loaded engine size: 72 MiB
>>> [04/17/2024-09:28:08] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1110, now: CPU 0, GPU 1176 (MiB)
>>> image 1/1 ultralytics/assets/bus.jpg: 640x640 4 persons, 1 bus, 10.9ms
Speed: 4.3ms preprocess, 10.9ms inference, 1243.7ms postprocess per image at shape (1, 3, 640, 640)
result[0].boxes.data
>>> tensor([[1.0418e+01, 2.2991e+02, 7.9810e+02, 7.3954e+02, 9.4087e-01, 5.0000e+00],
[2.2316e+02, 4.0466e+02, 3.4449e+02, 8.4835e+02, 8.6566e-01, 0.0000e+00],
[6.6894e+02, 3.9504e+02, 8.1000e+02, 8.7364e+02, 8.6506e-01, 0.0000e+00],
[5.0419e+01, 3.9674e+02, 2.4674e+02, 9.0418e+02, 8.5865e-01, 0.0000e+00],
[6.2339e-02, 5.5515e+02, 7.8911e+01, 8.7119e+02, 7.1263e-01, 0.0000e+00]], device='cuda:0')
eng = model.predictor.model.model
eng.get_tensor_dtype("images")
>>> <DataType.FLOAT: 0>
eng.get_tensor_dtype("output0")
>>> <DataType.FLOAT: 0>
@ZouJiu1 my intention in giving feedback here is that I want to try to help make your PR something that could be merged. I had planned to add TensorRT INT8 support myself, but since you started a PR, I thought I'd try to collaborate with you on making it work. The issue is that in its current state, this PR is unlikely to be accepted. We can work together on making changes, or you can leave it as is and I can open my own PR, which is more likely (though not certain) to be accepted. Please let me know how you'd like to proceed.
@Burhan-Q, ok, understood. I just don't know what I should do as the next step; maybe remove the mixed precision part from the code and docs and keep only INT8 in this PR. If the mixed precision part should be removed, I will do it. Also, if you have a better INT8 implementation, I think you should open a PR; no need to worry about my PR. What I used before is below. I modified engine/exporter.py#L240-L241 from:

if self.args.half and onnx and self.device.type != "cpu":
    im, model = im.half(), model.half()  # to FP16

to:

if (self.args.half or self.args.int8) and engine and self.device.type != "cpu":
    im, model = im.half(), model.half()  # to FP16

Then the engine's input and output dtype will be FP16 (half), and the detection will have no inference result.
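Conversely, a hedged sketch of keeping the bindings in FP32 for an INT8 build; network stands for the INetworkDefinition created during export (an assumed name, not this PR's exact code):

import tensorrt as trt

def force_fp32_io(network):
    # Pin the network inputs/outputs to FP32 even when internal layers run in
    # INT8/FP16, avoiding the half input/output that broke detections above.
    for i in range(network.num_inputs):
        network.get_input(i).dtype = trt.DataType.FLOAT
    for i in range(network.num_outputs):
        network.get_output(i).dtype = trt.DataType.FLOAT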
The code I used to convert:

import os
import gc
import sys
sys.path.append(r'E:\work\codeRepo\deploy\jz\ultralytics')
from ultralytics import YOLO  # newest version from "git clone and git pull"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

if __name__ == '__main__':
    file = r'yolov8n.pt'
    # file = r'yolov8n-cls.pt'
    # file = r'yolov8n-seg.pt'
    # file = r'yolov8n-pose.pt'
    # file = r'yolov8n-obb.pt'
    # task: [classify, detect, segment, pose, obb]
    model = YOLO(file, task='detect')  # load a pretrained model (recommended for training)
    calib_input = r'E:\work\codeRepo\deploy\jz\val2017'
    '''
    https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#enable_int8_c
    To avoid this issue, calibrate with as large a single batch as possible,
    and ensure that calibration batches are well randomized and have similar distribution.
    '''
    imgsz = 640
    model.export(format="engine", source=calib_input, batch=1, calib_batch=20,
                 simplify=True, half=True, int8=True, device=0,
                 imgsz=imgsz)
    del model
    gc.collect()
    model = YOLO(r"E:\work\%s" % (file.replace(".pt", ".engine")))
    result = model.predict('https://ultralytics.com/images/bus.jpg',
                           save=True,
                           imgsz=imgsz)
    eng = model.predictor.model.model
    k = eng.get_tensor_dtype("images")
    k1 = eng.get_tensor_dtype("output0")

The result has no detections. However, if I use mixed precision to convert, detection gives a good result:

model.export(format="engine", source=calib_input, batch=1, calib_batch=20,
             simplify=True, half=True, int8=True, device=0,
             imgsz=imgsz)

The output log:
This is why I think mixed precision is better than INT8. But now I cannot reproduce the result with INT8 input and output. Maybe something is wrong somewhere; I am not sure where.
I agree with you that the maintenance burden would be much higher, so I have removed the mixed precision part now. If you find that mixed precision is better and necessary, let me know and I will add it again.
Thanks for understanding and taking action on the feedback! It's always great to see such responsive and considerate collaboration. 🙌 If we find a strong need for the mixed precision feature in the future, we'll definitely reach out for your insights and contributions. For now, focusing on improving and refining INT8 support seems like our best path forward. Keep up the fantastic work! If you have any further updates or questions, don't hesitate to share.
I found this to be an issue in my testing as well. I still have some work to do before it's ready, but I also opened PR #10165 for adding INT8 with TensorRT and included some of my results as well.
I have checked for existing contributions: before submitting, I confirmed my contribution is unique and complementary.
No related issues.
new feature
running the model on an edge device like NVIDIA Jetson
So I want to convert it to INT8. When I added code to ultralytics and converted yolov8x.pt to an INT8 engine, the conversion itself was OK, but when running inference with INT8, I found it has no detection result.
I searched many blogs and GitHub issues, including the TensorRT GitHub, looking for a reason and a solution, but I could not find one. I tried all the official yolov8*.pt with INT8; the converted INT8 engines all have no detection result. But I found a possible solution: mixed precision https://github.com/NVIDIA/TensorRT/blob/main/samples/python/efficientdet/build_engine.py#L188-L218. The mixed precision definition is that some layers are FP16 while other layers are INT8.
When I used mixed precision to convert all the official yolov8*.pt, keeping the first, second, and last convolution layers FP16 and the other layers INT8, the converted engines have a good detection result on bus.jpg, and the size is only about 200 KB bigger, because just three convolution layers use FP16 while the others still use INT8. So my code is right, and my conversion and calibration procedure are correct. I think YOLOv8 can add mixed precision model conversion for better performance; a sketch of the idea follows.
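For reference, a hedged sketch of that pinning, adapted from the efficientdet sample quoted earlier; network and config stand for the INetworkDefinition and IBuilderConfig built during export (assumed names, not this PR's exact code):

import tensorrt as trt

def pin_edge_convs_to_fp16(network, config):
    # Honor per-layer precision requests during tactic selection.
    config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
    convs = [network.get_layer(i) for i in range(network.num_layers)
             if network.get_layer(i).type == trt.LayerType.CONVOLUTION]
    # First, second, and last convolutions in FP16; the rest calibrate to INT8.
    for layer in (convs[0], convs[1], convs[-1]):
        layer.precision = trt.DataType.HALF
        layer.set_output_type(0, trt.DataType.HALF)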
i. The procedure for converting a mixed precision engine:
download coco val_2017.zip and unzip it to val_2017.
set calib_input=./val_2017, cache_file=./calibration.cache, half=True, int8=True
run the script; it will convert the model to an engine file in mixed precision mode and run inference on bus.jpg.
Before running, I commented out line 322,
raise SyntaxError(string + CLI_HELP_MSG) from e
in the file ultralytics\cfg\__init__.py.
ii. The procedure for converting an INT8 engine: all the same except for this (half=False, int8=True).
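Concretely, the two procedures differ only in the half flag; a short sketch using the source and calib_batch arguments added by this PR (not upstream Ultralytics options):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# i. mixed precision engine: half=True together with int8=True
model.export(format="engine", source="./val_2017", calib_batch=20,
             half=True, int8=True, imgsz=640)
# ii. plain INT8 engine: identical except half=False
model.export(format="engine", source="./val_2017", calib_batch=20,
             half=False, int8=True, imgsz=640)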
The converted models (INT8 or mixed precision) can be downloaded from https://www.alipan.com/s/FdfFoPDGCWH, built with TensorRT 10.0.0b6. *mix is a mixed precision engine, *int8 is an INT8 engine, and "nod" means no detection.
mixed precision log
The engines using mixed precision have a good detection result on bus.jpg, while the engines using INT8 have no result.
related pull requests #9840, #9941 and #9969
differences to #9840
I changed the input and output to FLOAT, not FP16; FLOAT input and output is better (see the FP32 I/O sketch earlier in the thread).
use a different Calibrator base class
In the document tensorrt/developer-guide, we can use different Calibrator base classes in calibrator.py: IInt8EntropyCalibrator2 is suitable for CNN networks, while IInt8MinMaxCalibrator is suitable for NLP networks, so different calibrators are worth trying; a sketch follows.
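A minimal calibrator sketch in the standard TensorRT Python pattern; the batch pipeline is an assumption, and swapping the base class to trt.IInt8MinMaxCalibrator is the only change needed to try the MinMax strategy:

import os

import numpy as np
import pycuda.autoinit  # noqa: F401 (creates the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, batch_size, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)  # iterable of float32 arrays shaped (N, 3, H, W)
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_input = None

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None  # tells TensorRT the calibration data is exhausted
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)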
int8 conversion reference
https://github.com/NVIDIA/TensorRT/blob/main/samples/python/detectron2
mixed precision conversion reference (much better)
Tip: the first layer and last layer should be FP16; the others can be INT8. Just try it.
https://github.com/NVIDIA/TensorRT/blob/main/samples/python/efficientdet
The code below can be used with the first commit, commits/82ec7ccd7cf77353d9b39f87c317f88878a2a34b.
I have read the CLA Document and I sign the CLA
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Introducing Mixed Precision & Enhanced INT8 Support for Accelerated Inference! ⚡🔍
📊 Key Changes
🎯 Purpose & Impact
⚙️🚀 Whether it's speeding up your existing models or pushing the envelope on accuracy, these updates are all about giving you the tools to make the most out of your AI solutions.