[int8 quantization] rules for correct Q/DQ node placement with add & concat operations, unclear documentation #3861

Open · Michelvl92 opened this issue May 13, 2024 · 4 comments
Labels: triaged (Issue has been triaged by maintainers)


Michelvl92 commented May 13, 2024

I want to achieve a fully int8 model with maximum speed optimizations in int8, but the documentation is very unclear.

Fusion of nodes: Conv, Sigmoid, Mul, Add?

  • I know that Conv, Sigmoid, Mul will be fused.
  • Is it also possible to fuse Conv, Sigmoid, Mul, Add? If yes, how can I achieve this? This is unclear from the docs: qdq-placement-recs

How should I handle Q/DQ nodes with concat?

  • Should all inputs/outputs have Q/DQ nodes? Is there anything else I should know?

How should I handle Q/DQ nodes with split?

  • Should all inputs/outputs have Q/DQ nodes? Is there anything else I should know?

Why are the Scale & PointWise operations introduced in my graph?

  1. For the two bottlenecks on the right, a PointWise operation is added; what is the reason for this, and why is it added?
  2. Furthermore, from the first conv to the second conv a scale is introduced; what is the reason for this?
  3. In the second bottleneck on the left a scale is added, but not in the first one; what is the reason for this?
  All these operations introduce extra latency; is there an option to omit them?

(screenshots: scale_pointwise, scale_pointwise_onnx)

Strange behavior in the first "bottleneck" layer

Why are extra reformat layers added in the first bottleneck layer, and why are the operations not fused? As you can see, the Q/DQ nodes are correctly placed...

Environment

TensorRT Version:
8.6.3

ONNX Version:
1.15.0, opset 17

Relevant Files

full trt graph

full onnx graph

@ttyio

ttyio (Collaborator) commented May 14, 2024

Is it also possible to fuse Conv, Sigmoid, Mul, Add?

Could you try removing the Q/DQ between the Mul and the Add, and leaving the Q/DQ on the other branch of the Add?
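
For concreteness, here is a rough sketch of that graph edit (my own illustration, assuming onnx-graphsurgeon; the file name and matching logic are hypothetical, not from this model):

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))  # hypothetical path

for q in [n for n in graph.nodes if n.op == "QuantizeLinear"]:
    # Only touch Q nodes fed directly by a Mul
    producers = q.inputs[0].inputs
    if not producers or producers[0].op != "Mul":
        continue
    mul_out = q.inputs[0]
    for dq in [n for n in q.outputs[0].outputs if n.op == "DequantizeLinear"]:
        for consumer in list(dq.outputs[0].outputs):
            if consumer.op == "Add":
                # Rewire the Add to read the Mul output directly,
                # bypassing the Q/DQ pair on this branch only
                for i, t in enumerate(consumer.inputs):
                    if t is dq.outputs[0]:
                        consumer.inputs[i] = mul_out

graph.cleanup().toposort()  # removes the now-dangling Q/DQ nodes
onnx.save(gs.export_onnx(graph), "model_edited.onnx")
```

The Q/DQ on the other branch of the Add is left untouched, as suggested, so that branch still carries an explicit quantization scale.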

How should I handle Q/DQ nodes with concat? How should I handle Q/DQ nodes with split?

The Q node commutes with Concat/Slice, so usually we don't need to handle their inputs specially.
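
As a quick numeric illustration of that commutation (my own example, not from the thread): with one per-tensor scale shared by both branches, quantizing after a concat equals concatenating quantized branches.

```python
import numpy as np

def quantize(x, scale):
    # Symmetric per-tensor int8 quantization
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

rng = np.random.default_rng(0)
a, b, scale = rng.standard_normal(4), rng.standard_normal(4), 0.02

# Q(concat(a, b)) == concat(Q(a), Q(b)) when both inputs share one scale,
# which is why TensorRT can move the Q across the Concat
lhs = quantize(np.concatenate([a, b]), scale)
rhs = np.concatenate([quantize(a, scale), quantize(b, scale)])
assert np.array_equal(lhs, rhs)
```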

for the two bottlenecks on the right there is some PointWise operation added for some strange reason.

Both of the attached files are screenshots of the ONNX file, not the TRT graph. The screenshots are fuzzy, so I am not sure what those pointwise layers are. Could you upgrade your TRT to the latest 10.0 to see if anything changes?

Furthermore from the first conv to the second conv there is a scale introduced, what is the reason for this?

Because your ONNX has the pattern Conv -> Q/DQ -> Split -> Q/DQ -> Conv, it is fused as Conv(INT8-OUT) -> DQ+Q -> Conv(INT8-IN); the dangling DQ+Q in the middle creates a Scale layer. Could you try removing the Q/DQ pair after the first Conv?
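
Sketched in the same onnx-graphsurgeon style as above (illustrative only, reusing the `graph` loaded earlier), the removal rewrites Conv -> Q/DQ -> Split into Conv -> Split, so the dangling DQ+Q can no longer surface as a standalone Scale layer:

```python
for q in [n for n in graph.nodes if n.op == "QuantizeLinear"]:
    producers = q.inputs[0].inputs
    if not producers or producers[0].op != "Conv":
        continue
    conv_out = q.inputs[0]
    for dq in [n for n in q.outputs[0].outputs if n.op == "DequantizeLinear"]:
        for consumer in list(dq.outputs[0].outputs):
            if consumer.op == "Split":
                for i, t in enumerate(consumer.inputs):
                    if t is dq.outputs[0]:
                        consumer.inputs[i] = conv_out  # bypass the Q/DQ pair

graph.cleanup().toposort()
```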

Michelvl92 (Author) commented May 14, 2024

Could you try removing the Q/DQ between the Mul and the Add, and leaving the Q/DQ on the other branch of the Add?

I tried this, but it does not get fused.

The Q node commutes with Concat/Slice, so usually we don't need to handle their inputs specially.

So does this mean that no Q/DQ node is required at the inputs? Are Q/DQ nodes only required for computational operations?

Because your ONNX has the pattern Conv -> Q/DQ -> Split -> Q/DQ -> Conv, it is fused as Conv(INT8-OUT) -> DQ+Q -> Conv(INT8-IN); the dangling DQ+Q in the middle creates a Scale layer. Could you try removing the Q/DQ pair after the first Conv?

So, as you can see, from the pointwise on the left to the last concat/conv there is again a scale introduced. What is the reason for this, and how is this fused? I assumed that you need to add a Q/DQ node after the add operation to get an int8 output, but a scale is still introduced; how can I avoid this? Or is this again because of the Q/DQ before the split?

Both of the attached files are screenshots of the ONNX file, not the TRT graph. The screenshots are fuzzy, so I am not sure what those pointwise layers are. Could you upgrade your TRT to the latest 10.0 to see if anything changes?

Both ONNX and engine graphs were already added in the first post; see the relevant files. I cannot try TRT 10 because 8.6 is the latest version supported by the latest JetPack release for the Jetson AGX Orin.

zerollzeng added the triaged label May 17, 2024
ttyio (Collaborator) commented May 22, 2024

Adding @nzmora for visibility.

Both ONNX and engine graphs were already added in the first post

When I open your attachments, both images are the same.

So does this mean that no Q/DQ node is required at the inputs? Are Q/DQ nodes only required for computational operations?

No Q/DQ is required for Concat/Slice since they commute with Q/DQ.

What is the reason for this, and how is this fused?

I am not sure if this is an 8.6-only issue. Have you tried running your model on TRT 10.0 with a desktop GPU?

Michelvl92 (Author) commented
@ttyio, just a small update from my side (with updated onnx and engine graphs included).

As you can see from the graph, most of the (double) Q/DQ operations are fixed and/or properly placed, which resulted in a reduction in reformat layers and a model that is, in my opinion, fully int8.

But there are 3 strange things happening in the model that I do not understand:

  1. In the skip connections, scale operations are introduced (which were not there before). They take up to 14% of the latency budget. Is there a possibility to reduce this, or is this correct?

I am not sure if this is an 8.6-only issue. Have you tried running your model on TRT 10.0 with a desktop GPU?

I need to deploy models on Jetson systems, which only support TRT 8.6. How would testing on TRT 10.0 help, since it would not be possible to deploy the model on a Jetson, if I am correct?

  2. In the ONNX graph there is one skip connection, but in the engine graph there are two; what is the reason for this? It does not seem to harm model accuracy, but both introduce a scale operation, which adds extra latency.
  3. For some strange reason, in a slightly bigger model with the same architecture, reformat operations are introduced in the first bottleneck layer (though everything is still int8) that are not introduced in the smaller model. Looking at the ONNX graph I do not see anything strange; what is the reason for this? Am I doing something wrong?

model graph onnx
model graph TRT
