[int8 quantization] rules for correct Q/DQ node placement with add & concat operations, unclear documentation #3861

Open · Michelvl92 opened this issue May 13, 2024 · 4 comments
Labels: triaged (Issue has been triaged by maintainers)


Michelvl92 commented May 13, 2024

I want to achieve a fully int8 model with maximum speed optimizations in int8, but the documentation is very unclear.

Fusion of nodes: Conv, Sigmoid, Mul, Add?

  • I know that Conv, Sigmoid, Mul will be fused.
  • Is it also possible to fuse Conv, Sigmoid, Mul, Add? If yes, how can I achieve this? This is unclear from the docs: qdq-placement-recs

How should I handle Q/DQ nodes with concat?

  • Should all inputs/outputs have Q/DQ nodes? Is there anything else I should know?

How should I handle Q/DQ nodes with split?

  • Should all inputs/outputs have Q/DQ nodes? Is there anything else I should know?

Why are the Scale & PointWise operations introduced in my graph?

  1. For the two bottlenecks on the right, a PointWise operation is added; what is the reason for this, and why is it added?
  2. Furthermore, from the first conv to the second conv a scale is introduced; what is the reason for this?
  3. In the second bottleneck on the left a scale is added, but not in the first one; what is the reason for this?
  All these operations introduce extra latency; is there an option to omit them?

(screenshots: scale_pointwise, scale_pointwise_onnx)

Strange behavior in the first "bottleneck" layer

Why are extra reformat layers added in the first bottleneck layer, and why are the operations not fused? As you can see, the Q/DQ nodes are correctly placed...

Environment

TensorRT Version:
8.6.3

ONNX Version:
1.15.0, opset 17

Relevant Files

full trt graph

full onnx graph

@ttyio

ttyio (Collaborator) commented May 14, 2024

Is it also possible to fuse Conv, Sigmoid, Mul, Add?

Could you try removing the Q/DQ between the Mul and the Add, and leaving the Q/DQ on the other branch of the Add?
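
For concreteness, here is a rough sketch of that graph edit (my own illustration, assuming onnx-graphsurgeon; the file name and matching logic are hypothetical, not from this model):

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))  # hypothetical path

for q in [n for n in graph.nodes if n.op == "QuantizeLinear"]:
    # Only touch Q nodes fed directly by a Mul
    producers = q.inputs[0].inputs
    if not producers or producers[0].op != "Mul":
        continue
    mul_out = q.inputs[0]
    for dq in [n for n in q.outputs[0].outputs if n.op == "DequantizeLinear"]:
        for consumer in list(dq.outputs[0].outputs):
            if consumer.op == "Add":
                # Rewire the Add to read the Mul output directly,
                # bypassing the Q/DQ pair on this branch only
                for i, t in enumerate(consumer.inputs):
                    if t is dq.outputs[0]:
                        consumer.inputs[i] = mul_out

graph.cleanup().toposort()  # removes the now-dangling Q/DQ nodes
onnx.save(gs.export_onnx(graph), "model_edited.onnx")
```

The Q/DQ on the other branch of the Add is left untouched, as suggested, so that branch still carries an explicit quantization scale.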

How should I handle Q/DQ nodes with concat? How should I handle Q/DQ nodes with split?

The Q node commutes with Concat/Slice, so usually we don't need to handle their inputs specially.
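
As a quick numeric illustration of that commutation (my own example, not from the thread): with one per-tensor scale shared by both branches, quantizing after a concat equals concatenating quantized branches.

```python
import numpy as np

def quantize(x, scale):
    # Symmetric per-tensor int8 quantization
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

rng = np.random.default_rng(0)
a, b, scale = rng.standard_normal(4), rng.standard_normal(4), 0.02

# Q(concat(a, b)) == concat(Q(a), Q(b)) when both inputs share one scale,
# which is why TensorRT can move the Q across the Concat
lhs = quantize(np.concatenate([a, b]), scale)
rhs = np.concatenate([quantize(a, scale), quantize(b, scale)])
assert np.array_equal(lhs, rhs)
```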

for the two bottlenecks on the right there is some PointWise operation added for some strange reason.

Both of the attached files are screenshots of the ONNX file, not the TRT graph. The screenshots are fuzzy, so I am not sure what those pointwise layers are. Could you upgrade your TRT to the latest 10.0 to see if anything changes?

Furthermore from the first conv to the second conv there is a scale introduced, what is the reason for this?

Because your ONNX has the pattern Conv -> Q/DQ -> Split -> Q/DQ -> Conv, it is fused as Conv(INT8-OUT) -> DQ+Q -> Conv(INT8-IN); the dangling DQ+Q in the middle creates a Scale layer. Could you try removing the Q/DQ pair after the first Conv?
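
Sketched in the same onnx-graphsurgeon style as above (illustrative only, reusing the `graph` loaded earlier), the removal rewrites Conv -> Q/DQ -> Split into Conv -> Split, so the dangling DQ+Q can no longer surface as a standalone Scale layer:

```python
for q in [n for n in graph.nodes if n.op == "QuantizeLinear"]:
    producers = q.inputs[0].inputs
    if not producers or producers[0].op != "Conv":
        continue
    conv_out = q.inputs[0]
    for dq in [n for n in q.outputs[0].outputs if n.op == "DequantizeLinear"]:
        for consumer in list(dq.outputs[0].outputs):
            if consumer.op == "Split":
                for i, t in enumerate(consumer.inputs):
                    if t is dq.outputs[0]:
                        consumer.inputs[i] = conv_out  # bypass the Q/DQ pair

graph.cleanup().toposort()
```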

Michelvl92 (Author) commented May 14, 2024

Could you try removing the Q/DQ between the Mul and the Add, and leaving the Q/DQ on the other branch of the Add?

I tried this, but it does not get fused.

The Q node commutes with Concat/Slice, so usually we don't need to handle their inputs specially.

So does this mean that no Q/DQ node is required at the inputs? Are Q/DQ nodes only required for computational operations?

Because your ONNX has the pattern Conv -> Q/DQ -> Split -> Q/DQ -> Conv, it is fused as Conv(INT8-OUT) -> DQ+Q -> Conv(INT8-IN); the dangling DQ+Q in the middle creates a Scale layer. Could you try removing the Q/DQ pair after the first Conv?

So, as you can see, from the pointwise on the left to the last concat/conv there is again a scale introduced. What is the reason for this, and how is this fused? I assumed that you need to add a Q/DQ node after the add operation to get an int8 output, but a scale is still introduced; how can I avoid this? Or is this again because of the Q/DQ before the split?

Both of the attached files are screenshots of the ONNX file, not the TRT graph. The screenshots are fuzzy, so I am not sure what those pointwise layers are. Could you upgrade your TRT to the latest 10.0 to see if anything changes?

Both ONNX and engine graphs were already added in the first post; see the relevant files. I cannot try TRT 10 because 8.6 is the latest version supported by the latest JetPack release for the Jetson AGX Orin.

zerollzeng added the triaged label May 17, 2024
ttyio (Collaborator) commented May 22, 2024

Adding @nzmora for visibility.

Both ONNX and engine graphs were already added in the first post

When I open your attachments, both images are the same.

So does this mean that no Q/DQ node is required at the inputs? Are Q/DQ nodes only required for computational operations?

No Q/DQ is required for Concat/Slice since they commute with Q/DQ.

What is the reason for this, and how is this fused?

I am not sure if this is an 8.6-only issue. Have you tried running your model on TRT 10.0 with a desktop GPU?

Michelvl92 (Author) commented
@ttyio, just a small update from my side (with updated onnx and engine graphs included).

As you can see from the graph, most of the (double) Q/DQ operations are fixed and/or properly placed, which resulted in a reduction in reformat layers and a model that is, in my opinion, fully int8.

But there are 3 strange things happening in the model that I do not understand:

  1. In the skip connections, scale operations are introduced (which were not there before). They take up to 14% of the latency budget. Is there a possibility to reduce this, or is this correct?

I am not sure if this is an 8.6-only issue. Have you tried running your model on TRT 10.0 with a desktop GPU?

I need to deploy models on Jetson systems, which only support TRT 8.6. How would testing on TRT 10.0 help, since it would not be possible to deploy the model on a Jetson, if I am correct?

  2. In the ONNX graph there is one skip connection, but in the engine graph there are two; what is the reason for this? It does not seem to harm model accuracy, but both introduce a scale operation, which adds extra latency.
  3. For some strange reason, in a slightly bigger model with the same architecture, reformat operations are introduced in the first bottleneck layer (though everything is still int8) that are not introduced in the smaller model. Looking at the ONNX graph I do not see anything strange; what is the reason for this? Am I doing something wrong?

model graph onnx
model graph TRT
