
Questions about input and output shape in model configuration when batch size is 1 #7227

Closed
jackylu0124 opened this issue May 16, 2024 · 3 comments

@jackylu0124

Hey all, I have a question regarding the input and output shape settings in the model configuration file. I have a model that takes images in NCHW layout (specifically C = 3, with H and W being variable positive integers) and outputs tensors in the same NCHW layout (again C = 3, with variable H and W). Because the model is relatively large and my GPU memory is limited, I want to use a batch size of 1 for both the input and the output tensor.

Based on my understanding of the following paragraph in the documentation https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html:

Input and output shapes are specified by a combination of max_batch_size and the dimensions specified by the input or output dims property. For models with max_batch_size greater-than 0, the full shape is formed as [ -1 ] + dims. For models with max_batch_size equal to 0, the full shape is formed as dims.

my questions are:
1. Are the following 2 configurations for the model input and output shapes equivalent, and do they have identical effects for specifying the input and output shape for the model?
2. From the inference server's point of view, are these two configurations treated in any different ways, or are they indistinguishable?
3. If the following 2 configurations are equivalent, are they considered to be equivalent by all the backends supported by the Triton Inference Server (e.g. onnxruntime backend, python backend, etc.)?

Configuration 1:

max_batch_size: 1

input [
    {
        name: "input"
        data_type: TYPE_FP32
        dims: [3, -1, -1]
    }
]

output [
    {
        name: "output"
        data_type: TYPE_FP32
        dims: [3, -1, -1]
    }
]

Configuration 2:

max_batch_size: 0

input [
    {
        name: "input"
        data_type: TYPE_FP32
        dims: [1, 3, -1, -1]
    }
]

output [
    {
        name: "output"
        data_type: TYPE_FP32
        dims: [1, 3, -1, -1]
    }
]

Thank you very much for your time and help in advance!

@statiraju

@tanmayv25 can you help here?

@tanmayv25

  1. Are the following 2 configurations for the model input and output shapes equivalent, and do they have identical effects for specifying the input and output shape for the model?

Both configurations are identical. In both cases the client has to provide an input with shape [1, 3, -1, -1], where each -1 can be any positive integer, and the received output will have shape [1, 3, -1, -1].

They have an identical impact in that the Triton core will forward requests with input shape [1, 3, -1, -1] to the backend and receive an output of shape [1, 3, -1, -1] from the backend.
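For illustration, here is a minimal client-side sketch (assuming a hypothetical model name "my_model", the Triton HTTP endpoint on localhost:8000, and the tritonclient Python package); under either configuration the request tensor has the same [1, 3, H, W] shape:

import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint (assumed to be localhost:8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# A single NCHW image with batch dimension 1; H and W are arbitrary here.
image = np.random.rand(1, 3, 512, 512).astype(np.float32)

# Tensor names match the config above ("input" / "output").
infer_input = httpclient.InferInput("input", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# "my_model" is a placeholder for the actual model name.
response = client.infer(model_name="my_model", inputs=[infer_input])
output = response.as_numpy("output")  # NCHW output with batch dimension 1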

  2. From the inference server's point of view, are these two configurations treated in any different ways, or are they indistinguishable?

From the server's point of view they are identical. The only difference arises when the dynamic_batching field is enabled: with max_batch_size = 1 and dynamic batching set, a request goes into an additional queue and is picked up as soon as an instance is available for execution. When dynamic batching is disabled, there is no difference even in the request control flow.
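As a rough sketch (the queue delay value below is just an illustrative placeholder), enabling dynamic batching on Configuration 1 would look like this; it does not apply to Configuration 2, since the dynamic batcher is only usable for models with max_batch_size greater than 0:

max_batch_size: 1

dynamic_batching {
    max_queue_delay_microseconds: 100
}

# input and output blocks unchanged from Configuration 1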

  3. If the following 2 configurations are equivalent, are they considered to be equivalent by all the backends supported by the Triton Inference Server (e.g. onnxruntime backend, python backend, etc.)?

They will be completely identical from the backend's perspective. In fact, the max_batch_size value is not even propagated to the backend during inference execution. However, a backend may, during auto-completion of the model config, enable the dynamic_batching setting, which can introduce an extra queue transaction in the control flow.
To my knowledge, none of the standard backends do that for max_batch_size = 1 (only when max_batch_size > 1). The TensorFlow backend's behavior is documented here: https://github.com/triton-inference-server/tensorflow_backend?tab=readme-ov-file#dynamic-batching

@jackylu0124

@tanmayv25 Got it, thank you very much for your detailed explanation and clarification! I really appreciate it!
