GPU Isolation and flexible deployment strategies [FEA] #243
Comments
This also ties in loosely to what @MMelQin was mentioning about trying to have multiple models deployed in a MAP.
This is definitely a good request for a much-needed capability, though it applies more to a deployment platform. For example, Clara inference operators/applications use a remote Triton Inference Service, which supports model-to-GPU affinity, number of instances per model, etc., so the Triton configuration can be used to distribute model instances across GPUs. The App SDK does have an issue for utilizing a remote Triton inference service, #212. As for multi-model support, #244, when all the inference operators use in-proc inference, it is possible to
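For reference, the model-to-GPU affinity and per-model instance count mentioned above are expressed in Triton's per-model configuration through the `instance_group` field. Below is a minimal sketch of what such a `config.pbtxt` could look like; the model name, platform, instance counts, and GPU indices are illustrative assumptions, not settings from this project.

```
name: "example_model"            # hypothetical model name
platform: "pytorch_libtorch"
max_batch_size: 8

# Run two instances of this model on GPU 0 and one instance on GPU 1.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
```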
Is your feature request related to a problem? Please describe.
If we consider a few scenarios where we need
In all these examples, we want to assign a GPU to a model and do not want the inference service to take over the entire system. If we can isolate a GPU and pin it to a particular deployment, it will be very useful. In addition, this will also future-proof our deployments. Imagine a scenario where we get new GPUs with a new architecture, and the deployment, the model, and the PyTorch version do not work with that architecture. In such a case, we can add more GPUs without disturbing the existing deployments.
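As a rough illustration of what "pinning a GPU to a deployment" could look like at the process level (independent of any specific SDK support), a deployment can restrict itself to one physical GPU via `CUDA_VISIBLE_DEVICES` before CUDA is initialized. This is only a sketch under that assumption; the GPU index and model file are hypothetical.

```python
import os

# Expose only physical GPU 1 to this process; must be set before CUDA
# is initialized (i.e. before importing torch or loading the model).
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1")  # hypothetical GPU index

import torch

# Inside this process, the single visible GPU is addressed as cuda:0.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = torch.jit.load("model.ts", map_location=device)  # hypothetical model file
model.eval()
```

At the container level, a similar effect can be achieved by limiting which devices are exposed to the MAP container (e.g. via the NVIDIA Container Toolkit's device selection), so each deployment only ever sees its assigned GPU.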
Describe alternatives you've considered
@slbryson has tried GPU isolation using the Clara CLI tools.
Additional context