
GPU Isolation and flexible deployment strategies [FEA] #243

Open

vikashg opened this issue Jan 21, 2022 · 2 comments

Labels: enhancement (New feature or request)

vikashg (Collaborator) commented Jan 21, 2022

Is your feature request related to a problem? Please describe.
Consider a few scenarios where we need to:

  • deploy multiple models for a single application,
  • deploy multiple models on the same machine across GPUs of different architectures,
  • lock in resources for deployment so that training can use the remaining resources.

In all these examples, we want to assign a GPU to a model rather than let the inference service take up the entire system. Being able to isolate a GPU and pin it to a particular deployment would be very useful. It would also future-proof our deployments: imagine we add new GPUs with a new architecture, and the deployed model or its PyTorch version does not work on that architecture. In that case we could add the new GPUs without disturbing the existing deployments.
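
A minimal sketch of one way to pin a deployment to a single GPU today at the process level, assuming each deployment runs in its own process and a hypothetical GPU_ID environment variable chosen per deployment (CUDA_VISIBLE_DEVICES must be set before CUDA is initialized, i.e. before importing torch):

```python
import os

# Hypothetical per-deployment setting; must be set before any CUDA
# initialization (e.g. before importing torch) to take effect.
gpu_id = os.environ.get("GPU_ID", "0")
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id

import torch

# Inside this process, the pinned GPU is now visible as cuda:0,
# so the rest of the app can keep using the default device.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = torch.jit.load("model.ts", map_location=device)  # path is illustrative
```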

Describe alternatives you've considered
@slbryson has tried GPU isolation using the Clara CLI tools.

Additional context

@vikashg added the enhancement (New feature or request) label on Jan 21, 2022
vikashg (Collaborator, Author) commented Jan 21, 2022

This also ties in loosely with what @MMelQin mentioned about supporting multiple models deployed in a MAP.

MMelQin (Collaborator) commented Jan 22, 2022

This is definitely a good request for a much-needed capability, though it is more a concern for the deployment platform. For example, Clara inference operators/applications use the remote Triton Inference Server, which supports model-to-GPU affinity, the number of instances per model, etc., so the Triton configuration can be used to distribute model instances across GPUs.
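
For reference, a minimal sketch of the relevant part of a Triton model config.pbtxt (the instance count and GPU index are illustrative) that pins instances of a model to a specific GPU:

```
# config.pbtxt (sketch): run two instances of this model, pinned to GPU 0
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```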

The App SDK already has an issue for utilizing a remote Triton inference service: #212.

As for multi-model support (#244), when all the inference operators use in-proc inference, it is possible to

  • link the operators in the app (via application.add_flow()) so that only one inference operator runs at any given time and the GPU is not overloaded;
  • potentially enhance the model-loading logic in the App SDK base Application to use a specific GPU if so configured (see the sketch after this list), although this becomes moot if remote Triton is used.
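
A rough sketch of what the second option could look like for in-proc PyTorch inference, assuming a hypothetical gpu_id setting passed to the operator; this is not an existing App SDK API:

```python
import torch

class PinnedInferenceOperator:
    """Illustrative only: loads and runs a TorchScript model on a configured GPU."""

    def __init__(self, model_path: str, gpu_id: int = 0):
        # Fall back to CPU if the requested GPU is not available.
        if torch.cuda.is_available() and gpu_id < torch.cuda.device_count():
            self.device = torch.device(f"cuda:{gpu_id}")
        else:
            self.device = torch.device("cpu")
        self.model = torch.jit.load(model_path, map_location=self.device)
        self.model.eval()

    @torch.no_grad()
    def compute(self, image: torch.Tensor) -> torch.Tensor:
        # Move the input to the pinned device, run inference, return result on CPU.
        return self.model(image.to(self.device)).cpu()
```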
