💡[feat] Track each task GPU utilization (and other information) and display it on the WebUI #5043

Open
tshu-w opened this issue Sep 16, 2022 · 7 comments
Labels
feature Feature requests

Comments

@tshu-w

tshu-w commented Sep 16, 2022

Describe the problem

It would be very convenient for cluster managers and users if the WebUI could display system information for each task.

Here are screenshots from wandb:

[Screenshot: Screen Shot 2022-09-16 at 13 49 47]

[Screenshot: Screen Shot 2022-09-16 at 13 49 56]

Describe the solution you'd like

Users can view these charts in the WebUI.

Describe alternatives you've considered

No response

Additional context

cc @luyaojie

@tshu-w tshu-w added the feature Feature requests label Sep 16, 2022
@tshu-w tshu-w changed the title from "💡[feat] Track task GPU utilization (and other information) and display it on the WebUI" to "💡[feat] Track each task GPU utilization (and other information) and display it on the WebUI" Sep 16, 2022
@tshu-w
Author

tshu-w commented Sep 17, 2022

Hi @ioga, if I understand correctly, system metrics and profiling only work with experiments that use PyTorchTrial or other wrappers? To help our colleagues get started quickly, we currently run tasks mainly through commands or notebooks. Are the SYSTEM METRICS tracked in this case? I cannot find them in the WebUI.

@ioga
Contributor

ioga commented Sep 19, 2022

@tshu-w correct, today it needs to be a PyTorchTrial or TFKerasTrial. There's no metrics collection for commands or notebooks, but I can totally see this being useful as a future feature.

@vishnu2kmohan
Contributor

@tshu-w Note: You can set up Prometheus and Grafana to monitor the usage of all Determined workloads: https://docs.determined.ai/latest/integrations/prometheus/prometheus.html

@tshu-w
Author

tshu-w commented Sep 23, 2022

Thanks, @vishnu2kmohan, is it possible to monitor each GPU and know which task is running on it intuitively with Prometheus and Grafana? Our intent is to monitor GPU utilization per task to ensure that resources are not wasted.

@huangfuhuijie

> Thanks, @vishnu2kmohan, is it possible to monitor each GPU and know which task is running on it intuitively with Prometheus and Grafana? Our intent is to monitor GPU utilization per task to ensure that resources are not wasted.

Hi, I also ran into this situation. Have you figured out whether Prometheus and Grafana can be used to monitor GPU utilization for every task?

@azhou-determined
Contributor

See https://docs.determined.ai/latest/integrations/prometheus/prometheus.html.

We provide a pre-configured Grafana panel for monitoring hardware metrics including GPU utilization. We currently only provide preset filters for tags and resource-pool.

However, we do surface various container/task mappings through a Prometheus API endpoint (prom/det-state-metrics), so with some Grafana/PromQL fiddling, it's possible to modify our Grafana panel to add custom queries/filters (on task ID for example).
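To illustrate the idea of combining per-GPU hardware metrics with a GPU-to-task mapping, here is a minimal Python sketch. It is not Determined's API or the actual Grafana panel logic: the function names, the `(gpu_uuid, utilization)` sample shape, and the `gpu_uuid -> task_id` dictionary are all hypothetical stand-ins for whatever your Prometheus queries return. It only shows the join itself, plus a simple threshold check for spotting tasks that hold GPUs without using them much.

```python
from collections import defaultdict
from statistics import mean


def utilization_by_task(gpu_samples, gpu_to_task):
    """Average GPU utilization (percent) per task ID.

    gpu_samples: iterable of (gpu_uuid, utilization_percent) pairs
                 (hypothetical shape, e.g. parsed from a metrics query).
    gpu_to_task: dict mapping gpu_uuid -> task_id (hypothetical shape,
                 e.g. derived from a container/task mapping).
    GPUs with no known task are grouped under the key None.
    """
    per_task = defaultdict(list)
    for gpu_uuid, util in gpu_samples:
        per_task[gpu_to_task.get(gpu_uuid)].append(util)
    return {task: mean(values) for task, values in per_task.items()}


def underused_tasks(task_utilization, threshold=20.0):
    """Return sorted task IDs whose average utilization is below threshold."""
    return sorted(
        task
        for task, util in task_utilization.items()
        if task is not None and util < threshold
    )


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    samples = [("GPU-a", 90.0), ("GPU-b", 10.0), ("GPU-c", 30.0)]
    mapping = {"GPU-a": "task-1", "GPU-b": "task-2", "GPU-c": "task-2"}
    averages = utilization_by_task(samples, mapping)
    print(averages)
    print(underused_tasks(averages, threshold=25.0))
```

In a Grafana setup the same join would more likely be expressed directly in PromQL; the sketch just makes the grouping step explicit.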

5 participants