💡[feat] Track each task GPU utilization (and other information) and display it on the WebUI #5043

tshu-w · 2022-09-16T05:51:45Z

Describe the problem

It would be very convenient for cluster managers and users if the WebUI could display system information for each task.

Here are screenshots from wandb:

Describe the solution you'd like

Users can view these charts on the WebUI.

Describe alternatives you've considered

No response

Additional context

cc @luyaojie

ioga · 2022-09-16T15:29:55Z

Hello, please check out our profiling feature:

tshu-w · 2022-09-17T05:38:56Z

Hi @ioga, if I understand correctly, system metrics and profiling only work with experiments that use PyTorchTrial or other wrappers? To help our colleagues get started quickly, we currently run tasks mainly through commands or notebooks. Are system metrics tracked in this case? I cannot find them in the WebUI.

ioga · 2022-09-19T23:46:30Z

@tshu-w correct, today it needs to be a PyTorchTrial or TFKerasTrial. There is no metrics collection for commands or notebooks, but I can totally see this being useful as a future feature.

vishnu2kmohan · 2022-09-22T16:24:50Z

@tshu-w Note: You can set up Prometheus and Grafana to monitor the usage of all Determined workloads: https://docs.determined.ai/latest/integrations/prometheus/prometheus.html

tshu-w · 2022-09-23T01:33:06Z

Thanks, @vishnu2kmohan, is it possible to monitor each GPU and know which task is running on it intuitively with Prometheus and Grafana? Our intent is to monitor GPU utilization per task to ensure that resources are not wasted.

huangfuhuijie · 2023-10-23T11:50:39Z

> Thanks, @vishnu2kmohan, is it possible to monitor each GPU and know which task is running on it intuitively with Prometheus and Grafana? Our intent is to monitor GPU utilization per task to ensure that resources are not wasted.

Hi, I ran into the same situation. Have you figured out whether Prometheus and Grafana can be used to monitor GPU utilization for each task?

azhou-determined · 2023-10-23T21:05:53Z

See https://docs.determined.ai/latest/integrations/prometheus/prometheus.html.

We provide a pre-configured Grafana panel for monitoring hardware metrics including GPU utilization. We currently only provide preset filters for tags and resource-pool.

However, we do surface various container/task mappings through a Prometheus API endpoint (prom/det-state-metrics), so with some Grafana/PromQL fiddling, it's possible to modify our Grafana panel to add custom queries/filters (on task ID for example).
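As a rough illustration of the kind of custom query described above, here is a minimal sketch that builds a PromQL expression joining a GPU utilization series to a GPU-to-task mapping series. Every name in it is an assumption for illustration only: `DCGM_FI_DEV_GPU_UTIL` is the utilization gauge exported by NVIDIA's dcgm-exporter (your setup may use a different exporter), while `det_gpu_task_mapping`, `gpu_uuid`, and `task_id` are hypothetical placeholders, not the actual metric or label names served by `prom/det-state-metrics`. Check that endpoint's output for the real names, and note that the join only works if both series carry the same join label.

```python
from typing import Optional


def gpu_util_by_task_query(
    util_metric: str = "DCGM_FI_DEV_GPU_UTIL",   # assumed: dcgm-exporter gauge
    mapping_metric: str = "det_gpu_task_mapping",  # hypothetical placeholder name
    join_label: str = "gpu_uuid",                  # hypothetical shared label
    task_label: str = "task_id",                   # hypothetical task label
    task_id: Optional[str] = None,
) -> str:
    """Build a PromQL query for average GPU utilization per task.

    The mapping series is assumed to have value 1 and to carry both the
    join label and the task label, so multiplying by it leaves the
    utilization value unchanged while copying the task label across.
    """
    # Optionally restrict the mapping series to a single task.
    selector = f'{{{task_label}="{task_id}"}}' if task_id else ""
    joined = (
        f"{util_metric} * on ({join_label}) "
        f"group_left ({task_label}) {mapping_metric}{selector}"
    )
    # Average over all GPUs that belong to the same task.
    return f"avg by ({task_label}) ({joined})"


if __name__ == "__main__":
    print(gpu_util_by_task_query())
    print(gpu_util_by_task_query(task_id="my-task"))
```

The resulting string could be pasted into a Grafana panel query or the Prometheus expression browser once the placeholder names are replaced with the ones your deployment actually exposes.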

tshu-w added the feature Feature requests label Sep 16, 2022

tshu-w changed the title ~~💡[feat] Track task GPU utilization (and other information) and display it on the WebUI~~ 💡[feat] Track each task GPU utilization (and other information) and display it on the WebUI Sep 16, 2022
