Polling Nvidia temperature keeps GPU awake #1291

Open
1 task done
flukejones opened this issue Aug 31, 2023 · 17 comments
Labels
bug Something isn't working the way that is expected.

Comments

@flukejones

Checklist

Describe the feature request

I noticed in a recent update the sensors tab (on linux) gained the dGPU temperature. On hybrid systems this is an issue as it causes the dGPU to stay awake and drain battery.

I can't see any easy option to disable this one sensor.

@flukejones flukejones added the feature Requests for a new feature. label Aug 31, 2023
@ClementTsang
Owner

This seems more like a bug - could you fill in the bug report form?

@ClementTsang ClementTsang added bug Something isn't working the way that is expected. feature Requests for a new feature. and removed feature Requests for a new feature. labels Aug 31, 2023
@ClementTsang
Owner

ClementTsang commented Aug 31, 2023

That said, I could also look into adding GPU filtering, yes. Curious what that might look like though - would filtering by PCI info seem too confusing?

@ClementTsang
Owner

Alternatively, I could filter by name + add options to disable any GPU activities for certain GPU names, in addition to more granular filtering for other widgets. Does the current dGPU show up by name in the temperatures tab? If you have a screenshot, that would be helpful.

@jamartin9
Contributor

jamartin9 commented Aug 31, 2023

> filter by name + add options to disable any GPU activities for certain GPU names

I like the idea. It should probably be done by index, to avoid the device initialization that nvml's device_by_index would otherwise trigger just to get the name.

Alternatively, a whitelist-based approach could support uuid/pcie names pretty easily via device_by_pci_bus_id and device_by_uuid.

Edit: Short term, build without the gpu feature flag. PR 1276 should allow disabling of the gpu via config until filtering is done. This was probably introduced around 0.7.0.
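For illustration, the index-based filtering suggested above could be sketched roughly as follows. This is only a minimal sketch in Python, not bottom's actual code: `collect_temperatures`, `GpuHandle`, `open_device` and `ignored_indices` are all hypothetical names, and `open_device` merely stands in for a call such as nvml's device_by_index. The point is that an ignored index is skipped before any handle is opened.

```python
from typing import Callable, Dict, Iterable, Protocol


class GpuHandle(Protocol):
    """Hypothetical handle type; stands in for whatever the GPU library returns."""

    def temperature(self) -> float: ...


def collect_temperatures(
    device_count: int,
    open_device: Callable[[int], GpuHandle],
    ignored_indices: Iterable[int],
) -> Dict[int, float]:
    """Read temperatures, never opening a handle for an ignored index."""
    ignored = set(ignored_indices)
    readings: Dict[int, float] = {}
    for index in range(device_count):
        if index in ignored:
            # Skip before opening: the assumption here is that opening the
            # handle is the step that initializes (and may wake) the device.
            continue
        readings[index] = open_device(index).temperature()
    return readings
```

With a config entry that ignores index 1, `open_device` would only ever be called for the remaining indices.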

@jamartin9 jamartin9 mentioned this issue Nov 19, 2023
10 tasks
@yump

yump commented Dec 5, 2023

Some/all AMD GPUs are also affected. I have an RX580 that doesn't drive any monitors, and reading the hwmons wakes it up and keeps it awake. Unfortunately, it seems the device/power_state file is the only thing I can read without waking the GPU, so in my fan control script I had to work around this by modeling the GPU's idle poweroff logic.

The model is an ON/WARM/OFF state machine. ON reads sensors and utilization, and transitions to WARM if utilization stays at 0 for some time. WARM reads no sensors or utilization; it transitions to OFF if the power_state file changes to D3hot, or back to ON if the device is still in D0 after the elapsed time exceeds a value greater than the GPU's idle power-off timeout. OFF transitions to ON if power_state shows D0.
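A rough sketch of that state machine, in Python, might look like the following. It is not the actual fan control script: the class name `GpuPoller`, the field names and both timeout values are assumptions made for illustration, and the caller is expected to read power_state itself and pass utilization only while `should_read_sensors()` is true.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    ON = "on"
    WARM = "warm"
    OFF = "off"


@dataclass
class GpuPoller:
    # Both timeouts are assumed values, in seconds.
    idle_before_warm: float = 10.0  # zero-utilization time before polling stops
    warm_timeout: float = 30.0      # should exceed the GPU's idle power-off timeout
    state: State = State.ON
    idle_since: Optional[float] = None
    warm_since: float = 0.0

    def should_read_sensors(self) -> bool:
        """Only the ON state is allowed to touch sensors or utilization."""
        return self.state is State.ON

    def step(self, now: float, power_state: str,
             utilization: Optional[float] = None) -> State:
        """Advance the machine using only power_state (and utilization while ON)."""
        if self.state is State.ON:
            if utilization == 0:
                if self.idle_since is None:
                    self.idle_since = now
                if now - self.idle_since >= self.idle_before_warm:
                    self.state = State.WARM
                    self.warm_since = now
            else:
                self.idle_since = None  # any activity resets the idle timer
        elif self.state is State.WARM:
            if power_state == "D3hot":
                self.state = State.OFF
            elif power_state == "D0" and now - self.warm_since > self.warm_timeout:
                # Still awake well past the idle timeout: something else is
                # keeping the GPU up, so resume polling.
                self.state = State.ON
                self.idle_since = None
        elif self.state is State.OFF:
            if power_state == "D0":
                self.state = State.ON
                self.idle_since = None
        return self.state
```

The sketch only checks for D3hot, as in the description; a board that reports D3cold would need that case added.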

Theoretically you could also see D3cold, which saves even more power, but the motherboard has to support it somehow and mine seemingly doesn't.

Hmm... It seems that this should perhaps be fixed in the kernel. I have written a note to myself to report this to the hwmon mailing list/bug tracker.

@ClementTsang
Owner

ClementTsang commented Dec 11, 2023

bottom actually already does a fairly simple check with device/power_state, and only grabs further sensor data if the file either does not exist or reads D0/unknown. So yeah, I might need to make the checks a bit more sophisticated... that, or my implementation is bugged. It's a bit frustrating too, since I don't think I have any way to debug this at the moment.
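As a rough illustration of the check described above (a Python sketch, not bottom's actual implementation), the logic amounts to something like the following. The function name `should_read_sensors` and the `device_dir` parameter are hypothetical; `device_dir` is assumed to be a hwmon entry's `device` directory.

```python
from pathlib import Path


def should_read_sensors(device_dir: Path) -> bool:
    """Decide whether it looks safe to read further sensor files.

    Sensors are read only if device/power_state is missing, or if it
    reports D0 or unknown; any other state is treated as asleep.
    """
    try:
        value = (device_dir / "power_state").read_text().strip()
    except FileNotFoundError:
        # No power_state file at all: assume the device has no runtime
        # power management and read it as usual.
        return True
    return value in ("D0", "unknown")
```

A caller would evaluate this once per device on every refresh and skip the device's temperature files whenever it returns false.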

@ClementTsang
Owner

If anyone can check, would be interested to see if a simple logic change in #1355 helps with it.

@flukejones
Author

@ClementTsang I've tried that branch. Is it supposed to show Nvidia/GPU temps if the GPU is already active? Currently it does not.

@ClementTsang
Owner

The change would hide any entry for any device that's asleep; if it turns back on though in theory it should show up again...

@ClementTsang
Owner

Mostly also just curious whether it stops the GPU from waking, or if there's more that I need to do in that part first.

@flukejones
Author

> Mostly also just curious whether it stops the GPU from waking, or if there's more that I need to do in that part first.

Seems like I don't.

@ClementTsang
Owner

Hm, so the GPU is still waking up?

@flukejones
Author

Sorry mate. It looks like I had a brainfart.. The dgpu appears to not be waking.

@ClementTsang
Owner

ClementTsang commented Dec 20, 2023

Just merged #1355, could you see in main if the output looks reasonable for you and doesn't wake up the dgpu? Thanks!

@ClementTsang ClementTsang removed the feature Requests for a new feature. label Dec 20, 2023
@flukejones
Author

It doesn't wake it, but also does not show details if it is awake? It may be worth reading through this also https://gitlab.com/mission-center-devs/mission-center/-/issues/30#note_1697130114

@ClementTsang
Owner

Hmm... that's weird, thanks for the link. Also just curious, could you provide screenshots of what the temp table looks like on stable and on main now? Thanks!

@ClementTsang
Owner

ClementTsang commented Jan 5, 2024

🤦 just realized that I never changed the sleep checks for nvidia GPUs... let me try looking at that too.

4 participants