ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error' #668
Comments
This means you have no GPUs available. Can you run
This seems to be an issue with your environment/system then, unfortunately.
@jldroid19 did you figure the issue out?
@psinger I have not.
Are you running this in Docker?
Yes, I am running it using Docker. It's strange because we can start a run on a dataset with an expected finish time of 5 days, and it will complete. We then start another experiment, and 3 hours later the container stops, causing the experiment to fail. After a quick docker restart the app is back up and running, but the training that was in progress is lost.
I stumbled upon this recently; it might be related: NVIDIA/nvidia-container-toolkit#465 (comment). There seems to be an issue with GPUs suddenly disappearing inside Docker containers.
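If it helps with debugging, here is a minimal sketch of a check that could be run periodically inside the container to see whether the GPUs are still visible. It assumes `nvidia-smi` is on the PATH in the container and uses `nvidia-smi -L` to list devices; the function names `output_indicates_gpus` and `gpus_visible` are made up for this example and are not part of LLM Studio.

```python
import shutil
import subprocess

# Message seen in the error from this issue when NVML cannot be initialized.
NVML_ERROR = "Failed to initialize NVML"


def output_indicates_gpus(returncode: int, output: str) -> bool:
    """Return True if an nvidia-smi result looks healthy.

    Healthy here means: exit code 0, some output, and no NVML error text.
    """
    return returncode == 0 and output.strip() != "" and NVML_ERROR not in output


def gpus_visible(timeout: float = 10.0) -> bool:
    """Run `nvidia-smi -L` and report whether GPUs still appear to be visible."""
    if shutil.which("nvidia-smi") is None:
        # Assumption: no nvidia-smi binary means no usable GPU in this container.
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return output_indicates_gpus(result.returncode, result.stdout + result.stderr)


if __name__ == "__main__":
    print("GPUs visible:", gpus_visible())
```

Logging the result of `gpus_visible()` with a timestamp might show whether the GPUs drop out at the moment the experiment fails, which would point at the container runtime rather than at the app.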
🐛 Bug
q.app
q.user
q.client
report_error: True
q.events
q.args
report_error: True
stacktrace
Traceback (most recent call last):
  File "/workspace/./llm_studio/app_utils/handlers.py", line 78, in handle
    await home(q)
  File "/workspace/./llm_studio/app_utils/sections/home.py", line 66, in home
    stats.append(ui.stat(label="Current GPU load", value=f"{get_gpu_usage():.1f}%"))
  File "/workspace/./llm_studio/app_utils/utils.py", line 1949, in get_gpu_usage
    all_gpus = GPUtil.getGPUs()
  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'
Error
None
Git Version
fatal: not a git repository (or any of the parent directories): .git
To Reproduce
I'm not sure why this is happening. It is hard to reproduce.
LLM Studio version
v1.4.0-dev