BUG: Unexpected error from cudaGetDeviceCount() with official docker image v0.11.0 #1477
Comments
@infinitr0us Could you please help me test this:
And then test it? On my own machine with two GPUs, I can use Xinference normally with the above method.
@infinitr0us Could you use this:
to try again?
Thanks a lot for your response. I tried to install torchvision 0.17.1 in the official inference docker image v0.11.0 through
Oops... it seems that the cudaGetDeviceCount() error is still there.
@ChengjieLi28 I tried with LocalAI again, and it seems that their docker image does not use the local torch environment at all...
Describe the bug
Hi, I was trying to deploy the Docker image v0.11.0 on my machine, which has a GPU and drivers (CUDA 12.0) installed. However, an error always pops up during the initialization of the Docker container:
"/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)"
Actually, I was able to see the hardware information being correctly loaded in the container Web UI:
But I was never able to execute models on the GPU. I am fairly sure that my GPU and CUDA environment are working (also tested with no issue on the LocalAI docker image). I was wondering whether this is a torch library related issue or an Xinference related issue. I would appreciate any help and am willing to provide more logs if needed.
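One way to separate a torch problem from an Xinference problem is to query torch directly inside the container, without starting Xinference. The following is only a minimal sketch: `cuda_report` is a hypothetical helper name, not part of Xinference, and it relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`. If the same CUDA initialization warning shows up here, the problem is likely below Xinference (torch build, driver, or container runtime).

```python
import importlib.util
import warnings


def cuda_report():
    """Collect basic torch/CUDA facts (hypothetical helper, not an Xinference API)."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # CUDA version this torch build was compiled against (None for CPU-only builds).
    report["built_for_cuda"] = torch.version.cuda
    # Capture warnings so a CUDA initialization warning ends up in the report.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        report["cuda_available"] = torch.cuda.is_available()
    report["warnings"] = [str(w.message) for w in caught]
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Comparing `built_for_cuda` with the CUDA version reported by the host driver may help show whether the image expects a newer driver than the one installed.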
To Reproduce
To help us to reproduce this bug, please provide information below:
/opt/conda/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/opt/conda/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2024-05-11 21:55:37,671 xinference.core.supervisor 47 INFO Xinference supervisor 0.0.0.0:15251 started
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
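The key detail in the log above is the numeric CUDA error code. As far as I know, code 804 corresponds to `cudaErrorCompatNotSupportedOnDevice` in the CUDA runtime API, which matches the "forward compatibility was attempted on non supported HW" text. A small sketch for pulling the code and message out of such a log line follows; the helper name `extract_cuda_error` and the regular expression are illustrative assumptions based only on the message format shown in this issue.

```python
import re

# Matches the "Error <code>: <message> (Triggered internally ..." part of the
# torch warning quoted in this issue. The pattern is an assumption based on
# that one message format.
CUDA_ERROR_RE = re.compile(r"Error (\d+): (.+?) \(Triggered internally")


def extract_cuda_error(log_text):
    """Return (code, message) for the first CUDA error in log_text, or None."""
    match = CUDA_ERROR_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


sample = (
    "UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). "
    "Did you run some cuda functions before calling NumCudaDevices() that might "
    "have already set an error? Error 804: forward compatibility was attempted on "
    "non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)"
)
print(extract_cuda_error(sample))
```

Searching container logs for the code this way makes it easier to tell this failure apart from other CUDA initialization errors.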
Expected behavior
Expected to run the container smoothly with CUDA.
Additional context
Thanks again for your great project. I am willing to provide more info/logs if necessary.