
BUG: Unexpected error from cudaGetDeviceCount() with official docker image v0.11.0 #1477

Closed
infinitr0us opened this issue May 11, 2024 · 5 comments

@infinitr0us commented May 11, 2024
Describe the bug

Hi, I was trying to deploy the Docker image v0.11.0 on my machine, which has a GPU and drivers (CUDA 12.0) installed. However, an error always appears during initialization of the docker container:
"/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)"

Actually, I was able to see the hardware information being correctly loaded in the container Web UI:
[screenshot: Web UI showing the GPU hardware information]
But I was never able to execute models on the GPU. So I am pretty sure my GPU and CUDA environment is working (it was also tested with no issue on the LocalAI docker image). I was wondering whether this is a torch library issue or an Xinference related issue? I would appreciate any help and am willing to provide more logs if needed.
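For what it's worth, a quick way to narrow down where the failure sits is to compare what the driver and PyTorch each report from inside the container. This is only a diagnostic sketch; the container name `xinference` is an assumption:

```shell
# Diagnostic sketch (container name "xinference" is assumed).
# Driver-level view of the GPU:
docker exec xinference nvidia-smi
# What PyTorch was built against vs. whether it can initialize CUDA:
docker exec xinference python -c "import torch; print(torch.__version__, torch.version.cuda)"
docker exec xinference python -c "import torch; print(torch.cuda.is_available())"
# Error 804 often points at CUDA forward-compatibility libraries being
# loaded on a GPU/driver combination that does not support them:
docker exec xinference ls /usr/local/cuda/compat 2>/dev/null
```

If the last command lists a `libcuda.so*`, the compat libraries shipped in the image may be shadowing the host driver.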

To Reproduce

To help us reproduce this bug, please provide the information below:

  1. official docker image: xprobe/xinference:v0.11.0, also tried on v0.10.x, v0.9.x, and v0.8.x with no luck
  2. Cuda version: 12.0; Driver version: 525.105.17
  3. I was able to deploy containers using CUDA successfully (e.g., LocalAI official docker image)
  4. Full stack of the error.
    /opt/conda/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/opt/conda/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
      warn(
    2024-05-11 21:55:37,671 xinference.core.supervisor 47 INFO Xinference supervisor 0.0.0.0:15251 started
    /opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
      return torch._C._cuda_getDeviceCount() > 0
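The `undefined symbol` warning from `torchvision.io` usually indicates a torch/torchvision ABI mismatch, which is worth ruling out separately from the CUDA error. A minimal sketch of a version-pairing check follows; the compatibility table is an assumption based on the commonly published release pairings, not something from this thread:

```python
# Sketch: check whether a torch/torchvision pair is a known-compatible combo.
# The table below is an assumption from the usual published release pairings.
KNOWN_PAIRS = {
    "0.16.0": "2.1.0",
    "0.16.1": "2.1.1",
    "0.16.2": "2.1.2",
    "0.17.0": "2.2.0",
    "0.17.1": "2.2.1",
}

def compatible(torchvision_version: str, torch_version: str) -> bool:
    """Return True if the versions are a known-matching pair.

    Only the base version is compared; local build tags such as
    "+cu121" are stripped first.
    """
    base = torch_version.split("+")[0]
    expected = KNOWN_PAIRS.get(torchvision_version)
    return expected is not None and base == expected

print(compatible("0.17.1", "2.2.1+cu121"))  # → True
print(compatible("0.17.1", "2.1.0"))        # → False
```

In practice one would feed in `torch.__version__` and `torchvision.__version__` from the container.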

Expected behavior

Expected to run the container smoothly with CUDA.

Additional context

Thanks again for your great project. I am willing to provide more info/logs if necessary.

@XprobeBot XprobeBot added bug Something isn't working gpu labels May 11, 2024
@XprobeBot XprobeBot added this to the v0.11.1 milestone May 11, 2024
@ChengjieLi28 (Contributor) commented May 13, 2024

@infinitr0us Could you please help me test this:
Build a new image based on our official image:

FROM xprobe/xinference:v0.11.0

RUN pip install torchvision==0.17.1

And then test it?

On my own machine with two GPUs, I can use xinference normally with the above method.
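For reference, the two-line Dockerfile above could be built and run along these lines. This is only a sketch: the tag name is made up, and the image's default entrypoint is assumed to start Xinference on its usual port 9997:

```shell
# Sketch: build the Dockerfile above and run it with GPU access.
# The tag "xinference:tv-fix" is arbitrary; 9997 is Xinference's default port.
docker build -t xinference:tv-fix .
docker run --gpus all -p 9997:9997 xinference:tv-fix
```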

@ChengjieLi28 (Contributor)

@infinitr0us Could you use this:

docker pull xprobe/xinference:nightly-bug_torchvision_version

to try again?

@infinitr0us (Author) commented May 13, 2024

@infinitr0us Could you please help me test this: Build a new image based on our official image:

FROM xprobe/xinference:v0.11.0

RUN pip install torchvision==0.17.1

And then test it?

On my own machine with two GPUs, I can use xinference normally with the above method.

Thanks a lot for your response. I tried installing torchvision 0.17.1 in the official Xinference docker image v0.11.0 via
sudo docker exec xinference pip install torchvision==0.17.1
At least the installation itself completed without problems, but the cudaGetDeviceCount() error is still there.
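Worth noting: `docker exec ... pip install` only patches the running container, so the change is lost if the container is recreated. To confirm which versions are actually in effect, something like the following works (a sketch, again assuming the container is named `xinference`):

```shell
# Print the torch/torchvision versions the running container actually imports.
docker exec xinference python -c \
  "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
```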
I am gonna give the nightly image a try and get back to you later.

@infinitr0us (Author)

@infinitr0us Could you use this:

docker pull xprobe/xinference:nightly-bug_torchvision_version

to try again?

Oooops... it seems that the cudaGetDeviceCount() error is still there...
Okay, I will examine the torch library setup in the LocalAI docker image and see if there is any difference.

@infinitr0us (Author) commented May 13, 2024

@ChengjieLi28 I tried with LocalAI again, and it seems that their docker image does not utilize local torch environment at all...
Now I see the problem: it must be a torch library issue with my CUDA and driver environment. Thanks a lot for spending your time helping me diagnose this. If the issue is limited to my machine only, it is just an edge case. Thanks again, and feel free to close it.
