
BUG: Unexpected error from cudaGetDeviceCount() with official docker image v0.11.0 #1477

Closed
infinitr0us opened this issue May 11, 2024 · 5 comments

@infinitr0us commented May 11, 2024
Describe the bug

Hi, I was trying to deploy the Docker image v0.11.0 on my machine, which has a GPU and drivers (CUDA 12.0) installed. However, an error always appears during initialization of the docker container:
"/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)"

Actually, I was able to see the hardware information being correctly loaded in the container Web UI:
[screenshot: Web UI showing the GPU hardware information]
But I was never able to execute models on the GPU. So I am pretty sure my GPU and CUDA environment is working (it was also tested with no issue on the LocalAI docker image). I was wondering whether this is a torch library issue or an Xinference related issue? I would appreciate any help and am willing to provide more logs if needed.
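For what it's worth, a quick way to narrow down where the failure sits is to compare what the driver and PyTorch each report from inside the container. This is only a diagnostic sketch; the container name `xinference` is an assumption:

```shell
# Diagnostic sketch (container name "xinference" is assumed).
# Driver-level view of the GPU:
docker exec xinference nvidia-smi
# What PyTorch was built against vs. whether it can initialize CUDA:
docker exec xinference python -c "import torch; print(torch.__version__, torch.version.cuda)"
docker exec xinference python -c "import torch; print(torch.cuda.is_available())"
# Error 804 often points at CUDA forward-compatibility libraries being
# loaded on a GPU/driver combination that does not support them:
docker exec xinference ls /usr/local/cuda/compat 2>/dev/null
```

If the last command lists a `libcuda.so*`, the compat libraries shipped in the image may be shadowing the host driver.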

To Reproduce

To help us reproduce this bug, please provide the information below:

  1. official docker image: xprobe/xinference:v0.11.0, also tried on v0.10.x, v0.9.x, and v0.8.x with no luck
  2. Cuda version: 12.0; Driver version: 525.105.17
  3. I was able to deploy containers using CUDA successfully (e.g., LocalAI official docker image)
  4. Full stack of the error.
    /opt/conda/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/opt/conda/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
      warn(
    2024-05-11 21:55:37,671 xinference.core.supervisor 47 INFO Xinference supervisor 0.0.0.0:15251 started
    /opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
      return torch._C._cuda_getDeviceCount() > 0
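The `undefined symbol` warning from `torchvision.io` usually indicates a torch/torchvision ABI mismatch, which is worth ruling out separately from the CUDA error. A minimal sketch of a version-pairing check follows; the compatibility table is an assumption based on the commonly published release pairings, not something from this thread:

```python
# Sketch: check whether a torch/torchvision pair is a known-compatible combo.
# The table below is an assumption from the usual published release pairings.
KNOWN_PAIRS = {
    "0.16.0": "2.1.0",
    "0.16.1": "2.1.1",
    "0.16.2": "2.1.2",
    "0.17.0": "2.2.0",
    "0.17.1": "2.2.1",
}

def compatible(torchvision_version: str, torch_version: str) -> bool:
    """Return True if the versions are a known-matching pair.

    Only the base version is compared; local build tags such as
    "+cu121" are stripped first.
    """
    base = torch_version.split("+")[0]
    expected = KNOWN_PAIRS.get(torchvision_version)
    return expected is not None and base == expected

print(compatible("0.17.1", "2.2.1+cu121"))  # → True
print(compatible("0.17.1", "2.1.0"))        # → False
```

In practice one would feed in `torch.__version__` and `torchvision.__version__` from the container.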

Expected behavior

Expected to run the container smoothly with CUDA.

Additional context

Thanks again for your great project. I am willing to provide more info/logs if necessary.

@XprobeBot XprobeBot added bug Something isn't working gpu labels May 11, 2024
@XprobeBot XprobeBot added this to the v0.11.1 milestone May 11, 2024
@ChengjieLi28 (Contributor) commented May 13, 2024

@infinitr0us Could you please help me test this:
Build a new image based on our official image:

FROM xprobe/xinference:v0.11.0

RUN pip install torchvision==0.17.1

And then test it?

On my own machine with two GPUs, I can use xinference normally with the above method.
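For reference, the two-line Dockerfile above could be built and run along these lines. This is only a sketch: the tag name is made up, and the image's default entrypoint is assumed to start Xinference on its usual port 9997:

```shell
# Sketch: build the Dockerfile above and run it with GPU access.
# The tag "xinference:tv-fix" is arbitrary; 9997 is Xinference's default port.
docker build -t xinference:tv-fix .
docker run --gpus all -p 9997:9997 xinference:tv-fix
```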

@ChengjieLi28 (Contributor)

@infinitr0us Could you use this:

docker pull xprobe/xinference:nightly-bug_torchvision_version

to try again?

@infinitr0us (Author) commented May 13, 2024

@infinitr0us Could you please help me test this: Build a new image based on our official image:

FROM xprobe/xinference:v0.11.0

RUN pip install torchvision==0.17.1

And then test it?

On my own machine with two GPUs, I can use xinference normally with the above method.

Thanks a lot for your response. I tried installing torchvision 0.17.1 in the official Xinference docker image v0.11.0 via
sudo docker exec xinference pip install torchvision==0.17.1
At least the installation itself completed without problems, but the cudaGetDeviceCount() error is still there.
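Worth noting: `docker exec ... pip install` only patches the running container, so the change is lost if the container is recreated. To confirm which versions are actually in effect, something like the following works (a sketch, again assuming the container is named `xinference`):

```shell
# Print the torch/torchvision versions the running container actually imports.
docker exec xinference python -c \
  "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
```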
I am gonna give the nightly image a try and get back to you later.

@infinitr0us (Author)

@infinitr0us Could you use this:

docker pull xprobe/xinference:nightly-bug_torchvision_version

to try again?

Oooops... it seems that the cudaGetDeviceCount() error is still there...
Okay, I will examine the torch library setup in the LocalAI docker image and see if there is any difference.

@infinitr0us (Author) commented May 13, 2024

@ChengjieLi28 I tried with LocalAI again, and it seems that their docker image does not utilize local torch environment at all...
Now I see the problem: it must be a torch library issue with my CUDA and driver environment. Thanks a lot for spending your time helping me diagnose this. If the issue is limited to my machine only, it is just an edge case. Thanks again, and feel free to close it.
