GPU variant has issues recognizing the GPU. #1035

Open · 1 of 3 tasks
scepterus opened this issue Sep 28, 2023 · 43 comments
Labels
bug Something isn't working

Comments

@scepterus

scepterus commented Sep 28, 2023

πŸ› Bug Report

  • πŸ“ I've Included a ZIP file containing my librephotos log files
  • ❌ I have looked for similar issues (including closed ones)
  • 🎬 (If applicable) I've provided pictures or links to videos that clearly demonstrate the issue

πŸ“ Description of issue:

When scanning with the new GPU Docker variant, I get the following errors:

/usr/local/lib/python3.10/dist-packages/rest_framework/pagination.py:200: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'api.models.person.Person'> QuerySet.
  paginator = self.django_paginator_class(queryset, page_size)
  return torch._C._cuda_getDeviceCount() > 0
  File "/usr/local/lib/python3.10/dist-packages/django_q/worker.py", line 88, in worker
    res = f(task["args"], **task["kwargs"])
  File "/code/api/directory_watcher.py", line 411, in face_scan_job
    photo._extract_faces()
  File "/code/api/models/photo.py", line 729, in _extract_faces
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)

Also attached.
message.txt

πŸ” How can we reproduce it:

Run the Docker container with an NVIDIA GPU (in my case a GTX 1050); it does not get recognized in the server stats.

Please provide additional information:

  • πŸ’» Operating system:
  • βš™ Architecture (x86 or ARM): x86
  • πŸ”’ Librephotos version: latest
  • πŸ“Έ Librephotos installation method (Docker, Kubernetes, .deb, etc.): Docker
    • πŸ‹ If Docker or Kubernets, provide docker-compose image tag: Latest
  • πŸ“ How is you picture library mounted (Local file system (Type), NFS, SMB, etc.): Direct
  • ☁ If you are virtualizing librephotos, Virtualization platform (Proxmox, Xen, HyperV, etc.):
scepterus added the bug label on Sep 28, 2023
@scepterus
Author

scepterus commented Oct 6, 2023

@derneuere
I found the issue.
CUDA was not seen correctly due to this part:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [all]

Once I removed that and ran nvidia-smi, CUDA showed the version that's installed.
However, now I face another issue: the CUDA installed in the Docker image is older than what I have on the host.
So I get this:

05:13:14 [Q] ERROR Failed 'api.directory_watcher.face_scan_job' (illinois-asparagus-stream-tennis) - Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 804, reason: forward compatibility was attempted on non supported HW : Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/django_q/worker.py", line 88, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/code/api/directory_watcher.py", line 411, in face_scan_job
    photo._extract_faces()
  File "/code/api/models/photo.py", line 729, in _extract_faces
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 804, reason: forward compatibility was attempted on non supported HW

You might want to update the guide and remove that deploy section if this is the case for everyone.
I will try to update the CUDA version in the container and see what happens.

@scepterus
Author

UPDATE:
Just found out this needs to be done in the Dockerfile, so we'll need to figure this out for everyone. Maybe a check to see which CUDA version is installed on the host, and then populate the version that gets pulled? See the sketch below.
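
A rough sketch of that idea (my own, not the project's actual build process; the CUDA_VERSION build argument is an assumption, and the grep pattern relies on the "CUDA Version" field shown in the nvidia-smi header):

# Read the CUDA version the host driver reports and hand it to the image build.
HOST_CUDA=$(nvidia-smi | grep -oP 'CUDA Version: \K[0-9.]+')
echo "Host driver reports CUDA ${HOST_CUDA}"
docker compose build --build-arg CUDA_VERSION="${HOST_CUDA}" backend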

@scepterus
Author

@derneuere Any chance of getting this fixed? I do not want to go back to the CPU variant if this will be fixed soon, but right now this thing is totally broken.

@derneuere
Member

dlib is compiled against a specific version of CUDA, in this case CUDA 11.7.1 with cuDNN 8.

It complains that "forward compatibility" was attempted and failed, which usually means the host system has old drivers. The cause could be either an old graphics card or old drivers.

The graphics card cannot be the reason, as I develop on a system with a 1050 Ti Max-Q, which works fine.

Please update the driver or change the deploy part. On my system I use

      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

I can't make dlib compatible with multiple versions; compiling it at runtime would lead to half an hour of startup time, and replacing it with something more flexible is not doable for me at the moment due to time constraints.
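
(For reference, a quick way to check from inside the container which CUDA the Python stack was built against; this is a sketch using the public torch and dlib APIs, and the dlib device call will raise the same cudaGetDevice error if CUDA cannot initialize.)

python3 -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version())"
python3 -c "import dlib; print(dlib.DLIB_USE_CUDA, dlib.cuda.get_num_devices())"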

@scepterus
Author

As you can see in my previous comment, if I add that part to the compose file, CUDA is not detected inside the container.
The error I attached was from when the host machine has CUDA 12 while the container has CUDA 11. It still calls it forward compatibility.

@derneuere
Member

Hmm, I will try to bump everything to CUDA 12. According to the docs, it should be backwards compatible. Let's see if that actually works.

@scepterus
Author

Cool, let me know if I can help.

@derneuere
Member

Alright I pushed a fix. Should be available in half an hour. Let me know if that fixes the issue for you :)

@scepterus
Author

Is this on dev or stable?

@derneuere
Member

Only on dev for now :)

@scepterus
Author

Ah, can I pull just gpu-dev by adding -dev to it in the docker compose file?

@derneuere
Member

Yes, works the same way as the other image :)

@scepterus
Author

Sadly, I've been trying to download that image for two days now; it just hangs and times out. I need to restart and hope it fully downloads.

@scepterus
Author

scepterus commented Nov 4, 2023

INFO:ownphotos:Can't extract face information on photo: photo
INFO:ownphotos:HTTPConnectionPool(host='localhost', port=8005): Max retries exceeded with url: /face-locations (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f51db061ed0>: Failed to establish a new connection: [Errno 111] Connection refused'))

with the latest dev GPU image.

[2023-11-04 16:38:56 +0000] [12097] [INFO] Autorestarting worker after current request.
/usr/local/lib/python3.10/dist-packages/rest_framework/pagination.py:200: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'api.models.person.Person'> QuerySet.
  paginator = self.django_paginator_class(queryset, page_size)
[2023-11-04 16:38:57 +0000] [12097] [INFO] Worker exiting (pid: 12097)
[2023-11-04 18:38:57 +0200] [16597] [INFO] Booting worker with pid: 16597
use SECRET_KEY from file
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

@scepterus
Author

@derneuere any idea how we move past this?

@derneuere
Member

I can't reproduce this, and I am pretty sure this issue is not on my side. Do other GPU-accelerated images work for you?

Currently the only bug I can reproduce is #1056

@scepterus
Author

Last time I tested the CUDA test container it worked; let me verify that now.
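
(The kind of test meant here is a run like the following; a sketch, with the image tag taken from the host-check script output later in this thread:)

docker run --rm --gpus all nvidia/cuda:12.2.2-runtime-ubuntu20.04 nvidia-smi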

@scepterus
Author

scepterus commented Nov 9, 2023

==========
== CUDA ==
==========
CUDA Version 12.2.2
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
Thu Nov  9 04:57:47 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:08:00.0 Off |                  N/A |
|  0%   47C    P0              N/A /  70W |      0MiB /  2048MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Here's the output; it looks like it is working inside the Docker container.

@scepterus
Author

scepterus commented Nov 9, 2023

I added those parts back to the docker compose file like the guide says, and this is what I get now:

thumbnail: service starting
Traceback (most recent call last):
  File "/code/service/face_recognition/main.py", line 4, in <module>
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)

When I connect to the container and run nvidia-smi, it outputs correctly:

Thu Nov  9 07:11:58 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:08:00.0 Off |                  N/A |
|  0%   47C    P0              N/A /  70W |      0MiB /  2048MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

@scepterus
Author

scepterus commented Dec 1, 2023

@derneuere after last night's update to the backend, things changed.
When I did a scan for new photos, it managed to extract data from them, but I get this error:

INFO:ownphotos:Can't extract face information on photo: /location/photo.png
INFO:ownphotos:HTTPConnectionPool(host='localhost', port=8005): Max retries exceeded with url: /face-locations (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc88e1979d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

@scepterus
Author

scepterus commented Dec 1, 2023

Also, a few things like these:

[2023-12-01 06:24:35 +0000] [3773] [INFO] Autorestarting worker after current request.
[2023-12-01 06:24:36 +0000] [3773] [INFO] Worker exiting (pid: 3773)
[2023-12-01 08:24:36 +0200] [3785] [INFO] Booting worker with pid: 3785
use SECRET_KEY from file

You'll notice the top two timestamps are in GMT and the last one is in GMT+2. That might cause issues if you compare them and set a timeout based on the difference.

These messages repeat a few times.
I hope this helps you narrow down the issues.

Side note:

INFO:ownphotos:Could not handle /location/IMG_20071010_150554_2629.jxl, because unable to call thumbnail
  VipsForeignLoad: "/location//IMG_20071010_150554_2629.jxl" is not a known file format

Wasn't JXL fixed?

@derneuere
Member

JXL is handled by thumbnail-service / ImageMagick and not by vips. Can you look into the log files for face-service and thumbnail-service and post any errors here?

@scepterus
Author

Regarding JXL: why is it erroring out if vips is not supposed to handle these files?
As for the logs, can you be more specific?
In the logs folder I only found face_recognition.log; here's its output:

cat face_recognition.log 
Traceback (most recent call last):
  File "/code/service/face_recognition/main.py", line 1, in <module>
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error

@scepterus
Author

scepterus commented Dec 12, 2023

Here's what I get when loading the latest backend:

/usr/local/lib/python3.10/dist-packages/picklefield/fields.py:78: RuntimeWarning: Pickled model instance's Django version 4.2.7 does not match the current version 4.2.8.
  return loads(value)
/usr/local/lib/python3.10/dist-packages/rest_framework/pagination.py:200: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'api.models.person.Person'> QuerySet.
  paginator = self.django_paginator_class(queryset, page_size)

Unauthorized: /api/albums/date/list/

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

@derneuere
Member

Still the same error: the backend can't find the GPU. I think this has something to do with Docker or PyTorch and not with LibrePhotos. Can you look for similar issues and check whether other containers that support GPU acceleration work?
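
(One quick way to separate PyTorch from LibrePhotos is a check like this, run against the backend container; a sketch only, assuming the container_name "backend" from the compose file below:)

docker exec backend python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"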

@scepterus
Author

If I see CUDA correctly in the test Docker container from NVIDIA, is that enough to rule out the infrastructure? Or is there another test that will definitively prove this?

@scepterus
Author

Forget what I said. I just ran nvidia-smi inside the backend container, and it works, so the container can reach the GPU. It must be an issue in the software.

@scepterus
Author

scepterus commented Dec 20, 2023

Traceback (most recent call last):
  File "/code/service/face_recognition/main.py", line 1, in <module>
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

Here's my env:

  backend:
    image: reallibrephotos/librephotos-gpu:dev
    container_name: backend
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu

Yet, as mentioned, CUDA is seen.

@scepterus
Author

==========
== CUDA ==
==========
CUDA Version 12.1.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Wed Dec 20 07:19:30 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |

I added nvidia-smi to the start of the entrypoint because Bing Copilot suggested running it before any other command. Here's the output.
I have created this pull request:
https://github.com/LibrePhotos/librephotos-docker/pull/113
so we can update that going forward.
Also, note the deprecation notice; I think we need to stay on the latest CUDA image for it to function properly.

@derneuere
Member

I think this is the relevant issue from PyTorch: pytorch/pytorch#49081. I added nvidia-modprobe to the container. Let's see if that works.
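
(Roughly what that addition might look like in the entrypoint; a sketch, not the actual change, and the flags are a common nvidia-modprobe invocation rather than something quoted from this thread:)

# Try to load the NVIDIA unified memory module and create the device node before the services start.
nvidia-modprobe -u -c=0 || true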

@scepterus
Author

Traceback (most recent call last):
  File "/code/service/face_recognition/main.py", line 1, in <module>
    import face_recognition
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/__init__.py", line 7, in <module>
    from .api import load_image_file, face_locations, batch_face_locations, face_landmarks, face_encodings, compare_faces, face_distance
  File "/usr/local/lib/python3.10/dist-packages/face_recognition/api.py", line 26, in <module>
    cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /tmp/builds/dlib/dlib/cuda/gpu_data.cpp:204. code: 999, reason: unknown error
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

Still this. Are you initializing the CUDA drivers and visible devices before the code runs? It does not look like my pull request was merged, as otherwise I would see the result of nvidia-smi in these logs.

@scepterus
Author

After the modprobe and the pull request merge, still the same issue.
I get the CUDA info at the start, but the error still shows up.
Don't we need the CUDA drivers as well? And I really think we need the latest CUDA image, because my host reports CUDA 12.2 while the one in the container is 12.1 and is deprecated.

@derneuere
Member

It should be backwards compatible, and we need this version because PyTorch is built against the same one. I also have a 1050 Ti with CUDA 12.2 and driver version 535.129.03, and it works.

The CUDA drivers should be installed on the host system; the Docker image only needs the base image from NVIDIA, which we already use. Can you check whether different drivers are available for your system?
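
(One way to check on a Debian/Ubuntu-style host; a sketch, and the package names are an assumption since they differ between distributions:)

apt list --installed 2>/dev/null | grep -i '^nvidia'           # NVIDIA packages currently installed
apt-cache search '^nvidia-driver'                              # driver packages available from the configured repos
nvidia-smi --query-gpu=driver_version --format=csv,noheader    # driver version currently in use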

My system:

| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8              N/A / ERR! |     96MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

My env looks like this:

  backend:
    image: reallibrephotos/librephotos-gpu:${tag}
    container_name: backend
    restart: unless-stopped
    volumes:
      - ${scanDirectory}:/data
      - ${data}/protected_media:/protected_media
      - ${data}/logs:/logs
      - ${data}/cache:/root/.cache
    environment:
      - SECRET_KEY=${shhhhKey:-}
      - BACKEND_HOST=backend
      - ADMIN_EMAIL=${adminEmail:-}
      - ADMIN_USERNAME=${userName:-}
      - ADMIN_PASSWORD=${userPass:-}
      - DB_BACKEND=postgresql
      - DB_NAME=${dbName}
      - DB_USER=${dbUser}
      - DB_PASS=${dbPass}
      - DB_HOST=${dbHost}
      - DB_PORT=5432
      - MAPBOX_API_KEY=${mapApiKey:-}
      - WEB_CONCURRENCY=${gunniWorkers:-1}
      - SKIP_PATTERNS=${skipPatterns:-}
      - ALLOW_UPLOAD=${allowUpload:-false}
      - CSRF_TRUSTED_ORIGINS=${csrfTrustedOrigins:-}
      - DEBUG=0
      - HEAVYWEIGHT_PROCESS=${HEAVYWEIGHT_PROCESS:-}
    depends_on:
      db:
        condition: service_healthy
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

I added export CUDA_VISIBLE_DEVICES=0 to entrypoint.sh; maybe that will make a difference.

@scepterus
Author

Here's my output inside the container:

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Fri Dec 22 00:31:46 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:08:00.0 Off |                  N/A |
|  0%   41C    P0              N/A /  70W |      0MiB /  2048MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I added export CUDA_VISIBLE_DEVICES=0

That would just mean no devices would be registered.

@scepterus
Author

[image attached]
Yours even shows an error in detecting the watt limit. Is that from inside the container or from the host system?

@derneuere
Member

export CUDA_VISIBLE_DEVICES=0 means that the 0th device will be visible, which in your list is your only GPU.
Yeah, that probably has something to do with it being a laptop, but it still works :)
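
(CUDA_VISIBLE_DEVICES takes zero-based device indices, so 0 selects the first GPU. A sketch of how to see the effect inside the container, assuming CUDA initializes at all:)

CUDA_VISIBLE_DEVICES=0 python3 -c "import torch; print(torch.cuda.device_count())"   # 1: only the first GPU is exposed
CUDA_VISIBLE_DEVICES="" python3 -c "import torch; print(torch.cuda.device_count())"  # 0: an empty value hides every GPU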

@scepterus
Author

scepterus commented Dec 22, 2023

that the 0th device will be visible

The naming is a bit confusing then.

but it still works :)

The question is: maybe it's such a unique case that it works when you test, but a desktop card requires something different?

@scepterus
Author

I added it manually to my entrypoint; it did not help.
I made a quick script to check the host:
LibrePhotos/librephotos-docker#115
My host passes all the checks.

@derneuere
Member

I used your configuration for the GPU. This also works on my machine.

    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu

I also executed your HostCuda script and it passed:

CUDA-capable GPU detected.
x86_64
Linux version is supported.
GCC is installed.
Kernel headers and development packages are installed.
NVIDIA binary GPU driver is installed.
Docker is installed.
Unable to find image 'nvidia/cuda:12.2.2-runtime-ubuntu20.04' locally
12.2.2-runtime-ubuntu20.04: Pulling from nvidia/cuda
96d54c3075c9: Pull complete 
db26cf78ae4f: Pull complete 
5adc7ab504d3: Pull complete 
e4f230263527: Pull complete 
95e3f492d47e: Pull complete 
35dd1979297e: Pull complete 
39a2c88664b3: Pull complete 
d8f6b6cd09da: Pull complete 
fe19bbed4a4a: Pull complete 
Digest: sha256:7df325b76ef5087ac512a6128e366b7043ad8db6388c19f81944a28cd4157368
Status: Downloaded newer image for nvidia/cuda:12.2.2-runtime-ubuntu20.04
NVIDIA Container Toolkit is installed.

Can you try this suggested fix on your host machine? pytorch/pytorch#49081 (comment)

@scepterus
Author

Do you mean this:

sudo modprobe -r nvidia_uvm && sudo modprobe nvidia_uvm

Because that returns:
modprobe: FATAL: Module nvidia_uvm not found.

So my host is set up like yours (at least according to the prerequisite tests) and the compose file is the same. What else could be different?
nvidia_uvm is not one of the prerequisites in the NVIDIA documentation.

@derneuere
Member

Alright, just execute the second part, sudo modprobe nvidia_uvm; the first part only removes an already loaded nvidia_uvm module.

I am not basing the debug commands on the documentation, as it is usually incomplete, but on the PyTorch issue on GitHub, which usually provides better pointers on how to fix the error.

I just use Kubuntu 22.04; do you use something unique like Arch?

@scepterus
Author

scepterus commented Dec 23, 2023

sudo modprobe nvidia_uvm
modprobe: FATAL: Module nvidia_uvm not found in directory /lib/modules/6.1.55-production+truenas

I just use Kubuntu 22.04; do you use something unique like Arch?

Nope, Debian bookworm.

@derneuere
Member

This sounds like the GPU drivers are not actually installed correctly, according to this Ask Ubuntu thread: https://askubuntu.com/questions/1413512/syslog-error-modprobe-fatal-module-nvidia-not-found-in-directory-lib-module
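
(A few host-side checks that follow from that; a sketch, since paths and package names differ between distributions:)

lsmod | grep nvidia                                 # which NVIDIA kernel modules are currently loaded
find /lib/modules/$(uname -r) -name 'nvidia*uvm*'   # is the nvidia_uvm module built for the running kernel?
modinfo nvidia_uvm                                  # fails if the module is missing entirely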
