Device permissions set by the device-plugin cause unexpected access() syscall responses, resulting in PyTorch failures #65

Closed
elukey opened this issue May 15, 2024 · 8 comments

Comments

elukey commented May 15, 2024

Problem Description

Hi folks!

The Wikimedia Foundation has been working with AMD GPUs for a long time, and we are now experimenting with using them under Kubernetes (we run KServe as the platform for ML model inference). In https://phabricator.wikimedia.org/T362984 we tried to figure out why PyTorch 2.1+ (ROCm variant) showed the following failures when initializing the GPU from Python:

>>> import torch
>>> torch.cuda.is_available()
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
False

The issue didn't reproduce with PyTorch 2.0 and ROCm 5.4. We used strace to get more information, and the following popped up:

access("/dev/dri/renderD128", F_OK)     = -1 EPERM (Operation not permitted)

The above syscall is issued only by PyTorch 2.1+ (ROCm variant), not by previous versions. We checked file, path, and directory permissions, but everything checked out: we allow "other" to read/write the render and kfd devices, see #39 for more info.

After a lot of tests, the culprit seems to be the device permissions that the device plugin assigns to the devices exposed to the containers. At the moment it is rw (that should be set in this line), whereas Docker by default sets rwm, which also allows mknod (the only reference we found is https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt, but we use cgroups v2).
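
For reference, the permission string ends up in the Allocate response that the plugin returns to the kubelet. Below is a minimal sketch based on the Kubernetes device plugin API (pluginapi v1beta1), not this plugin's actual code; the import path and device paths are assumptions for illustration:

package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// allocateResponse sketches an Allocate response exposing the GPU device
// nodes to a container. The Permissions string is what shapes the device
// cgroup rule applied to the container: "rw" here, while Docker's default
// for --device is "rwm" (read, write, mknod).
func allocateResponse() *pluginapi.AllocateResponse {
	return &pluginapi.AllocateResponse{
		ContainerResponses: []*pluginapi.ContainerAllocateResponse{{
			Devices: []*pluginapi.DeviceSpec{
				{HostPath: "/dev/kfd", ContainerPath: "/dev/kfd", Permissions: "rw"},
				{HostPath: "/dev/dri/renderD128", ContainerPath: "/dev/dri/renderD128", Permissions: "rw"},
			},
		}},
	}
}

func main() {
	resp := allocateResponse()
	fmt.Println(resp.ContainerResponses[0].Devices[0].Permissions) // "rw"
}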

We tried running Docker directly on the k8s node: with something like --device /dev/dri/renderD128:/dev/dri/renderD128:rw the access failure can be reproduced, whereas with --device /dev/dri/renderD128:/dev/dri/renderD128:rwm the access syscall works.

Interestingly, access fails with EPERM only when called with F_OK; it returns consistent results for the other modes (R_OK, W_OK, X_OK).
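
For anyone who wants to double-check this outside of PyTorch, a small probe along these lines (a sketch using golang.org/x/sys/unix, not code we ran verbatim) should mirror what strace shows when run inside the affected container:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	path := "/dev/dri/renderD128"
	modes := []struct {
		name string
		mode uint32
	}{
		{"F_OK", unix.F_OK},
		{"R_OK", unix.R_OK},
		{"W_OK", unix.W_OK},
	}
	for _, m := range modes {
		// With the device cgroup rule restricted to "rw", the F_OK call is the
		// one that comes back with EPERM, while R_OK/W_OK behave as expected.
		err := unix.Access(path, m.mode)
		fmt.Printf("access(%s, %s) = %v\n", path, m.name, err)
	}
}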

We are still not sure why allowing mknod makes the access syscall work with F_OK, but it seemed worth starting a discussion here, since PyTorch is a very big use case and more people will probably report this problem in the future.

We also tried PyTorch 2.3 with ROCm 6.0: same issue.

We use the default seccomp and AppArmor profiles for all our containers, but we ruled out their involvement (at least there is no indication that they play any role here). We also run containers with capabilities dropped.

The AMD k8s-device-plugin runs as a standalone daemon on the k8s worker node, not as a DaemonSet.

Operating System

Debian Bullseye

CPU

Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

elukey commented May 15, 2024

From https://docs.kernel.org/admin-guide/cgroup-v2.html#device-controller I read the following:

Device controller manages access to device files. It includes both creation of new device files (using mknod), and access to the existing device files.
Cgroup v2 device controller has no interface files and is implemented on top of cgroup BPF. To control access to device files, a user may create bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach them to cgroups with BPF_CGROUP_DEVICE flag. On an attempt to access a device file, corresponding BPF programs will be executed, and depending on the return value the attempt will succeed or fail with -EPERM.

So it is probably the eBPF program attached to the cgroup's device controller that determines the return value of access? The only explanation I can come up with is that access(F_OK) triggers a mknod-type check behind the scenes, which is then denied with EPERM.
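
To make the hypothesis concrete: the BPF program only sees a (device type, major:minor, access-type bitmask) tuple, and the access-type bits in the kernel UAPI are BPF_DEVCG_ACC_MKNOD (1), BPF_DEVCG_ACC_READ (2) and BPF_DEVCG_ACC_WRITE (4). Below is a plain-Go illustration of the check that an "rw"-only rule encodes, not the actual BPF bytecode runc generates; if access(F_OK) results in a device-cgroup check carrying the mknod bit, a rule without "m" would deny it with EPERM:

package main

import "fmt"

// Bit values mirror the kernel's BPF_DEVCG_ACC_* constants.
const (
	accMknod = 1 << 0
	accRead  = 1 << 1
	accWrite = 1 << 2
)

// allowed models an "rw"-only rule for /dev/dri/renderD128 (char 226:128):
// any request carrying a bit outside the allowed mask is denied, and the
// kernel reports that denial to the caller as -EPERM.
func allowed(major, minor, access uint32) bool {
	if major != 226 || minor != 128 {
		return false
	}
	const mask = accRead | accWrite
	return access&^uint32(mask) == 0
}

func main() {
	fmt.Println(allowed(226, 128, accRead|accWrite)) // true: plain read/write passes
	fmt.Println(allowed(226, 128, accMknod))         // false: a mknod-type check is denied
}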

y2kenny commented May 15, 2024

@elukey, thanks for the detailed investigation. Were you able to reproduce the same permission issue if you try using supplementalGroups (with render group) per #39 suggestion?

elukey commented May 16, 2024

@elukey, thanks for the detailed investigation. Were you able to reproduce the same permission issue if you try using supplementalGroups (with render group) per #39 suggestion?

Hi @y2kenny! Thanks for following up! I didn't try that suggestion, since we added a special udev rule that allows rw for "other" on the GPU device nodes (both kfd and renderDXXX). We use this config only on k8s workers (where only containers run and we don't have multiple users sharing the system), and it has worked great so far. It seems unrelated to the current problem though; let me know if you think otherwise and I'll try it out.
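
For completeness, the udev rule is roughly of this shape (an illustrative sketch, not our exact config; the file name and modes are assumptions):

# /etc/udev/rules.d/70-gpu-k8s-worker.rules (illustrative)
KERNEL=="kfd", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD*", MODE="0666"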

elukey commented May 16, 2024

This is the current set of permissions as seen inside a container with a GPU mounted on it:

$ kubectl exec nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx -n experimental  -- ls -l /dev/dri
total 0
crw-rw---- 1 root video 226,   1 May 14 16:13 card1
crw-rw-rw- 1 root   106 226, 128 May 14 16:13 renderD128

$ kubectl exec nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx -n experimental  -- ls -l /dev/kfd
crw-rw-rw- 1 root 106 242, 0 May 14 16:13 /dev/kfd

In this case the render group doesn't really matter, at least IIUC.

elukey commented May 16, 2024

I think that this is the issue:

opencontainers/runc@81707ab
opencontainers/runc@efb8552

So in cgroups v2 the device permissions are checked via eBPF, and the program is attached to the process/container by runc. The Debian Bullseye version of runc is 1.0.0~rc93, while from the git tags I see that 1.0.0~rc94 is the first release containing the commits. I'll try to verify this and report back; hopefully no change is needed for the device plugin!

elukey commented May 20, 2024

Opened https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071269 to ask Debian to include the above patches in Bullseye's runc. We should probably warn folks in the README (or similar) that only runc >= 1.0.0~rc94 works with this plugin.

elukey commented Jun 12, 2024

@y2kenny I confirm that the issue was the runc version, everything works fine now :)

elukey closed this as completed Jun 12, 2024

y2kenny commented Jun 12, 2024

Much appreciated. Thanks.
