[NOVA] Flaky conda GPU builds during Initialize Containers step #5191

Closed
atalman opened this issue May 9, 2024 · 0 comments · Fixed by #5243

atalman commented May 9, 2024

Nova GPU-based conda build workflows are failing. This happens mostly with CUDA 12.4, but it can also be observed with 12.1 and 11.8.
The failure is flaky: the same jobs pass from time to time.

The failure occurs when starting the container:

Status: Downloaded newer image for pytorch/conda-builder:cuda12.4
  docker.io/pytorch/conda-builder:cuda12.4
  /usr/bin/docker create --name 42fc3baf03494dc5b4f0bc0b1e8e1dc4_pytorchcondabuildercuda124_f46ca7 --label 9f63b4 --workdir /__w/audio/audio --network github_network_68d0125cf865468b98e48e08a98dd61d --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/ec2-user/actions-runner/_work":"/__w" -v "/home/ec2-user/actions-runner/externals":"/__e":ro -v "/home/ec2-user/actions-runner/_work/_temp":"/__w/_temp" -v "/home/ec2-user/actions-runner/_work/_actions":"/__w/_actions" -v "/home/ec2-user/actions-runner/_work/_tool":"/__w/_tool" -v "/home/ec2-user/actions-runner/_work/_temp/_github_home":"/github/home" -v "/home/ec2-user/actions-runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" pytorch/conda-builder:cuda12.4 "-f" "/dev/null"
  f9e4cf858076e4a6ba5faeaa81174b7d4398938049e3afb4add73fff065874d6
  /usr/bin/docker start f9e4cf858076e4a6ba5faeaa81174b7d4398938049e3afb4add73fff065874d6
  Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.4, please update your driver to a newer version, or use an earlier cuda container: unknown
  Error: failed to start containers: f9e4cf858076e4a6ba5faeaa81174b7d4398938049e3afb4add73fff065874d6
  Error: Docker start fail with exit code 1
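
The key line in the log above is the `nvidia-container-cli` requirement error: the container asks for `cuda>=12.4` and the driver on the runner does not satisfy it. As a minimal sketch for triaging such logs, the unmet version can be extracted and compared against the CUDA version a runner's driver reports. The helper names `required_cuda_version` and `driver_satisfies` are hypothetical and not part of any Nova tooling, and the regular expression assumes the message format shown in this issue.

```python
import re
from typing import Optional, Tuple

# Matches the constraint reported by nvidia-container-cli, e.g. "cuda>=12.4".
# Assumption: the message format is the one quoted in this issue.
_REQUIREMENT_RE = re.compile(r"unsatisfied condition: cuda>=(\d+)\.(\d+)")


def required_cuda_version(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) of the unmet CUDA requirement, or None if absent."""
    match = _REQUIREMENT_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_satisfies(driver_cuda: Tuple[int, int], required: Tuple[int, int]) -> bool:
    """Compare versions as integer tuples, so that 12.10 sorts above 12.4."""
    return driver_cuda >= required


# The error line copied from the log in this issue.
log_line = (
    "nvidia-container-cli: requirement error: unsatisfied condition: "
    "cuda>=12.4, please update your driver to a newer version, "
    "or use an earlier cuda container: unknown"
)
required = required_cuda_version(log_line)
```

Comparing tuples of integers rather than strings or floats avoids misordering versions such as 12.10 and 12.4. The driver-side version passed to `driver_satisfies` would have to come from the runner itself, for example from the `nvidia-smi` header.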

Audio: https://github.com/pytorch/audio/actions/runs/9016656951/job/24773679889
Vision: https://github.com/pytorch/vision/actions/runs/9001122121/job/24726698313

This may be related to changes in:

docker.io/pytorch/conda-builder:cuda12.1

and

docker.io/pytorch/conda-builder:cuda12.4
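
One way to compare the two image tags is to look at the CUDA constraint each one declares. The sketch below assumes the images set the `NVIDIA_REQUIRE_CUDA` environment variable, which the NVIDIA container runtime uses for requirement checks; that assumption is not confirmed in this issue. The input is the JSON list printed by `docker inspect --format '{{json .Config.Env}}' <image>`. The function name and the sample data are hypothetical.

```python
import json
from typing import Optional


def nvidia_cuda_requirement(env_json: str) -> Optional[str]:
    """Return the NVIDIA_REQUIRE_CUDA value from a JSON list of KEY=value strings."""
    for entry in json.loads(env_json):
        # Split on the first "=" only; the value itself may contain "=".
        key, _, value = entry.partition("=")
        if key == "NVIDIA_REQUIRE_CUDA":
            return value
    return None


# Hypothetical sample data, NOT taken from the real conda-builder images.
sample_env = json.dumps([
    "PATH=/usr/local/cuda/bin:/usr/bin",
    "NVIDIA_REQUIRE_CUDA=cuda>=12.4",
])
requirement = nvidia_cuda_requirement(sample_env)
```

Running this against the output for each tag would show whether the declared constraint differs between the images, which could help confirm or rule out the image change as the cause.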
@atalman atalman changed the title [NOVA] Flaky torchaudio and torchvision conda GPU builds are faling [NOVA] Flaky torchaudio and torchvision conda GPU builds are failing during Initialize Containers step May 9, 2024
@atalman atalman changed the title [NOVA] Flaky torchaudio and torchvision conda GPU builds are failing during Initialize Containers step [NOVA] Flaky conda GPU builds during Initialize Containers step May 9, 2024
@atalman atalman self-assigned this May 21, 2024