RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #44

Open
McC0dy opened this issue Jun 14, 2019 · 2 comments

McC0dy commented Jun 14, 2019

When training with any of the example configurations from the documentation, I get the following error:
"RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"

Reproducing
For example, running:
python main.py --network_type rnn --dataset wikitext

My system configuration
CUDA 10.1
Python 3.7.3
PyTorch 1.1.0
Arch Linux
GPU: RTX 2070

Other PyTorch applications work just fine.

Full output (from pipenv environment):

% python main.py --network_type rnn --dataset wikitext
2019-06-14 16:30:31,585:INFO::[*] Make directories : logs/wikitext_2019-06-14_16-30-31
2019-06-14 16:30:49,909:INFO::regularizing:
2019-06-14 16:30:54,743:INFO::# of parameters: 169,315,278
2019-06-14 16:30:54,834:INFO::[*] MODEL dir: logs/wikitext_2019-06-14_16-30-31
2019-06-14 16:30:54,834:INFO::[*] PARAM path: logs/wikitext_2019-06-14_16-30-31/params.json
Traceback (most recent call last):
  File "main.py", line 54, in <module>
    main(args)
  File "main.py", line 34, in main
    trnr.train()
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 222, in train
    self.train_shared(dag=dag)
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 305, in train_shared
    dags)
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 251, in get_loss
    output, hidden, extra_out = self.shared(inputs, dag, hidden=hidden)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/code/ENAS-pytorch/models/shared_rnn.py", line 235, in forward
    logit, hidden = self.cell(x_t, hidden, dag)
  File "/home/oliver/code/ENAS-pytorch/models/shared_rnn.py", line 354, in cell
    output = self.batch_norm(output)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Debugging
Inspecting the arguments passed to batch_norm, I found that input, weight, bias, running_mean, and running_var are all on the CUDA device, which is as expected.
The remaining arguments look reasonable as well.
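
The device check described above can be sketched as a small helper. This is only an illustration, not code from the repository: `devices_of` and `all_on_same_device` are hypothetical names, and the sketch assumes nothing more than that each object exposes a `.device` attribute, as PyTorch tensors do.

```python
from types import SimpleNamespace


def devices_of(named_tensors):
    """Map each name to the string form of its tensor's device.

    Entries that are None (e.g. a disabled affine weight) are reported as "none".
    """
    report = {}
    for name, tensor in named_tensors.items():
        report[name] = "none" if tensor is None else str(tensor.device)
    return report


def all_on_same_device(named_tensors):
    """Return True when every non-None tensor reports the same device."""
    devices = {d for d in devices_of(named_tensors).values() if d != "none"}
    return len(devices) <= 1


if __name__ == "__main__":
    # Stand-ins for real tensors; with PyTorch you would pass the actual
    # input, weight, bias, running_mean and running_var objects instead.
    names = ("input", "weight", "bias", "running_mean", "running_var")
    fake = {name: SimpleNamespace(device="cuda:0") for name in names}
    print(devices_of(fake))
    print(all_on_same_device(fake))
```

With a real model, the same dictionary could presumably be built from the batch-norm module's `weight`, `bias`, `running_mean` and `running_var` attributes just before the failing `self.batch_norm(output)` call.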

@lorenzoviva

I had the same problem. The PyTorch version most widely used by NAS-related GitHub repositories is 0.3.1, sometimes 0.2. I suggest trying a downgrade.

@carpedm20
Owner

I think you should use v0.3.1 (links), which was released on Feb 13, 2018, because my initial commit was on Feb 14, 2018.
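
A quick way to confirm which release is installed before training is to compare the version string against the suggested one. The sketch below is an assumption-laden illustration rather than part of the project: `version_tuple` and `matches_expected` are hypothetical helpers, and the default of `0.3.1` simply mirrors the version suggested in this thread.

```python
import re


def version_tuple(version):
    """Parse the leading 'major.minor[.patch]' of a version string into ints.

    Suffixes such as '+cu101' or '.post2' are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def matches_expected(installed, expected="0.3.1"):
    """Return True when the installed release equals the expected one."""
    return version_tuple(installed) == version_tuple(expected)


if __name__ == "__main__":
    # With PyTorch installed, pass torch.__version__ instead of a literal.
    for installed in ("1.1.0", "0.3.1", "0.3.1.post2"):
        print(installed, matches_expected(installed))
```

Calling `matches_expected(torch.__version__)` at the top of `main.py` would be one place to use it, so that a mismatch is reported before the model is built.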
