
train_tacotron.py: Random CUBLAS_STATUS_INTERNAL_ERROR #216

Open
serg06 opened this issue Nov 6, 2020 · 5 comments

Comments

@serg06 commented Nov 6, 2020

Occasionally when training Tacotron (train_tacotron.py), CUDA throws an error and kills the training.

| Epoch: 167/1630 (15/45) | Loss: 0.3459 | 1.1 steps/s | Step: 284k |
Traceback (most recent call last):
  File "train_tacotron.py", line 204, in <module>
    main()
  File "train_tacotron.py", line 100, in main
    tts_train_loop(paths, model, optimizer, train_set, lr, training_steps, attn_example)
  File "train_tacotron.py", line 144, in tts_train_loop
    loss.backward()
  File "C:\Python37\lib\site-packages\torch\tensor.py", line 227, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Python37\lib\site-packages\torch\autograd\__init__.py", line 138, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I don't know why this happens; it seems almost random. Sometimes it happens 12 hours after starting, sometimes only 15 minutes in.

@eonglints commented Nov 30, 2020

I'm seeing the same thing. Did you find a fix? Also, are you able to pick up training from where you left off?
I'm on CUDA 10.1, Windows 10, PyTorch 1.7.

@serg06 (Author) commented Nov 30, 2020

> I'm seeing the same thing. Did you find a fix?

I didn't find a fix, but I did find a workaround: automatically restarting after a crash.

train.bat:

:loop
REM Run training; when the process exits (e.g. after a CUBLAS_STATUS_INTERNAL_ERROR crash), fall through and restart.
python train_tacotron.py
echo Crash detected, restarting...
REM Wait 5 seconds before relaunching.
timeout /t 5 /nobreak
goto loop
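
(For non-Windows setups, a rough Python equivalent could look like the sketch below. The script name and 5-second delay are just carried over from the .bat above, and unlike the .bat it stops looping on a clean exit; it's only a sketch, not something from this repo.)

# restart_on_crash.py - relaunch training whenever the process dies with a non-zero exit code.
import subprocess
import sys
import time

while True:
    result = subprocess.run([sys.executable, "train_tacotron.py"])
    if result.returncode == 0:
        break  # clean exit: training finished, stop looping
    print(f"Crash detected (exit code {result.returncode}), restarting in 5 seconds...")
    time.sleep(5)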

> Also, are you able to pick up training from where you left off?

Yep, it always restarts from the latest step for me.
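
(For anyone wondering why resuming works: the training script saves checkpoints as it goes and reloads the latest one on startup. The sketch below shows the generic PyTorch save/resume pattern, not the repo's actual code; the checkpoint path and the tiny stand-in model are placeholders.)

# Generic PyTorch checkpoint save/resume sketch (placeholder path and model, not the actual WaveRNN code).
import os
import torch

CKPT = "checkpoints/tacotron_latest.pt"

model = torch.nn.Linear(4, 4)                    # stand-in for the Tacotron model
optimizer = torch.optim.Adam(model.parameters())
step = 0

# On startup, resume from the latest checkpoint if one exists.
if os.path.exists(CKPT):
    ckpt = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    step = ckpt["step"]

# ... training loop runs here, incrementing `step` ...

# Save periodically so a crash only loses progress since the last checkpoint.
os.makedirs(os.path.dirname(CKPT), exist_ok=True)
torch.save({"step": step, "model": model.state_dict(),
            "optimizer": optimizer.state_dict()}, CKPT)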

@eonglints commented

Nice, that's a good solution. And yeah, I found out this morning that it picks up from where it left off very well. How many steps did you leave it to train for? I'm on 100k so far and will probably let it run until close to a million, I guess.

@serg06 (Author) commented Nov 30, 2020

> Nice, that's a good solution. And yeah, I found out this morning that it picks up from where it left off very well. How many steps did you leave it to train for? I'm on 100k so far and will probably let it run until close to a million, I guess.

I just followed in this guy's footsteps and fine-tuned the pre-trained model on my own data. I tried going up to 300k steps, but I found it starts getting worse after ~260k. I don't think I ever tried training it from scratch.

1 million steps? Wow, that would take quite a while on my hardware. Can I ask what GPU you're using and how fast your training goes?

@eonglints commented

Thanks. I was going to try fine-tuning, but I have a 17.5-hour dataset, so I thought I'd just train from scratch since it's not too much smaller than the LJ Speech dataset. I'm at 102k steps and have been training off and on for the last 8 hours. However, there's been a fair bit of downtime messing with batch sizes to try to avoid memory crashes, so really only around 5-6 hours of actual training. I'm using a 2080 Super, and with a batch size of 32 I'm getting around 4-5 steps/second.
Also, it turns out I'm actually on CUDA 11.1.
