nan in SRU output #185
hi @ksopyla Thank you for trying SRU.
Re: speed-up. It is hard to diagnose the speed without knowing what the task is and how the training is implemented. SRU usually runs significantly faster when each forward() call takes multiple tokens and multiple sequences at once (instead of one token per sequence per forward() call).
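To make the batching point concrete, here is a minimal sketch (shapes and sizes are illustrative only, not taken from this issue) contrasting one forward() call over a whole (length, batch, input_size) tensor with a token-by-token loop:

```python
import torch
from sru import SRU

rnn = SRU(input_size=128, hidden_size=128, num_layers=2)

# Fast path: one forward() call over the full sequence batch.
# SRU expects input of shape (length, batch, input_size).
x = torch.randn(50, 32, 128)          # 50 tokens, 32 sequences
output, state = rnn(x)                # all tokens processed in a single call

# Slow path: one token per forward() call (what to avoid when possible)
state = None
for t in range(x.size(0)):
    step_out, state = rnn(x[t:t + 1], state)
```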
Thanks @taoleicn for your answer and for the great work!
FYI, I switched to
hi @nicolaspanel, thank you!
After more investigations,
of course 👍
I'm using SRU with
Yes, using built-in
Since yesterday I have found a way to reproduce and investigate by generating a model checkpoint as soon as the loss contains a nan.

Point of interest 1: only one sample has
Point of interest 2: checkpoint parameters are fine (no infinite/nan values)

When I execute the following code:
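A minimal sketch of this kind of checkpoint check, assuming a standard PyTorch state dict (the model, checkpoint path, and batch below are hypothetical placeholders, not the actual objects from this setup):

```python
import torch

def check_checkpoint(model, checkpoint_path, batch):
    """Reload a checkpoint saved when the loss first became nan, re-run a
    forward pass, and report whether parameters and outputs are finite.
    (Sketch only: model, checkpoint_path and batch are placeholders.)"""
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    model.eval()

    # the parameters can be inspected independently of the forward pass
    params_finite = all(torch.isfinite(p).all() for p in model.parameters())

    with torch.no_grad():
        output = model(batch)
    if isinstance(output, (tuple, list)):
        output = output[0]
    outputs_finite = bool(torch.isfinite(output).all())

    return params_finite, outputs_finite
```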
Point of interest 3:
Point of interest 4: all outputs are finite if
Point of interest 5: no
Point of interest 6: hidden feature values have some very large values (+100, +200, -95, etc…), especially in the last layer

Possible workaround: I will try to clip values to stay in range [-10; 10]

Hope this helps
Thank you @nicolaspanel! Would disabling
I've lately found
Re: point of interest 5. Do you mean training works fine if layer norm is not used? Do you have zero-vector inputs in
👍
I ran experiments with
I didn't try with
BTW I did some experiments clipping values to make sure they stay in range [-10, 10]. It works just fine (so far at least).
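One way such clipping could be implemented is a thin wrapper around the SRU stack that clamps its output; this is a sketch only, assuming the public SRU constructor, not the exact code used in these experiments:

```python
import torch
from sru import SRU

class ClampedSRU(torch.nn.Module):
    """SRU stack whose output features are clamped to a fixed range,
    mirroring the [-10, 10] clipping discussed above (sketch only)."""
    def __init__(self, input_size, hidden_size, clip=10.0, **kwargs):
        super().__init__()
        self.rnn = SRU(input_size, hidden_size, **kwargs)
        self.clip = clip

    def forward(self, x, state=None):
        output, state = self.rnn(x, state)
        # keep the hidden features in a range that is safe for float16
        return torch.clamp(output, -self.clip, self.clip), state
```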
No, I tried disabling
The fact that I do not have
Yes
hi @nicolaspanel
Are you using
There is also an
Is the horizontal axis the index of the layer (in the picture you sent earlier)? Does it show that
I use https://github.com/LiyuanLucasLiu/RAdam
I haven't used value clipping before either, but in this case it helped (no more nan).
I will stick to value clipping because the only goal is to prevent overflows in the mixed-precision setup, not to add an extra activation.
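For reference, float16 only covers magnitudes up to roughly 65504, so hidden features that grow into the hundreds can overflow once they are multiplied together; a tiny illustration (not code from this thread):

```python
import torch

x = torch.tensor([300.0], dtype=torch.float16)
print(x * x)          # overflows to inf: 300 * 300 = 90000 > 65504 (float16 max)
print(x * x - x * x)  # inf - inf yields nan
```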
No. The horizontal axis is the «feature» axis. The picture displays the network's
No, I don't think so.

@ksopyla is it possible for you to check your network's intermediate representations to see if they contain very high absolute values (greater than +100 or lower than -100, for example)?

Best regards
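One possible way to do that check is with forward hooks that report any submodule whose output contains large absolute values (a diagnostic sketch, assuming a standard PyTorch model; the threshold is arbitrary):

```python
import torch

def watch_large_activations(model, threshold=100.0):
    """Attach forward hooks that print the name of any submodule whose
    output contains values with absolute magnitude above `threshold`."""
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and t.is_floating_point():
                    max_abs = t.detach().abs().max().item()
                    if max_abs > threshold:
                        print(f"{name}: max |activation| = {max_abs:.1f}")
        return hook

    # keep the handles so the hooks can be removed later with handle.remove()
    return [m.register_forward_hook(make_hook(n))
            for n, m in model.named_modules() if n]
```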
@nicolaspanel how can you run SRU with CUDA 11.1? I also have an RTX 3090, but I get an error (--generate-dependencies-with-compile). It seems sru_cuda_kernel.cu only builds with CUDA 10.2. Did you have the same problem? Can you share how you fixed it?
I am trying to train an RNN network (seq2seq) with GRU and SRU cells. When training with GRU everything is fine: the loss decreases and accuracy steadily rises. But when I switch to SRU, after a few hours I get NaN in the loss, and the norm of the network parameters (hidden states, weight matrices) is NaN.
I use the https://github.com/asappresearch/sru/tree/3.0.0-dev branch and train on 2 GeForce 3090 GPUs with CUDA 11.1, PyTorch 1.8 and PyTorch Lightning.
Could you point me in the right direction for diagnosing this? Is it a bug in SRU itself, something GPU-related, or CUDA?
SRU is defined
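A minimal sketch of a typical SRU instantiation for such an encoder (the hyper-parameters below are placeholders, not the configuration actually used here):

```python
from sru import SRU

encoder_rnn = SRU(
    input_size=256,      # placeholder embedding size
    hidden_size=512,     # placeholder hidden size
    num_layers=4,
    dropout=0.1,
    bidirectional=True,
    layer_norm=True,     # layer normalization on each layer's output
)
```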
I have also noticed a lack of speedup relative to GRU; GRU is even faster (2.6 it/s vs 1.4 it/s).