Data loader bottlenecking training #51
Comments
Hi @JakobLindscheid! Thanks for the detailed issue! I was not aware of this, as I never checked the data loading speed in my experiments. Can I check this on my end and get back to you soon?
Sure, thank you for having a look!
That's useful to know, thanks for sharing.
Original issue
Hi,
Thank you for publishing the pretraining and finetuning scripts! They are really helpful.
For a university project, we are trying to reproduce the results from the paper. However, when running the pretrain script we observe very slow training (~1 minute per epoch) on our hardware.
Running the PyTorch profiler for 16 training batches, we see the following:
FIT Profiler Report (relevant lines)
Apparently the data loader needs ~5 seconds per batch, which is 84% of the total time of a training step.
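For reference, a minimal sketch of how such a measurement can be reproduced with torch.profiler; the pretrain script presumably goes through Lightning's profiler="pytorch" option (which is what prints a "FIT Profiler Report"), and the model and batches below are toy stand-ins:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Toy stand-ins for the real model and pre-fetched batches.
model = torch.nn.Linear(32, 1)
batches = [(torch.randn(512, 32), torch.randn(512, 1)) for _ in range(16)]

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for x, y in batches:
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()

# The most expensive ops (in the real run, the data loader's next-batch
# call among them) surface at the top of this table.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```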
After some further investigation, we found that for each item in a batch, the train data loader applies the transformation chain to the full time series, and most of the transformed data is then discarded. We observed ~10 ms for transforming a full time series; with a batch size of 512, that adds up to the >5 seconds per batch reported by the profiler.
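To make that cost concrete, here is a small self-contained sketch (NumPy only, with made-up sizes; `transform` is a stand-in for the real transformation chain) comparing transform-then-crop, the pattern described above, with crop-then-transform:

```python
import time

import numpy as np

series = np.random.randn(1_000_000)  # one long time series
window = 512                         # context + prediction length
batch_size = 512

def transform(x):
    # Stand-in for the per-item transformation chain (scaling, features, ...).
    return (x - x.mean()) / (x.std() + 1e-8)

# Pattern described above: transform the full series, then keep one window.
t0 = time.perf_counter()
for _ in range(batch_size):
    start = np.random.randint(0, len(series) - window)
    sample = transform(series)[start : start + window]
t_full = time.perf_counter() - t0

# Cheaper ordering: crop the window first, transform only what is used.
t0 = time.perf_counter()
for _ in range(batch_size):
    start = np.random.randint(0, len(series) - window)
    sample = transform(series[start : start + window])
t_crop = time.perf_counter() - t0

print(f"transform-then-crop: {t_full:.2f}s  crop-then-transform: {t_crop:.3f}s")
```

Cropping first is not a drop-in replacement, since transformations that depend on whole-series statistics would change behaviour, but it illustrates that the per-batch cost scales with series length rather than window length.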
The order of execution is partly dictated by the gluonts package, so I am not aware of an obvious solution that does not involve changes there.
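One possible mitigation, sketched below under the assumption that the expensive transformations are deterministic per series: apply them once eagerly and cache the result with gluonts.itertools.Cached, so that only the random instance splitting runs per batch. The dataset, field names, and transformations here are illustrative placeholders, not the ones from the pretrain script:

```python
import numpy as np
from gluonts.dataset.common import ListDataset
from gluonts.itertools import Cached
from gluonts.transform import (
    AddAgeFeature,
    Chain,
    ExpectedNumInstanceSampler,
    InstanceSplitter,
)

# Toy dataset standing in for the real training data.
dataset = ListDataset(
    [{"start": "2020-01-01", "target": np.random.randn(10_000)}],
    freq="H",
)

# Deterministic per-series transformations: the expensive part that the
# loader currently re-runs for every sampled item.
deterministic = Chain(
    [AddAgeFeature(target_field="target", output_field="age", pred_length=24)]
)

# Apply them once and keep the results in memory.
cached = Cached(deterministic.apply(dataset, is_train=True))

# Only the cheap, random window sampling then runs per item.
splitter = InstanceSplitter(
    target_field="target",
    is_pad_field="is_pad",
    start_field="start",
    forecast_start_field="forecast_start",
    instance_sampler=ExpectedNumInstanceSampler(num_instances=1, min_future=24),
    past_length=512,
    future_length=24,
    time_series_fields=["age"],
)
train_instances = splitter.apply(cached, is_train=True)
```

Whether this split is possible depends on how the transformation chain in the pretrain script is composed, so treat it as a direction rather than a fix.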
Now to my question: did you face the same issue during your experiments? How can we solve the problem we are observing?