Data loader bottlenecking training #51

JakobLindscheid · 2024-04-18T19:31:01Z

Hi,
Thank you for publishing the pretraining and finetuning scripts! They are really helpful.
For a university project, we are trying to reproduce the results from the paper. However, running the pretrain script, we observe very slow training speeds (~1 minute per epoch) on our hardware.
Running the PyTorch profiler for 16 training batches, we see the following:

FIT Profiler Report (relevant lines)

Action	Mean duration (s)	Num calls	Total time (s)	Percentage %
Total	-	1397	99.734	100 %
run_training_epoch	91.657	1	91.657	91.901
[_TrainingEpochLoop].train_dataloader_next	5.2931	16	84.689	84.915
[_EvaluationLoop].val_next	0.246	19	4.674	4.6865
[LightningModule]LagLlamaLightningModule.optimizer_step	0.10731	16	1.717	1.7216
run_training_batch	0.10731	16	1.717	1.7216
[Strategy]SingleDeviceStrategy.training_step	0.091875	16	1.47	1.4739
[Strategy]SingleDeviceStrategy.validation_step	0.044368	19	0.843	0.84525
[Strategy]SingleDeviceStrategy.backward	0.0135	16	0.216	0.21658
[Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}.on_train_epoch_end	0.141	1	0.141	0.14138
[Callback]ModelCheckpoint{'monitor': 'val_loss', 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}.on_train_epoch_end	0.093	1	0.093	0.093248
[LightningModule]LagLlamaLightningModule.transfer_batch_to_device	0.0022286	35	0.078	0.078208
[Strategy]SingleDeviceStrategy.batch_to_device	0.0022286	35	0.078	0.078208
[LightningModule]LagLlamaLightningModule.on_validation_model_train	0.008	2	0.016	0.016043
[Callback]ModelSummary.on_fit_start	0.015	1	0.015	0.01504
[Callback]TQDMProgressBar.on_validation_batch_end	0.00078947	19	0.015	0.01504
[LightningModule]LagLlamaLightningModule.optimizer_zero_grad	0.0009375	16	0.015	0.01504

Apparently the data loader needs about 5.3 seconds for each batch, which accounts for roughly 85% of the total profiled time.
After some further investigation, we found that the train data loader does the following:

1. Apply the transformation to a full time series.
2. Sample a window from the transformed data (inside the InstanceSplitter).
3. Extract the window from the transformed data (InstanceSplitter).
4. Create the batches of data according to the batch size.

This means a full time series gets transformed, and most of the transformed data is then discarded. This happens for each item in a batch. We measured ~10 ms to transform a full time series, so with a batch size of 512 that is 512 × 10 ms ≈ 5.1 s per batch, which matches the >5 seconds reported by the profiler.
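To make the cost argument concrete, here is a minimal sketch of the pattern described above. It does not use gluonts; `transform`, `sample_window`, and `estimated_batch_seconds` are hypothetical stand-ins, and the 10 ms and 512 figures are the ones measured in this issue.

```python
import random


def transform(series):
    # Hypothetical stand-in for the per-series transformation chain.
    # It always processes the full series, regardless of what is used later.
    return [x * 2.0 for x in series]


def sample_window(series, window_length, rng):
    # Transform the whole series first, then keep only one window of it.
    transformed = transform(series)
    start = rng.randrange(len(transformed) - window_length + 1)
    return transformed[start:start + window_length]


def used_fraction(series_length, window_length):
    # Share of the transformed values that actually ends up in the sample.
    return window_length / series_length


def estimated_batch_seconds(batch_size, seconds_per_transform):
    # One full-series transform per batch item.
    return batch_size * seconds_per_transform


rng = random.Random(0)
window = sample_window(list(range(1000)), 32, rng)
print(len(window))                          # window length kept per sample
print(used_fraction(1000, 32))              # fraction of transformed data used
print(estimated_batch_seconds(512, 0.010))  # about 5.12 s per batch
```

The series length of 1000 and window length of 32 are illustrative only. The point is that the per-batch cost scales with the full series length, not with the window that is actually used.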

The order of execution is partly determined by the gluonts package, so I am not aware of an obvious solution that doesn't involve addressing it there.

Now my question: did you face the same issue during your experiments, and how can we solve the problem we observe?

ashok-arjun · 2024-04-21T23:37:43Z

Hi @JakobLindscheid !

Thanks for the detailed issue!

I was not aware of this issue as I never checked the data loading speed in my experiments.

Can I check this on my end and get back to you soon?

JakobLindscheid · 2024-04-22T11:28:10Z

Sure, thank you for having a look!
For now, I added the line data = list(data) before the instance splitter is applied. This forces the transformation to run once, before training starts. It's not the nicest solution, since it takes a few minutes before training starts, but the total training time improves a lot.
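The effect of the list(data) workaround can be illustrated with a small sketch. This is not the gluonts API: `LazyTransformed` and `expensive_transform` are hypothetical stand-ins for a dataset that re-applies its transformation on every iteration, which is the behaviour assumed here.

```python
class LazyTransformed:
    """Hypothetical lazy dataset: re-applies `fn` to every item on each pass."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __iter__(self):
        for item in self.source:
            yield self.fn(item)


calls = {"n": 0}


def expensive_transform(series):
    # Count invocations so the two strategies can be compared.
    calls["n"] += 1
    return [x + 1 for x in series]


raw = [[1, 2, 3], [4, 5, 6]]
lazy = LazyTransformed(raw, expensive_transform)

# Lazy: every pass over the data re-runs the transform on every series.
for _ in range(3):
    for _ in lazy:
        pass
lazy_calls = calls["n"]

# Materialized: list() runs the transform once per series up front,
# and later passes reuse the stored results.
calls["n"] = 0
data = list(lazy)
for _ in range(3):
    for _ in data:
        pass
cached_calls = calls["n"]

print(lazy_calls, cached_calls)
```

The trade-off is the one noted above: the upfront pass delays the start of training, and the materialized list keeps every transformed series in memory.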

ashok-arjun · 2024-04-22T19:10:10Z

That's useful to know, thanks for sharing.
