
Performance regression in negative binomial from 0.12 to 0.13 and onwards (at least for DeepAR in PyTorch) #3129

Open
timoschowski opened this issue Feb 17, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@timoschowski
Contributor

timoschowski commented Feb 17, 2024

Description

When loading the FOOD_3 subset of the M5 competition data (cut to 2016 only), I noticed that the performance of the negative binomial distribution changes, at least in DeepAR.

I suspect that something with the scaling is broken, but unfortunately I haven't been able to pin it down. When I look at the comparison between the two versions, I can't really tell whether the negative binomial distribution changed.

Any help here is appreciated. The problem continues up to the current version.

To Reproduce

The example isn't really minimal; it's based on a notebook for the M5 dataset, available in Colab (but it does not run in Colab because of a Lightning error there; it runs fine locally):

https://drive.google.com/file/d/1OOv_I7aAStgHW5iFuuKKB5r0qUW8BLxo/view?usp=sharing
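
In outline, the notebook does roughly this (a hedged sketch, not the actual notebook; the pickle filename, freq, prediction_length, and the NegativeBinomialOutput import path are placeholders/assumptions):

import pickle

from gluonts.torch.model.deepar import DeepAREstimator
from gluonts.torch.distributions import NegativeBinomialOutput  # 0.13+ import path (assumed)

# placeholder filename for the pickled FOOD_3 training data described below
with open("m5_food_3_2016.pkl", "rb") as f:
    train_ds = pickle.load(f)

estimator = DeepAREstimator(
    freq="D",                # assumed daily M5 frequency
    prediction_length=28,    # assumed M5 horizon
    distr_output=NegativeBinomialOutput(),
    trainer_kwargs=dict(max_epochs=15),
)
predictor = estimator.train(train_ds)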

Error message or code output


The output shown here is after 15 epochs of training DeepAR on bespoke data, where I've aggregated all forecasts and plotted them against aggregated actuals. In v0.12 this produced an OK result after 15 epochs (it improves considerably with more epochs):

[image: M5_FOOD_3_v0 12]

whereas in v0.13 this produces:
[image: M5_FOOD_3_v0 13]

Note however that there are small differences. The number of parameters is 76.3 K in v0.13 and in v0.12

The dataset I'm loading is a pickled dataset that includes dynamic features (but I think I'm ignoring them in both v0.12 and v0.13).

Environment

  • Operating system: macOS Monterey (ARM chip)
  • Python version: 3.9.9
  • GluonTS version: 0.12 vs 0.13 and upwards
  • MXNet version: NA (this is for PyTorch, v 1.13.1)
@timoschowski timoschowski added the bug Something isn't working label Feb 17, 2024
@timoschowski
Contributor Author

@kashif @lostella I mentioned this to you some time ago and @jgasthaus FYI

@lostella
Contributor

lostella commented Feb 18, 2024

@timoschowski inspecting the diff, one thing that changed is the PyTorch Lightning dependency, from pinned 1.5 to >= 1.5. It seems like 1.7 introduced the MPS backend (https://lightning.ai/pages/community/lightning-releases/pytorch-lightning-1-7-release/), which is one thing that might be causing trouble.

What version of lightning do you use?

Two options to check whether this MPS thing is to blame:

  1. pin lightning to 1.5 and see if it works better
  2. on whatever version of lightning you have, set trainer_kwargs = dict(accelerator="cpu") when constructing the estimator and see if it's better (a minimal sketch follows below)
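
Something like this for option 2 (a minimal sketch; freq and prediction_length are placeholders, not taken from your notebook):

from gluonts.torch.model.deepar import DeepAREstimator

estimator = DeepAREstimator(
    freq="D",              # placeholder
    prediction_length=28,  # placeholder
    trainer_kwargs=dict(accelerator="cpu"),  # keeps Lightning off MPS
)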

I don’t see other changes between the versions that could explain this.

@timoschowski
Contributor Author

timoschowski commented Feb 18, 2024

thanks @lostella, you're a wizard.

I have

import pytorch_lightning as pl
pl.__version__
'1.9.5'

when I do
"accelerator": "cpu"

the resulting output is still this:
[image: M5_FOOD_3_v0 12_cpu_accelerator]

however, when running the notebook with
!pip install -U "gluonts[torch]==0.13.0" matplotlib orjson tensorboard optuna datasets "pytorch-lightning==1.5"

results are like this for neg binomial, so indeed improved:
[image: M5_FOOD_3_v0 13_lighting1 5_15epochs]

and performance is in line also after more epochs (500 here, for v0.13 with lightning 1.5):
[image: M5_FOOD_3_v0 13_lighting1 5_500epochs]

compare with (500 here, for v0.12 with lightning 1.5):
[image: M5_FOOD_3_v0 12_lighting1 5_500epochs]

For the moment I have a workaround by pinning the lightning version, so that's great. Huge thanks.

A couple of interesting things remain:

  • for v0.14 of GluonTS, a lightning version greater than 1.5 is required, so I'm stuck on v0.13... any idea here?
  • One thing that stands out to me is that all the distribution code shifted around and the imports are different. Did we change anything in the neg binomial implementation? Performance with student_t is exactly the same between v0.12 and v0.13, independent of lightning, so I find that curious. It doesn't really show up in the diff, so I'm wondering if you had any intuition here (I remember discussions with @kashif about this in the past; a quick way to compare the installed code is sketched at the end of this comment).
  • why doesn't the notebook work on Colab? It seems like the model loading doesn't work.

Of course the overall performance isn't there yet (e.g. the peaks aren't aligned), but this is because I don't have any dynamic features included; I'll bring those back next.
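
On the import/implementation question above, a rough way to see which file the installed version actually uses (both import paths below are assumptions for the respective versions; run it in each environment and diff the printed source):

import inspect

try:
    # path in 0.13+ (assumed)
    from gluonts.torch.distributions import NegativeBinomialOutput
except ImportError:
    # older path (assumed)
    from gluonts.torch.modules.distribution_output import NegativeBinomialOutput

print(inspect.getsourcefile(NegativeBinomialOutput))
print(inspect.getsource(NegativeBinomialOutput))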

@timoschowski
Contributor Author

adding some thoughts here. After a suggestion by @kashif I also tried running the notebook with

!pip install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu

which gives me torch version:
'2.3.0.dev20240219'

However, the results are the same.

@timoschowski
Contributor Author

One thing I noted is that changing context_length from the default (prediction_length) to 2 * prediction_length has a substantial benefit here.
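
Roughly what I changed (freq and prediction_length are placeholders for the M5 daily setup):

from gluonts.torch.model.deepar import DeepAREstimator

prediction_length = 28  # placeholder M5 horizon
estimator = DeepAREstimator(
    freq="D",  # placeholder
    prediction_length=prediction_length,
    context_length=2 * prediction_length,  # default would be context_length == prediction_length
)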

@lostella
Contributor

for v0.14 of GluonTS, a lightning version greater than 1.5 is required, so I'm stuck on v0.13... any idea here?

No, this is an issue; we'll have to figure out what's wrong with recent lightning versions and make sure that everything runs smoothly. Also, the fact that setting accelerator="cpu" did not work makes me think this may not be a problem on Apple silicon only. Running the same on Linux with recent versions of lightning would answer that.
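
A quick way to see whether MPS is even in play on that machine (a hedged check, independent of GluonTS):

import torch

# True/True means the MPS backend is built and available; Lightning >= 1.7
# may auto-select it when no accelerator is specified
print(torch.backends.mps.is_built(), torch.backends.mps.is_available())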

Did we change anything with the neg binomial implementation?

I don't think so: this is the history of changes, and @kashif's change is the only thing that happened there. It's #2749, which was part of 0.13.0 already. It really seems like something weird is going on with training.

@timoschowski
Contributor Author

OK, I'm having trouble running the notebook in Colab. @kashif, is this something you could take a look at? It's about loading the models; something seems to be broken there.
