Performance regression in negative binomial from 0.12 to 0.13 and onwards (at least for DeepAR in PyTorch) #3129
Comments
@kashif @lostella I mentioned this to you some time ago, and @jgasthaus FYI.
@timoschowski inspecting the diff, one thing that changed is the dependency on PyTorch Lightning: from 1.5 to >= 1.5. It seems like 1.7 introduced the MPS backend https://lightning.ai/pages/community/lightning-releases/pytorch-lightning-1-7-release/ which is one thing that might be causing trouble. What version of lightning do you use? Two options to check whether this MPS thing is to blame:
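In case it helps, here is a rough sketch (my own, not necessarily the two options referred to above) of how one could check the installed Lightning version and rule out the MPS backend by forcing the CPU accelerator through `trainer_kwargs`; the `freq` and `prediction_length` values are placeholders, not the notebook's settings:

```python
import torch
import pytorch_lightning as pl

print("lightning version:", pl.__version__)
# Lightning >= 1.7 can pick the Apple MPS accelerator automatically on Apple Silicon.
mps_available = (
    getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available()
)
print("MPS available:", mps_available)

# One way to rule MPS out: force CPU training via the estimator's trainer_kwargs
# (DeepAREstimator here is the PyTorch one from gluonts.torch).
from gluonts.torch import DeepAREstimator

estimator = DeepAREstimator(
    freq="D",                 # placeholder frequency
    prediction_length=28,     # placeholder horizon
    trainer_kwargs={"accelerator": "cpu", "max_epochs": 15},
)
```

If results recover with the CPU accelerator (or with Lightning pinned back to 1.5), that would point at the newer accelerator selection rather than the model code.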
I don’t see other changes between the versions that could explain this.
thanks @lostella, you're a wizard. I have
when I do, the resulting output is still this:

However, when running the notebook with

results are like this for neg binomial, so indeed improved:

and performance is in line also after more epochs (500 here for v0.13 with lightning 1.5)

compared with (500 here for v0.12 with lightning 1.5).

For the moment I have a workaround by pinning the lightning version, so that's great. Huge thanks. A couple of interesting things remain:
Of course the overall performance isn't there yet (e.g. peaks aren't aligned), but this is because I don't have any dynamic features included; I will bring that back next.
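For reference, the pinning workaround mentioned above could look roughly like this (a sketch; the exact 1.5.x pin is an assumption based on the versions discussed in this thread):

```python
# Workaround sketch: keep PyTorch Lightning on the 1.5 series,
# e.g. installed with:  pip install "pytorch-lightning~=1.5.0"
import pytorch_lightning as pl

# Sanity check that the pinned version is the one actually being imported.
assert pl.__version__.startswith("1.5"), f"unexpected lightning version: {pl.__version__}"
print("lightning version:", pl.__version__)
```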
Adding some thoughts here. After a suggestion by @kashif, I also tried running the notebook with
which gives me torch version: however, results are the same.
One thing I noted is that changing
No, this is an issue; we'll have to figure out what's wrong with recent lightning versions and make sure that everything runs smoothly. Also, given that setting
I don't think so: this is the history of changes, and @kashif's change is the only thing that happened. It's #2749, which was part of 0.13.0 already. It really seems like something weird is going on with training.
Ok, I have trouble running the notebook in Colab. @kashif, is this something that you could take a look at? This is about loading the models; something seems to be broken there...
Description
When loading the FOOD_3 subset of the M5 competition (cut to only data in 2016), I noticed that the performance of the negative binomial distribution regresses between v0.12 and v0.13, at least in DeepAR.
I suspect that something with the scaling is broken, but I haven't been able to pin it down unfortunately. When I look at the comparison between the two versions I can't really tell whether the negative binomial distribution changed.
Any help here is appreciated. The problem persists up to the current version.
To Reproduce
The example isn't really minimal; it's based on a notebook using the M5 data set, available in Colab (but it doesn't run in Colab because of a lightning error with Colab; it runs locally):
https://drive.google.com/file/d/1OOv_I7aAStgHW5iFuuKKB5r0qUW8BLxo/view?usp=sharing
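For readers who can't open the notebook, the core of the setup is roughly the following (a sketch only: the file name, frequency, and prediction length are assumptions, the import paths may differ slightly between gluonts versions, and the real notebook also contains dynamic features and plotting code):

```python
import pickle

from gluonts.torch import DeepAREstimator
from gluonts.torch.distributions import NegativeBinomialOutput

# Hypothetical pickled dataset: a list of {"start": ..., "target": ...} entries
# for the FOOD_3 series, cut to 2016.
with open("food_3_2016.pkl", "rb") as f:
    train_data = pickle.load(f)

estimator = DeepAREstimator(
    freq="D",                          # placeholder frequency
    prediction_length=28,              # placeholder horizon
    distr_output=NegativeBinomialOutput(),
    trainer_kwargs={"max_epochs": 15},
)
predictor = estimator.train(train_data)
```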
Error message or code output
The output shown here is after 15 epochs of training DeepAR on bespoke data, where I've aggregated all forecasts and plotted them against aggregated actuals. In v0.12 this produced an OK result after 15 epochs (it improves considerably with more epochs):
whereas in v0.13 this produces
Note however that there are small differences. The number of parameters is 76.3 K in v0.13 and in v0.12
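If useful for comparing the two installs, the parameter count can also be checked directly on the module the estimator builds, independent of the Lightning progress output (a sketch; `create_lightning_module` is the method I'd expect on the PyTorch estimators, but the exact API may differ between versions):

```python
# Count trainable parameters of the network the estimator would train,
# so the 76.3 K figure can be compared across gluonts versions.
module = estimator.create_lightning_module()
n_params = sum(p.numel() for p in module.parameters() if p.requires_grad)
print(f"trainable parameters: {n_params / 1e3:.1f} K")
```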
The dataset that I'm loading is a pickled dataset which includes dynamic features (but I think I'm ignoring them in both v0.12 and v0.13).
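The aggregation described above (summing forecasts over all series and comparing against summed actuals) could be done roughly like this; `predictor` and `test_data` are assumed to come from a setup like the sketch in the To Reproduce section, not from the notebook itself:

```python
import numpy as np
import matplotlib.pyplot as plt
from gluonts.evaluation import make_evaluation_predictions

forecast_it, ts_it = make_evaluation_predictions(test_data, predictor=predictor, num_samples=100)
forecasts, series = list(forecast_it), list(ts_it)

pred_len = forecasts[0].prediction_length
# Sum the median forecast and the actuals over all series in the dataset.
agg_forecast = np.sum([f.quantile(0.5) for f in forecasts], axis=0)
agg_actual = np.sum([s.to_numpy().squeeze()[-pred_len:] for s in series], axis=0)

plt.plot(agg_actual, label="aggregated actuals")
plt.plot(agg_forecast, label="aggregated forecast (median)")
plt.legend()
plt.show()
```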
Environment