Notes during memory use reduction #5

Open · xloem wants to merge 2 commits into main
Conversation

@xloem commented Dec 19, 2022

Hi,

For fun, kind of, I'm poking at reducing the memory usage of the standalone example so that more of it can run on my lower-end system.

I've only skimmed the paper so far, but while reading the code I've run into a few small points of confusion and questions about the implementation, so I'm opening this pull request to connect a little.

I'll answer these questions myself when and if I find the answers.

Questions:

  • I noticed that the kernel norm is detached from the gradient graph and cached between runs. How come this doesn't result in a stale norm as the kernels keep updating during training? (See the first sketch after this list.)

  • I noticed that the call to F.interpolate does not specify align_corners, leaving it at the default of False. With linear interpolation this looks to me like it can leave flat artefacts at the edges and stretch the content between them by a fraction of a sample (see the second sketch after this list). Does this matter? My intuition would have been to do linear interpolation by dropping the last sample or wrapping around to the first. In my changes, I had to add a constant of 1 to the input size to get the same interpolation output for truncated kernels.

  • Why is it helpful to scale the weights of the kernels by their distance? Wouldn't the training process learn this scaling itself?

  • Small changes to the backend can result in small (on the order of 1/100th) changes to outputs unless a lot of care is taken. How important is that kind of numerical stability?
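
For concreteness, here is a minimal sketch of the caching pattern I mean in the first question; the names are mine and this is not the repository's actual code:

```python
import torch

# Illustrative only: a norm detached from the graph and cached once ...
kernel = torch.nn.Parameter(torch.randn(256))
cached_norm = kernel.detach().norm()

# ... while the optimiser keeps updating `kernel`, so after some training
# steps the cached value no longer matches kernel.norm() recomputed from
# the live weights.
```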

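And a toy illustration of the edge behaviour from the second question (my own 1-D example, not the repository's code):

```python
import torch
import torch.nn.functional as F

x = torch.arange(4.0).reshape(1, 1, 4)  # (batch, channels, length)

# Default behaviour (align_corners=False): boundary samples act like cell
# centres, so the first and last outputs repeat the edge values and the
# interior is effectively stretched by a fraction of a sample.
print(F.interpolate(x, size=8, mode="linear", align_corners=False))
# tensor([[[0.0000, 0.2500, 0.7500, 1.2500, 1.7500, 2.2500, 2.7500, 3.0000]]])

# align_corners=True maps the first and last samples onto each other exactly.
print(F.interpolate(x, size=8, mode="linear", align_corners=True))
# tensor([[[0.0000, 0.4286, 0.8571, 1.2857, 1.7143, 2.1429, 2.5714, 3.0000]]])
```
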
Resolved:

  • I now understand that passing n=2*L to the FFT is there to avoid the convolution being circular (this took me some learning); a quick check is sketched below.
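
A toy check of that point (my own example, not the repo's code): a length-2*L FFT reproduces the linear convolution, while a length-L FFT wraps the tail around.

```python
import torch

L = 4
u = torch.randn(L)
k = torch.randn(L)

# Length-L FFT: circular convolution, the tail wraps back onto the start.
circ = torch.fft.irfft(torch.fft.rfft(u, n=L) * torch.fft.rfft(k, n=L), n=L)

# Length-2L FFT: the zero padding leaves room for the tail, giving the
# (truncated) linear convolution instead.
lin = torch.fft.irfft(torch.fft.rfft(u, n=2 * L) * torch.fft.rfft(k, n=2 * L), n=2 * L)[:L]

# Reference: direct linear convolution, keeping the first L samples.
ref = torch.stack([sum(u[j] * k[i - j] for j in range(i + 1)) for i in range(L)])

print(torch.allclose(lin, ref, atol=1e-5))   # True
print(torch.allclose(circ, ref, atol=1e-5))  # False in general
```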

Idea:
Given the principles of this algorithm, it looks to me like it might be possible to run it with very little RAM by using the FFT to perform the kernel interpolation in frequency space and streaming the convolution (rough sketch below).
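
As a rough illustration of the streaming half of that idea, here is an overlap-add sketch. The function name and block size are my own, and it assumes the time-domain kernel is already materialised as a 1-D tensor, so it only covers the convolution, not the frequency-space interpolation:

```python
import torch

def overlap_add_conv(signal: torch.Tensor, kernel: torch.Tensor, block: int = 4096) -> torch.Tensor:
    """Linear convolution of two 1-D tensors, computed block by block, so no
    FFT over the whole sequence is ever held in memory at once. Sketch only."""
    L = kernel.shape[-1]
    n = block + L - 1                       # per-block FFT size (avoids wrap-around)
    k_f = torch.fft.rfft(kernel, n=n)       # kernel spectrum, computed once
    out = torch.zeros(signal.shape[-1] + L - 1, dtype=signal.dtype)
    for start in range(0, signal.shape[-1], block):
        chunk = signal[start:start + block]
        y = torch.fft.irfft(torch.fft.rfft(chunk, n=n) * k_f, n=n)
        stop = min(start + n, out.shape[-1])
        out[start:stop] += y[: stop - start]  # overlap-add into the output
    return out
```

Memory then scales with the block size rather than the sequence length; interpolating the kernel in frequency space would still need its own treatment.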

@xloem marked this pull request as ready for review December 20, 2022 01:32
@xloem (Author) commented Dec 20, 2022

This is what I got done today. I'm kind of a crazy flake, so who knows what tomorrow holds, but here are some changes for lower RAM usage if they're of interest.
