
ci: Improve caching #3211

Draft
wants to merge 5 commits into main
Conversation

@wolfgangwalther (Member) commented Feb 6, 2024

Plenty of text in the commit messages. Hopefully this also fixes the currently failing loadtest, but I'm not sure about that.


TODO:

  • Save nix cache only on main
  • Split stack/cabal caches into dependency and working directory caches
  • Reduce cache size
  • Add PR caching for postgrest-build
  • Garbage collect nix/stack/cabal caches?
  • Purge old caches proactively when saving new cache?

@wolfgangwalther (Member Author) commented Feb 6, 2024

Indeed, loadtest is fixed that way. Not sure whether that's a one-time thing or not, but still a good sign.

A few more observations, looking at https://github.com/PostgREST/postgrest/actions/caches. Since I'm not sure whether that's publicly available, some key facts:

  • There's a big warning on top, that reads: "Approaching total cache storage limit (53.35 GB of 10 GB Used). Least recently used caches will be automatically evicted to limit the total cache storage to 10 GB."
  • Those 53 GB have been created since yesterday, so that's not going to be sustainable.
  • macOS-all: 3.1 GB
  • Linux-test: 2.7 GB
  • Linux-style: 1.4 GB
  • Linux-common: 2.7 GB
  • Stack/cabal caches between 300 MB and 1 GB each

test and style are the new split in this PR. common is the old cache from before. Clearly, splitting test and style does not make any difference in terms of cache size - everything in style is also in test. On the flip side, merging the loadtest into it is great; the cache is not really bigger.

Overall, we only have 10 GB, so we need to use the cache a lot less. I think we should do the following:

  • Only store caches on the main branch.
  • Restore caches on main and in PRs.
  • The caches will have fixed names and will be replaced by a new commit on main. No prefix matching.
  • Merge style and test again.

This way we can use the 10 GB we have for:

  • 3.1 GB for macos
  • 2.7 GB for tools
  • 2.3 GB for stack
  • 1 GB for Cabal
  • Sum: 9.1 GB

By only storing caches when on main we avoid having the main cache be evicted when a PR changes dependencies. This will make CI fast for those PRs which only touch code, those will use the cache. PRs which touch dependencies or nix will be slower, because they only use cachix. This seems like the most useful thing to do here.
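In workflow terms, this is roughly the split restore/save pattern below (paths and key names are placeholders, not our actual ones):

```yaml
# Sketch only: restore on every ref, save on main alone.
- name: Restore cache
  uses: actions/cache/restore@v4
  with:
    path: ~/.stack            # placeholder path
    key: stack-deps-Linux     # fixed key, no prefix matching

- name: Save cache
  if: github.ref == 'refs/heads/main'
  uses: actions/cache/save@v4
  with:
    path: ~/.stack
    key: stack-deps-Linux     # note: keys are immutable, so replacing a
                              # cache means deleting the old entry first
```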

Edit: The above calculation does not take the "static" cache into account, which is another 1 GB. So the cache would already be at its maximum with those...


To avoid having to rebuild the dynamic postgrest build in every nix job, before we run all the tests on it, we can additionally cache just the dist-newstyle folder between the "prepopulate job" and all the test jobs. This comes in at around 200 MB, so we should still have room for those, even when pushing them in PRs. We'd just need to make sure that those caches get some kind of priority over the main caches in terms of eviction, i.e. caches from main should always be evicted last, PR caches first. Will need to figure out whether that's possible.
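A sketch of that per-workflow handover, keyed on the commit SHA so the test jobs get an exact hit (names are placeholders):

```yaml
# In the "prepopulate" job, after building:
- uses: actions/cache/save@v4
  with:
    path: dist-newstyle
    key: postgrest-build-${{ github.sha }}

# In each test job, before running tests:
- uses: actions/cache/restore@v4
  with:
    path: dist-newstyle
    key: postgrest-build-${{ github.sha }}
    fail-on-cache-miss: true   # the build job must have run first
```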

@develop7 (Collaborator) left a comment

Everything seems to be taken care of, LGTM!

@develop7 (Collaborator) commented Feb 7, 2024

> we can additionally cache just the dist-newstyle folder between the "prepopulate job" and all the test jobs

Ooh, this one's good. Should've thought of this myself. Looks like something actions/cache can easily handle.

> We'd just need to make sure that those caches get some kind of priority over the main caches in terms of eviction.

The current eviction policy is that the oldest caches are deleted first, so all we could do is create the big Nix cache last somehow. Let's see if we actually hit that limit first.

> Garbage collect nix/stack/cabal caches?

On the one hand, it makes sense for Nix at least; on the other, there were full Nix cache rebuilds and, from what I've noticed, the cache size doesn't seem to increase significantly, if at all; it was 3-something GB all the time.

> Stack/cabal caches between 300 MB and 1 GB each

Speaking of Stack caches, somehow the Linux cache is 380 MB, macOS is 680 MB and Windows is 1 GB; I wonder why?

@wolfgangwalther (Member Author)

> Speaking of Stack caches, somehow the Linux cache is 380 MB, macOS is 680 MB and Windows is 1 GB; I wonder why?

Yeah, that's something I wondered about, too. I tried stack locally, but I'm getting sizes of more than 2 GB for my ~/.stack folder. This might be what we're seeing on Windows as a slightly compressed cache, though...

I think the difference is that stack would download GHC in those cases, and we'd cache that, too. Which, I think, doesn't make much sense - downloading and installing GHC via stack is by far the fastest piece. Caching the dependencies is the important bit, though.

Maybe we can do better with some combination of ghcup and --system-ghc or so.
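Something like this, assuming ghcup is present on the runner (the version pin is illustrative):

```yaml
# Sketch: install GHC via ghcup on every run (fast), and tell stack to
# use that GHC instead of its own copy, so GHC never ends up in the cache.
- run: ghcup install ghc 9.4.5 --set
- run: stack build --system-ghc --no-install-ghc
```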

@wolfgangwalther (Member Author)

> The current eviction policy is that the oldest caches are deleted first, so all we could do is create the big Nix cache last somehow. Let's see if we actually hit that limit first.

The problem is that the caches from main will by definition always be "the oldest", so any PR cache that blows past the limit would cause some cache from main to be evicted.

@wolfgangwalther (Member Author)

> The problem is that the caches from main will by definition always be "the oldest", so any PR cache that blows past the limit would cause some cache from main to be evicted.

Interestingly, all PR caches from yesterday have already been evicted while the main caches are still there. So maybe there is already some built-in priority between those.

@wolfgangwalther (Member Author)

Since all caches are evicted after 7 days anyway, we could just run a daily workflow to restore (and save?) the current caches on main. This way they'd always be "used last" and hopefully not be evicted instead of PR caches.
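A sketch of such a keep-alive workflow (schedule and keys are placeholders; this assumes a restore counts as "use" for eviction purposes):

```yaml
name: Keep caches warm
on:
  schedule:
    - cron: '0 3 * * *'   # once a day
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      # Restoring refreshes the cache's last-used timestamp; repeat this
      # step for every cache from main that we want to keep alive.
      - uses: actions/cache/restore@v4
        with:
          path: ~/.stack            # placeholder
          key: stack-deps-Linux     # placeholder
```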

@develop7 (Collaborator) commented Feb 7, 2024

> Interestingly, all PR caches from yesterday have already been evicted while the main caches are still there.

Oh, were they? That'd be really useful. Were the corresponding branches deleted or related PRs merged?

@develop7 (Collaborator) commented Feb 7, 2024

> Maybe we can do better with some combination of ghcup and --system-ghc or so.

We could even set Stack up with ghcup, and configure Stack so it uses ghcup to install GHC (the ghcup bootstrap script actually does this unless you ask it not to), and if we cache .ghcup separately, we could share the cached GHC between cabal and stack builds. With GHC tarballs being 200 to 300 MB, times 3 platforms, this would save another ~1 GB of cache.
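Roughly like this (assuming the stack and cabal jobs would use the same GHC version; the pin is illustrative):

```yaml
# Sketch: cache the ghcup-managed toolchain once per GHC version and
# restore it in both the stack and the cabal jobs.
- uses: actions/cache@v4
  with:
    path: ~/.ghcup
    key: ghcup-${{ runner.os }}-ghc-9.4.8
```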

@wolfgangwalther (Member Author)

> we could share the cached GHC between cabal and stack builds.

Not really, they use different GHC versions. Stack is on GHC 9.4.5, Cabal on 9.4.8, 9.6.4 and 9.8.1.

@develop7 (Collaborator) commented Feb 7, 2024

> Since all caches are evicted after 7 days anyway, we could just run a daily workflow to restore (and save?) the current caches on main. This way they'd always be "used last" and hopefully not be evicted instead of PR caches.

Yeah, I think I've suggested that somewhere in the comments on the nix-cache-related PRs. The eviction policy doesn't mention access time, though, but that seems like our best option.

@wolfgangwalther (Member Author)

> Oh, were they? That'd be really useful. Were the corresponding branches deleted or related PRs merged?

Ah, I thought we'd have that case with this PR. But this PR only re-uses caches from main, so that just kept those and deleted everything else...

@develop7 (Collaborator) commented Feb 7, 2024

> Not really, they use different GHC versions

Why do we stick to 9.4.5 for Stack, BTW? 9.4.8 has been in Stackage for a month or so.

@develop7 (Collaborator) commented Feb 7, 2024

Excellent point on caching GHC and dependencies in non-nix builds separately.

@wolfgangwalther (Member Author)

> Why do we stick to 9.4.5 for Stack, BTW? 9.4.8 has been in Stackage for a month or so.

But it's not available on FreeBSD yet:

#3155 (comment)

@wolfgangwalther (Member Author)

So the idea would be to consistently:

  • Install GHC / Stack / Cabal via ghcup. Every time, no cache.
  • Cache dependencies for Stack/Cabal.
  • Cache the actual build for Stack/Cabal separately. This avoids invalidating the dependencies cache for regular code changes.
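A sketch of all three points together, for the stack side (cabal would be analogous with ~/.cabal and dist-newstyle; keys and lock-file names are placeholders):

```yaml
# Toolchain: installed fresh every run, never cached.
- run: ghcup install ghc 9.4.5 --set

# Dependencies: only invalidated when the dependency set changes.
- uses: actions/cache@v4
  with:
    path: ~/.stack
    key: stack-deps-${{ runner.os }}-${{ hashFiles('stack.yaml.lock', 'postgrest.cabal') }}

# Build artifacts: small, invalidated by code changes, restored by prefix.
- uses: actions/cache@v4
  with:
    path: .stack-work
    key: stack-work-${{ runner.os }}-${{ github.sha }}
    restore-keys: stack-work-${{ runner.os }}-
```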

@wolfgangwalther (Member Author)

We could probably also just throw out the Linux x64 cabal build with GHC 9.4.8. We are building with 9.4.8 in nix, with 9.4.5 via stack on various platforms, and with 9.4.8 on arm via Cabal. That should be enough for the 9.4.x series...

@wolfgangwalther (Member Author)

@develop7 do you see any way of "not storing the cache" with cache-nix-action? I think this is possible in v5, but we have reverted to v4. Maybe we should try v5 again - it might not have been buggy at all; maybe our caches were just invalidated for other reasons?

@develop7 (Collaborator) commented Feb 7, 2024

> do you see any way of "not storing the cache" with cache-nix-action?

@wolfgangwalther perhaps use the restore action only? That's what cache-nix-action is using internally anyway.

@wolfgangwalther (Member Author)

@develop7

> perhaps use the restore action only? That's what cache-nix-action is using internally anyway.

Ah, right, I remember why I looked at this and then didn't do it: this would require us to put the save part manually in each place where we currently use the setup-nix action. I haven't found a way to add a "post job hook" or something like that in a composite action yet.

However, v5 of cache-nix-action has the save input, which can control this. Using it would keep things simple: just using setup-nix would be enough.
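Something like this, if v5's inputs work the way I remember (input names worth double-checking against the v5 docs):

```yaml
# Sketch: one step, saving only on main.
- uses: nix-community/cache-nix-action@v5
  with:
    primary-key: nix-${{ runner.os }}-${{ hashFiles('**/*.nix') }}
    save: ${{ github.ref == 'refs/heads/main' }}
```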

@wolfgangwalther (Member Author)

I looked at the "Build MacOS (Nix)" job a little bit, because it takes by far the most time to run (compared to other nix jobs). Both on main (with cache) and in this PR (without cache, because the key was changed). The numbers are:

  • The last run on main took 14m 2s. 13m 37s were spent in the "Setup Nix Action" - most of that to download the actions cache of around 3.8 GB.
  • The last run in this PR took 15m 26s, of which 12m 26s were spent downloading from cachix. Nothing was built. "Setup Nix Action" took 2m 46s.

Assuming the 2m 46s are the "base" time for nix setup without cache download, downloading the actions cache took around 11m here. Compare that to 12m 30s for cachix... and I start to wonder why we bother with caching this stuff via GitHub Actions at all? Cachix seems fast enough to me for that purpose.

I can see the github actions cache be useful to:

  • cache stack and cabal
  • cache postgrest-build for later stages of the same workflow

But I seriously question how much we are gaining from caching nix in the actions cache again.

As an experiment, I added a commit to replace the actions cache for this job with something else: basically a lookup of cachix, which doesn't always involve a full download. This should be very fast when nothing changes - and if it does, it would only download those dependencies that are required for the rebuild. Compare this to downloading everything every time...
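Very roughly, the idea is something like this (the attribute name and cache URL are assumptions on my side):

```yaml
# Sketch: evaluate the output path without building, probe cachix for its
# narinfo, and only download/build when something actually changed.
- run: |
    out=$(nix-instantiate --eval --strict -A postgrestStatic.outPath | tr -d '"')
    hash=$(basename "$out" | cut -d- -f1)
    if curl -fsI "https://postgrest.cachix.org/$hash.narinfo" > /dev/null; then
      nix-store --realise "$out"    # exact hit: substitute just this closure
    else
      nix-build -A postgrestStatic  # rebuild; cachix substitutes the rest
    fi
```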

@develop7 (Collaborator) commented Feb 8, 2024

> I start to wonder why we bother with caching this stuff via GitHub Actions at all?

Nix caching does shave off a minute or two per Nix-involving job, which shortens the feedback loop accordingly; usually by those two minutes, since CI jobs run in parallel. For the macOS case, with a single cache-consuming job and a low-performing runner, it certainly makes sense to drop the cache altogether.

This is my first approach to GitHub Actions and their cache in particular, and, to be honest, it performs worse than I expected, especially on the Windows/macOS runners (download speed is barely 100 Mb/s, down to an abysmal 20 MB/s even, while on Linux runners it's sometimes 380 Mb/s). With dedicated infrastructure I'd expect the improvement to be more significant, but currently it might not be worth the effort.

@wolfgangwalther (Member Author)

> As an experiment, I added a commit to replace the actions cache for this job with something else: basically a lookup of cachix, which doesn't always involve a full download. This should be very fast when nothing changes - and if it does, it would only download those dependencies that are required for the rebuild. Compare this to downloading everything every time...

So that didn't work at all in CI, while locally it was fine. I have no idea yet what's happening there.

> I haven't found a way to add a "post job hook" or something like that in a composite action yet.

This is done now - I was able to keep it all in the setup-nix action.

@develop7 (Collaborator) left a comment

LGTM, hit it

@wolfgangwalther (Member Author) commented Feb 14, 2024

I have a lot of open TODOs; there is nothing to merge from me right now. The code still in this PR is not done yet, from my POV.

Plus, the macOS job is literally failing in this state...

@develop7 (Collaborator)

I mean, the parts you have done already could be extracted to a separate PR and merged, making this PR smaller and easier to review in the end.

Commit messages from this PR:

This restores caches on all branches and pull requests, but only stores them on the main branch and release branches. This prevents those caches from being evicted early when we hit the 10 GB limit quickly.

This is a first step to split up the cabal and stack caches into separate pieces. Here we split the work folder, which just contains the postgrest-specific build artifacts, into a separate cache.

More fine-grained caching should give us better cache hits and a much smaller upload size in the regular case, improving CI performance.

Since the work folder caches are very small (about 30-40 MB), they are cached for PRs, too. This allows the majority of PRs, which only change source code files but no dependencies, to still have their build files cached for additional commits.

This removes the GHC install from stack caches to reduce size.

This action makes sure to always have the correct GHC and/or stack version installed in all environments. This solves the problem where ghc or stack might not be available on newer macOS images anymore, or where ghcup is not available by default on our new custom GitHub runner on arm.