RFC 64: Asset garbage collection #379

Open · kaizencc wants to merge 5 commits into main

Conversation

@kaizencc (Contributor) commented Sep 3, 2021

This is a request for comments about Asset Garbage Collection. See #64 for
additional details.

APIs are signed off by @njlynch.


By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache-2.0 license

Review thread on text/0064-asset-garbage-collection.md:
-t, --type=[s3|ecr] filters for type of asset

Examples:
cdk gc

Contributor:

Can we make this zero-touch?

Could it be something we install into the account that runs periodically? I think that should be the end goal of the user experience.

We could complement that with a CLI experience, but only if one leads naturally into the other (in terms of development effort).

@kaizencc (Contributor, Author):

I could be way off base, but a zero-touch solution would likely be a Lambda function that runs once a week (or on whatever schedule), whereas the CLI experience is invoked by a manual command. At the end of the day, though, the actual code that gets run (tracing the assets in the environment and then deleting the unused ones) should be exactly the same.

It seems like if the CLI experience works then integrating it into a zero-touch solution would not require much additional work.

@eladb (Contributor) commented Sep 9, 2021:

Yes, +1 on @kaizen3031593's response. I agree that a scheduled task that runs GC is eventually what users would want, but I also think it makes sense to start by implementing cdk gc as the foundational building block that will later be used in an automated task.

After we have a solid manual cdk gc (which will take some time to stabilize), we (or the community) can vend a construct that runs this periodically and maybe at some point we can include that in our bootstrapping stack.

@kaizen3031593 it's worth mentioning this "roadmap" in your RFC so that this vision is explicitly articulated.

Another reviewer:

It would be interesting to be able to opt-in to this behavior as part of the bootstrap process. The CdkToolkit stack could maybe be the owner of the scheduled Lambda function that runs the GC process.
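
As a rough illustration of that idea (not part of the RFC, and every name here is made up), a scheduled GC Lambda owned by a bootstrap-style stack could look roughly like the sketch below, assuming a hypothetical `gc-handler` bundle that wraps the same trace-and-delete logic a manual `cdk gc` run would execute:

```ts
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// Hypothetical stack that runs asset garbage collection on a schedule.
// The 'gc-handler' asset is assumed to contain the same tracing + deletion
// logic that `cdk gc` would run when invoked manually.
export class AssetGcStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const gcFunction = new lambda.Function(this, 'GcFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('gc-handler'), // hypothetical handler bundle
      timeout: Duration.minutes(15),
    });

    // Trigger the collector once a week.
    new events.Rule(this, 'WeeklyGc', {
      schedule: events.Schedule.rate(Duration.days(7)),
      targets: [new targets.LambdaFunction(gcFunction)],
    });
  }
}
```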

Comment on lines 104 to 105
assets that are being referenced by these stacks. All assets that are not reached via
tracing can be safely deleted.

Contributor:

Something to think about, and I don't think it's optional: how will we leave a safety margin for rollbacks?

As an example:

  • A pipeline deploys revision R1 with some version of some Lambda code (r1.zip)
  • A pipeline deploys revision R2, changing the Lambda code (r2.zip)
  • Garbage collection gets run (deleting r1.zip)
  • For whatever reason, the app team wants to rollback to R1, but r1.zip is now gone

I guess the answer to what we need to do in GC depends on how rollback is going to work. If it's going to be a full cdk deploy cycle we could rely on the CDK CLI to re-upload the assets... but how do CD systems work?

  • In a CD system like Amazon uses internally the CFN deployment would be executed, but the asset deployment wouldn't (and so r1.zip would be gone without recovery)
  • CodePipeline doesn't currently support rollbacks, but it could in the future (and chances are it would work like Amazon's does)

All in all, you need to think about this case, decide what is reasonable to expect from a CI/CD system and what is reasonable for us to implement, and write a couple of words about that.

For example, a reasonable safety precaution would be we only delete assets 30 days (or pick a number) after they stop being used. That would allow for safe interoperability even with pessimistic CI/CD scenarios.
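
To make that safety margin concrete, here is a minimal sketch (my own illustration, not from the RFC) of an age check over the staging bucket using the AWS SDK for JavaScript v3. Object age is used as a rough proxy for "time since last use", since S3 does not record when an asset was last referenced, and `referencedHashes` is a hypothetical set of asset hashes collected from the deployed templates:

```ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

const GRACE_PERIOD_DAYS = 30; // safety margin for rollbacks
const s3 = new S3Client({});

// Lists staging-bucket objects that are BOTH unreferenced by any deployed
// template AND older than the grace period.
async function deletionCandidates(
  bucket: string,
  referencedHashes: Set<string>,
): Promise<string[]> {
  const cutoff = Date.now() - GRACE_PERIOD_DAYS * 24 * 60 * 60 * 1000;
  const candidates: string[] = [];
  let continuationToken: string | undefined;

  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: bucket,
      ContinuationToken: continuationToken,
    }));
    for (const obj of page.Contents ?? []) {
      const referenced = [...referencedHashes].some((hash) => obj.Key?.includes(hash));
      const oldEnough = (obj.LastModified?.getTime() ?? Number.MAX_SAFE_INTEGER) < cutoff;
      if (!referenced && oldEnough) {
        candidates.push(obj.Key!);
      }
    }
    continuationToken = page.NextContinuationToken;
  } while (continuationToken !== undefined);

  return candidates;
}
```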

@kaizencc (Contributor, Author):

This is a good idea and something I can integrate into the RFC. I'll start with 30 days; since I'm not all that familiar with how people use CloudFormation rollbacks, I don't have a good sense of what (if any) time period would be more suitable.

Another commenter, replying to the 30-day suggestion above:

I don't think this can entirely be solved with timestamps. As I understand it, there are two kinds of race conditions:

  1. You fetch the templates while some stacks are in progress. If you delete assets by cross-referencing such templates, you might delete assets that would be needed if one or more of those in-progress stacks rolled back (because the CFn template you got would be the new one that caused the update, not the one being rolled back to). This cannot be remedied by comparing asset timestamps; it's orthogonal to that - it's about two template versions rather than two timestamps.
  2. You could also race with asset uploads. If an asset got uploaded but the changeset wasn't executed yet, and your workflow ran and referenced the about-to-be-updated template, then you would delete the asset that just got uploaded and the stack update would fail. This is better than the rollback failure of (1), as it'll be an update failure - but still a failure.

I think (2) can be remedied by using a timestamp - leave alone all assets that are not older than X days, where X is the maximum likely time difference in someone's pipeline between the asset-upload stage and the changeset-execution stage.

For (1), assuming that is correct, some atomicity is needed. If we could atomically ask for stack status + template in the same call, we could decide to run the workflow only if none of the stacks were in an in-progress state (since such a stack can roll back to a template you don't know about). Then combine that with the timestamp check from (2) so it doesn't race with newer deployments. This will still race with a create-in-progress: such a stack might not even have existed during the scan, and you would end up deleting the assets of the about-to-be-created stack. Again, the age check from (2) could remedy this. If there are multiple pipelines (and thus multiple stacks) deploying to the same account+region, you will need to configure X to be the worst-case maximum among all the pipelines reaching that stage (obvious, but just stating it).
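
A minimal sketch of that "only run when nothing is in progress" pre-check (my own illustration using the AWS SDK for JavaScript v3; the status read is still not atomic with any later template fetch, so it narrows the race rather than eliminating it):

```ts
import { CloudFormationClient, paginateDescribeStacks } from '@aws-sdk/client-cloudformation';

const cfn = new CloudFormationClient({});

// True only if every stack in the environment is in a stable state, i.e.
// no deployment or rollback could currently race with the collector.
async function safeToCollect(): Promise<boolean> {
  for await (const page of paginateDescribeStacks({ client: cfn }, {})) {
    for (const stack of page.Stacks ?? []) {
      if (stack.StackStatus?.endsWith('_IN_PROGRESS')) {
        return false; // something is deploying or rolling back; try again later
      }
    }
  }
  return true;
}
```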


Probably not going to happen, but if we had a per-stack asset bucket prefix (or an entire bucket) and ECR repo, then we could simplify all this for pipeline deployments by having a post-deployment action that cleans those assets blindly, referring only to the just-succeeded stack. We wouldn't have to bother about other stacks in the account, and it would only happen after a successful update of a stack (once the stack is "stable"), so there would be no need to worry about in-progress states. This combined with the "age" workaround might be the best option? The drawback is that if several stacks/templates actually reference the same asset, it would lead to asset duplication - but maybe that's OK?

@blimmer left a comment:

I think this is a great start. I like how you treat this as a step-1 building block for GC, which many of my clients using CDK increasingly need to support.

@jogold commented Feb 1, 2022:

See https://github.com/jogold/cloudstructs/blob/master/src/toolkit-cleaner/README.md for a working construct that does asset garbage collection.
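
For anyone wanting to try it, usage appears to be essentially a one-liner. If I'm reading the linked README correctly it exposes a ToolkitCleaner construct, but please confirm the exact name and options against that README:

```ts
import { App, Stack } from 'aws-cdk-lib';
import { ToolkitCleaner } from 'cloudstructs';

const app = new App();
const stack = new Stack(app, 'ToolkitCleanerStack');

// Schedules clean-up of unused assets in the CDK bootstrap bucket/repos.
new ToolkitCleaner(stack, 'ToolkitCleaner');
```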

@blimmer commented Feb 4, 2022:

@jogold - thank you for this construct. I tried it out on my personal CDK project and it worked great! I think this is an excellent proof-of-concept and would love to see it eventually integrated into the official CDK project.

@kaizencc (Contributor, Author):

@jogold thanks for the POC for asset garbage collection! This RFC is something that we have on our radar for later this year and when the time comes I'd be happy to iterate on this and get it into the CDK.

@ustulation:
hello @kaizencc / others, I've been following this for a long time and thought about pinging to find out if this is now being worked on, as you mentioned it would happen later in the year and we are now nearing the end of the year.

@ustulation:
The problem I'm facing currently is that various compliance tools highlight that there are insecure images in ECR - it turns out they are pretty similar to what the automatic ECR scans report (especially the ones labelled High severity). They are all images from the past which are no longer in use, as newer versions were pulled and used and so on. However, since they all linger around, the compliance check fails because it doesn't know that we aren't using the older ones.

This of course turns out to be a big problem for orgs that need to pass such compliance checks, so we end up cross-referencing the CFn templates and manually deleting the old images that are no longer in use.

@ustulation:
Assuming the problems mentioned here are real and not due to my lack of knowledge (please correct me otherwise, else I might just be overthinking all this), here's an algorithm that could work in practice:

  1. List all the stacks in the given env (account+region)
  2. For each stack, get the template body and immediately follow it by
  3. Get the stack status
  4. If the status is one of {UPDATE_IN_PROGRESS, UPDATE_FAILED, UPDATE_ROLLBACK_IN_PROGRESS}, abort the garbage cleanup as it's not safe - re-run the entire cycle at some later point, or whenever the user chooses
  5. Once all stack templates are collected, extract all the asset hashes (the usual [a-f0-9]{64}, assuming they are always lower-cased).
  6. Check the CDK staging bucket and ECR repo for all assets whose names contain any of the collected hashes. Leave those alone.
  7. For the rest, which aren't referenced by the templates, if they are older than X, delete them.

In the absence of a way to atomically get both the template and the stack status, this should practically work, though it has theoretical edge cases.
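
In code, steps 1-5 might look something like the sketch below (AWS SDK for JavaScript v3; my own illustration of the algorithm above, not an official implementation):

```ts
import {
  CloudFormationClient,
  DescribeStacksCommand,
  GetTemplateCommand,
  paginateListStacks,
} from '@aws-sdk/client-cloudformation';

const cfn = new CloudFormationClient({});
const UNSAFE = ['UPDATE_IN_PROGRESS', 'UPDATE_FAILED', 'UPDATE_ROLLBACK_IN_PROGRESS'];

// Steps 1-5: collect every asset hash referenced by any stack template in
// the environment, aborting if a stack looks like it may roll back to a
// template we never saw.
async function collectReferencedHashes(): Promise<Set<string>> {
  const hashes = new Set<string>();

  for await (const page of paginateListStacks({ client: cfn }, {})) {
    for (const summary of page.StackSummaries ?? []) {
      if (summary.StackStatus === 'DELETE_COMPLETE') continue;

      // Step 2: fetch the template body...
      const template = await cfn.send(new GetTemplateCommand({ StackName: summary.StackName }));
      // Step 3: ...and immediately fetch the stack status.
      const described = await cfn.send(new DescribeStacksCommand({ StackName: summary.StackName }));
      const status = described.Stacks?.[0]?.StackStatus ?? '';

      // Step 4: abort; the template we just read may be discarded by a rollback.
      if (UNSAFE.includes(status)) {
        throw new Error(`Stack ${summary.StackName} is ${status}; aborting GC run`);
      }

      // Step 5: extract candidate asset hashes (lower-case 64-char hex).
      for (const match of (template.TemplateBody ?? '').match(/[a-f0-9]{64}/g) ?? []) {
        hashes.add(match);
      }
    }
  }
  return hashes;
}
```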

The most interesting part here is the transition from 2 to 3. The only time we have a problem is when we get a template that is currently being applied (an UPDATE_IN_PROGRESS). CFn can roll that back to a template we did not manage to collect, so we can't trust the one we did collect and must abort. This can produce a false positive for aborting: when you ran 2 the stack was stable and it only went into an update between 2 and 3. In that case you could actually use the template you got and carry on, but since there is no way of knowing that, you assume whatever 3 reports was also the state when 2 happened. So we play it safe and abort.

UPDATE_ROLLBACK_IN_PROGRESS in 3 is also not safe. It could also have been UPDATE_ROLLBACK_IN_PROGRESS in 2, in which case it would have been safe, since the rollback template is what we got and what CFn will eventually finish up with, but there's no way of knowing this. In a bad case, 2 could have happened during UPDATE_IN_PROGRESS but the stack moved to UPDATE_ROLLBACK_IN_PROGRESS by 3; the template has changed, you now hold the template that is no longer good, and you could end up deleting assets pointed to by the template being rolled back to (timestamps don't help - those assets might be a year old, for example, and you simply must not delete them). Since the transition from UPDATE_IN_PROGRESS -> UPDATE_ROLLBACK_IN_PROGRESS happens quickly, depending on network latency etc. there is a realistic chance of hitting this race in practice between steps 2 and 3.

Same with UPDATE_FAILED. UPDATE_IN_PROGRESS -> UPDATE_FAILED is fairly quick (even quicker than the one above). No use risking it, so abort.

Edge case:

The rest should be practically OK. There is a theoretical chance that the latency between calls 2 and 3 was big enough that in the meantime the stack went all the way from UPDATE_IN_PROGRESS to UPDATE_ROLLBACK_COMPLETE. You would then interpret the template you got in 2 as belonging to the UPDATE_ROLLBACK_COMPLETE state you got in 3, and since that is a stable state (CFn will not apply any further updates by itself without the user prompting it), you would "consider" that template when you actually shouldn't, since CFn discarded it during the rollback. In practice, however, the time difference between 2 and 3 is in milliseconds, or at worst a second or two. I've never seen a transition from UPDATE_IN_PROGRESS/UPDATE_FAILED (the actual state during 2) to UPDATE_ROLLBACK_IN_PROGRESS to UPDATE_ROLLBACK_COMPLETE (the actual state during 3) happen so fast that this would be a problem; I had to put a sleep between 2 and 3 to simulate it.


All other cases - such as missing a stack because it was CREATE_IN_PROGRESS after 1, or assets that were uploaded just after you finished 2 and 3 for a stack that is about to be updated while you do 7 - can be solved by a timestamp comparison. While dealing with an environment, allow for the biggest interval your pipeline needs between the asset-upload stage and the CFn changeset-execution stage in that env, and only delete assets older than this interval at the time the GC workflow runs for that env. It will be larger for later stages in the pipeline (or you can just set a blanket expiry interval of X days plus whatever time it normally takes for the pipeline to reach the last stage).


This doesn't deal with the IMPORT_* family of statuses because I've never used them personally, but I believe they should follow the same logic as UPDATE_* above?


Alternatives

  1. If there was a CFn API call to (atomically) get both the template and the status, then you wouldn't have to handle some of the cases above explicitly. You would still need to either abort or poll-and-wait in the case of UPDATE_IN_PROGRESS/UPDATE_FAILED, because you would have got the template that is about to be discarded by CFn as it rolls back to one you don't have.
  2. If CFn had an API call that returned either one template (where CFn "stopped" after applying, i.e. it reached a stable state) or two templates (the one CFn is currently applying plus the one it would roll back to if the update failed), then you wouldn't have to deal with this mess at all - just use both templates when available and delete resources not referenced by any of the collected templates (from all stacks in the env). This would be awesome! You would still need the timestamp check for assets belonging to stacks you didn't know about during your scan etc., but that is always going to be the case unless there's an impossible global lock on the account.
  3. Have a different bootstrap qualifier for each pipelined project. That way every stage can have an additional stack which depends on all the other stacks of the project in that stage, and whose sole job is to run the GC workflow after every deployment. It will only run once the upstream stacks are updated/created successfully and are stable. Or have a post-deployment action per stage in the pipeline that does this. In either case, the pipeline locks the deployment to the stage until all actions/updates/rollbacks complete, so you don't have to care about stack status. Further, because projects are separated by CDK bootstrap qualifier, each project gets its own dedicated bucket/ECR repo per stage, and stacks from other projects don't interfere. This would be a clean solution too, I guess? Asset duplication would be a drawback - e.g. the source code of the log-retention lambda etc. is usually shared within the same env; now it would live in per-project asset/staging buckets - which may not be all that bad.

@ustulation:
I'm currently coding this algorithm for my project, with an additional check that retries N times (and then aborts) for every stack where the time difference between the start of 2 and the end of 3 above is more than 2 seconds, to safeguard against the edge case listed above.
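
Roughly, that extra check could look like the following sketch (my own illustration with made-up names; `maxAttempts` and the 2-second threshold are arbitrary choices):

```ts
import {
  CloudFormationClient,
  DescribeStacksCommand,
  GetTemplateCommand,
} from '@aws-sdk/client-cloudformation';

const cfn = new CloudFormationClient({});

// Fetches template + status for one stack and retries (up to maxAttempts)
// whenever the two calls took more than 2 seconds end to end, since a slow
// pair of calls leaves more room for the stack to change state in between.
async function fetchTightSnapshot(
  stackName: string,
  maxAttempts = 3,
): Promise<{ templateBody: string; status: string }> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const start = Date.now();
    const template = await cfn.send(new GetTemplateCommand({ StackName: stackName }));
    const described = await cfn.send(new DescribeStacksCommand({ StackName: stackName }));

    if (Date.now() - start <= 2_000) {
      return {
        templateBody: template.TemplateBody ?? '',
        status: described.Stacks?.[0]?.StackStatus ?? '',
      };
    }
    // Too slow to trust; loop and try again.
  }
  throw new Error(`Could not get a tight template/status snapshot for ${stackName}; aborting GC run`);
}
```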
