
Pulumi State Drifts from Kubernetes Cluster State #2668

Open
scottslowe opened this issue Nov 15, 2023 · 5 comments
Labels
area/await-logic awaiting-feedback Blocked on input from the author kind/bug Some behavior is incorrect or out of spec

Comments

@scottslowe

What happened?

A container image tag was changed and Pulumi was in the process of applying that change to the Kubernetes cluster, but the process was interrupted. Pulumi never committed the change to its own state even though the change had already been pushed to the cluster. As a result, the Pulumi state drifts away from the actual in-cluster state, and future runs of Pulumi won't pick up changes as expected.

Example

  • We update the image tag in the Pulumi stack config manually (the image building and pushing process is managed outside Pulumi); a minimal program sketch follows this list
  • We run pulumi up
  • By checking its own state and comparing against it, Pulumi sees that the desired target image differs from the current image
  • Pulumi appropriately starts a rollout on the Deployment
  • Pulumi updates the Deployment config in the k8s cluster to point to the new target image. THIS IS WHERE IT SHOULD ALSO UPDATE THE PULUMI STATE, BUT IT DOES NOT.
  • At this point changes have been committed to the k8s cluster but not to the Pulumi state
  • The Pulumi deployment gets interrupted for any reason -> the k8s state has been modified but the Pulumi state has not
  • On subsequent runs, Pulumi is not aware of the change that was applied to the k8s cluster, as it never committed the change to its own state before the process was interrupted
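
For concreteness, here's a minimal sketch of the kind of program involved, assuming hypothetical names (the imageTag config key, app name, and registry are placeholders); the essential detail is simply that the image tag comes from stack config and flows into a Deployment:

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";

// Hypothetical config key: the tag is bumped manually after the image is
// built and pushed outside of Pulumi.
const config = new pulumi.Config();
const imageTag = config.require("imageTag");

const appLabels = { app: "my-app" };

// Changing `imageTag` and running `pulumi up` triggers a rollout of this
// Deployment; the cluster is updated before Pulumi records the new image
// in its own state, which is the window described above.
const deployment = new k8s.apps.v1.Deployment("my-app", {
    spec: {
        selector: { matchLabels: appLabels },
        replicas: 2,
        template: {
            metadata: { labels: appLabels },
            spec: {
                containers: [{
                    name: "my-app",
                    image: `registry.example.com/my-app:${imageTag}`,
                }],
            },
        },
    },
});

export const deploymentName = deployment.metadata.name;
```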

Output of pulumi about

Unknown (filing on behalf of a customer)

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@scottslowe scottslowe added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Nov 15, 2023
@mikhailshilkov
Member

@scottslowe This sounds expected. If Pulumi is interrupted, it won't have a chance to write state and won't even know the outcome of the operation. What would you expect to behave differently?

@mikhailshilkov mikhailshilkov added awaiting-feedback Blocked on input from the author and removed needs-triage Needs attention from the triage team labels Nov 15, 2023
@scottslowe
Author

@mikhailshilkov It seems to me there's an atomicity issue here. Pulumi is updating Kubernetes (which triggers a rollout on the Deployment), but it doesn't appear to update its own state until some point afterward (perhaps after waiting for the rollout to complete?). In that window, between when Kubernetes has the desired state and Pulumi does not, there's room for the configuration to drift (i.e., Pulumi gets interrupted; Kubernetes has been updated but Pulumi has not). From my (perhaps naive) point of view, we should be updating Pulumi's state at the same time as (or as close as possible to when) Kubernetes' state is updated. Is that not the case currently?

@awoimbee

Looks like this concerns https://github.com/pulumi (not pulumi-kubernetes).
Note that Pulumi writes checkpoints, and that the managed backend is more robust than the self-hosted one.
Also note that after a SIGINT, Pulumi will still write everything to state (a second SIGINT or a SIGTERM will terminate the program immediately).

@JoaRiski

JoaRiski commented Nov 20, 2023

Hey, just dropping in to mention that I'm the person who opened a discussion about this with @scottslowe on Slack, and he opened this issue on my behalf.

I've mostly been using the managed backend in the projects I've worked on with Pulumi, but from my observations the state drift can occur on both managed and self-hosted backends. The gist of the issue has been described correctly: Pulumi can and will write changes to the k8s state before committing any trace of those changes to its own state, which in some scenarios leads to state drift.

To me it seems like the issue could be resolved by maintaining some kind of write-ahead log in the Pulumi state that can be used to pick up interrupted deployments (or at the very least to clean them up properly). Perhaps this is even done already, but it gets incorrectly rolled back on a failed deployment (even though the state has already been committed and not cleaned up)?
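
To make the write-ahead idea concrete, a rough and purely illustrative sketch is below; the field names and shapes are assumptions for discussion, not Pulumi's actual checkpoint schema:

```typescript
// Purely illustrative types: assumptions, not Pulumi's real state format.
interface PendingOperation {
    urn: string;                      // resource being modified
    kind: "create" | "update" | "delete";
    startedAt: string;                // ISO timestamp, recorded before the API call
    inputs: Record<string, unknown>;  // inputs already sent to the cluster
}

// Sketch of the recovery idea: if a previous run left pending entries,
// the next run would re-read the live objects and reconcile state before
// computing a new diff.
function needsReconciliation(pending: PendingOperation[]): boolean {
    return pending.length > 0;
}
```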

Another potentially useful observation is that this has never happened when Pulumi isn't configured to wait for the resource to become live. I wouldn't be ready to say that the waiting is what causes the bug, however; it could simply be a factor that widens the timing window in which the bug can occur, and deployments where Pulumi isn't configured to wait may still have the same bug, just with a much shorter window in which it can actually happen.
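
(For what it's worth, on pulumi-kubernetes the waiting behavior can be disabled per resource with the pulumi.com/skipAwait annotation, which might be one way to test whether the await window is the dominant factor; the resource below is just an example.)

```typescript
import * as k8s from "@pulumi/kubernetes";

const labels = { app: "my-app" };

// `pulumi.com/skipAwait: "true"` tells the provider to skip its await
// logic and report the resource as ready as soon as the API server
// accepts the change.
const noAwait = new k8s.apps.v1.Deployment("my-app-no-await", {
    metadata: {
        annotations: { "pulumi.com/skipAwait": "true" },
    },
    spec: {
        selector: { matchLabels: labels },
        replicas: 1,
        template: {
            metadata: { labels },
            spec: {
                containers: [{
                    name: "my-app",
                    image: "registry.example.com/my-app:v1",
                }],
            },
        },
    },
});
```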

I'm not sure which component this issue really belongs to under Pulumi, but it has been an issue for long enough that, for example, I recently couldn't recommend that a customer use the Pulumi Kubernetes Operator to automate their stack deployments, because the only way to fix this state drift is somewhat manual. The best automated alternative I'm aware of would be to run pulumi refresh before/after each deployment, but I suspect there's a reason Pulumi doesn't already do that by default: it leads to detecting changes that don't tangibly matter in most cases, and you'd have to maintain a set of ignoreChanges rules on every resource. I'm also not sure refresh would fix it in the first place, because refresh only refreshes the state of tracked resources, and sometimes the problem is that Pulumi is not tracking them.
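
For reference, the refresh-before-deploy workaround mentioned above could be automated with the Automation API; a rough sketch follows (the stack name and working directory are placeholders):

```typescript
import { LocalWorkspace } from "@pulumi/pulumi/automation";

async function deploy(): Promise<void> {
    // Placeholder stack/workDir: select the stack that manages the cluster.
    const stack = await LocalWorkspace.selectStack({
        stackName: "dev",
        workDir: "./infra",
    });

    // Reconcile Pulumi's state with the live cluster first, then deploy.
    // As noted above, refresh only covers resources Pulumi already tracks.
    await stack.refresh({ onOutput: console.log });
    await stack.up({ onOutput: console.log });
}

deploy().catch((err) => {
    console.error(err);
    process.exit(1);
});
```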

I can't promise I'll have much time to put into this before next year, but I'd appreciate it if someone could digest this issue into something potentially actionable that I can contribute towards when I next run into it (giving me a reason to do so).

@mikhailshilkov mikhailshilkov added needs-triage Needs attention from the triage team and removed awaiting-feedback Blocked on input from the author labels May 8, 2024
@blampe
Contributor

blampe commented May 8, 2024

@scottslowe @awoimbee @JoaRiski the Pulumi engine is currently only able to update state once it gets a response from the provider's RPC handler. With this unary model there will always be a race condition where hard interruptions (kill -9) can leave state out of sync. pulumi/pulumi#15958 and pulumi/pulumi#5210 track work in this area.

Under normal circumstances, when the RPC completes but the resource doesn't become ready, we return an ErrorResourceInitFailed (see here). That should take care of updating Pulumi's state to match the cluster.

Soft interruptions (ctrl-C) invoke the provider's Cancel handler to give it an opportunity to return early from its work. I do see a potential bug where our cancellation logic (and timeout handling) doesn't seem to return ErrorResourceInitFailed as you would expect. Edit: after looking more and attempting to repro this, we do appear to be handling cancellation correctly.

A couple things that would be helpful to know:

  • Are there any resource types in particular where you happen to see this more frequently?
  • Are the interruptions typically due to ctrl-C, timeouts, or something else?

@blampe blampe added area/await-logic awaiting-feedback Blocked on input from the author and removed needs-triage Needs attention from the triage team labels May 8, 2024