Pulumi State Drifts from Kubernetes Cluster State #2668
Comments
@scottslowe This sounds expected. If Pulumi is interrupted, it won't have a chance to write state and wouldn't even know the outcome of the operation. What do you expect to behave differently?
@mikhailshilkov It seems to me there's an atomicity issue here. Pulumi is updating Kubernetes (which triggers a rollout on the Deployment), but doesn't appear to be updating its own state until some point afterward (perhaps waiting for the rollout to complete?). In that window, between when Kubernetes has the desired state and Pulumi does not, there's room for the configuration to drift (i.e., Pulumi gets interrupted after Kubernetes has been updated but before Pulumi's state has). From my (perhaps naive) point of view, we should be updating Pulumi's state at the same time (or as close as possible) as Kubernetes' state is being updated. Is that not the case currently?
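To illustrate the ordering described in the comment above, here is a small self-contained TypeScript sketch. The Provider and Checkpoint interfaces are hypothetical stand-ins for illustration, not Pulumi's actual engine types:

```typescript
// Illustrative only: hypothetical stand-ins for the engine's provider call
// and state checkpoint, showing where the drift window opens.

interface UpdateResult {
    resourceId: string;
    appliedImage: string;
}

interface Provider {
    // Applies the change to the cluster and (optionally) waits for readiness.
    update(image: string): Promise<UpdateResult>;
}

interface Checkpoint {
    // Persists the new resource state (the "Pulumi state" in this issue).
    write(result: UpdateResult): Promise<void>;
}

async function updateResource(provider: Provider, checkpoint: Checkpoint, image: string): Promise<void> {
    // Step 1: the change is live in the cluster as soon as the provider acts
    // on it, but the caller only learns the outcome when this promise resolves
    // (which may include waiting for the rollout to complete).
    const result = await provider.update(image);

    // Step 2: only now is the new state recorded. An interruption between
    // step 1 and step 2 leaves the cluster updated while the recorded state
    // still holds the old image tag: the drift described above.
    await checkpoint.write(result);
}
```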
Looks like this concerns https://github.com/pulumi (not pulumi-kubernetes)
Hey, just dropping in to mention I'm the person who opened a discussion about this with @scottslowe on Slack, and he opened this issue on my behalf. I've mostly been using the managed backend in the projects I've worked on with Pulumi, but from my observations the state drift can occur on both managed and self-hosted backends.

The gist of the issue has been described correctly: Pulumi can and will write changes to the Kubernetes state before committing any trace of those changes to its own state, which in some scenarios leads to state drift. To me it seems like the issue could be resolved by maintaining some kind of write-ahead log in the Pulumi state which can be used to pick up interrupted deployments (or at the very least properly clean them up). Perhaps this is even done, but it gets incorrectly rolled back on a failed deployment (even though the change has already been committed and not cleaned up)?

Another potentially useful observation is that this has never happened when Pulumi isn't configured to wait for the resource to become live. I wouldn't go so far as to say the waiting is what causes the bug, however; it could simply be a factor that widens the timing window in which it can occur, and deployments where Pulumi isn't configured to wait still have the same bug, just a much shorter window where it can actually happen.

I'm not sure which component this issue really belongs to under Pulumi, but it has been an issue for long enough that, for example, I recently couldn't recommend that a customer use the Pulumi Kubernetes operator to automate their stack deployments, because the only way to fix this state drift is somewhat manual. The best automated alternative I'm aware of would be to run pulumi refresh before/after each deployment, but I suspect there's a reason Pulumi doesn't already do that by default: it detects changes which don't tangibly matter in most cases, and you'd have to maintain a set of ignoreChanges rules on every resource. To begin with, I'm not sure refresh would even fix it, because refresh only refreshes the state of tracked resources, and sometimes the problem is that Pulumi is not tracking them.

I can't promise I'll have much time to put into this before next year, but I'd appreciate it if someone could digest this issue into something potentially actionable that I can contribute towards when I next run into it (giving me a reason to do so).
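For reference, a minimal sketch of the refresh-before-deploy workaround mentioned above, using the Pulumi Automation API; the stack name "dev" and the ./infra working directory are illustrative assumptions, not details from this issue:

```typescript
import { LocalWorkspace } from "@pulumi/pulumi/automation";

async function deployWithRefresh(): Promise<void> {
    // Assumes an existing Pulumi program in ./infra and a stack named "dev".
    const stack = await LocalWorkspace.selectStack({
        stackName: "dev",
        workDir: "./infra",
    });

    // Reconcile Pulumi's state with the live cluster before deploying, so an
    // earlier interrupted update doesn't leave stale state behind.
    await stack.refresh({ onOutput: console.log });

    // Apply the desired configuration.
    await stack.up({ onOutput: console.log });
}

deployWithRefresh().catch((err) => {
    console.error(err);
    process.exit(1);
});
```

As noted above, this only reconciles resources Pulumi already tracks, so it won't recover a resource that was created in the cluster but never recorded in the state.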
@scottslowe @awoimbee @JoaRiski the Pulumi engine is currently only able to update state once it gets a response from the provider's RPC handler. With this unary model there will always be a race condition where hard interruptions (…) can leave the cluster and Pulumi's state out of sync.

Under normal circumstances, when the RPC completed but the resource didn't become ready, we return an ErrorResourceInitFailed -- see here. That should take care of updating Pulumi's state to match the cluster.

Soft interruptions (…)

A couple of things that would be helpful to know:
What happened?
A container image tag was changed and Pulumi was in the process of applying that change to the Kubernetes cluster when the process was interrupted. Pulumi never committed the change to its own state even though it had already been pushed to the cluster. As a result, Pulumi's state drifts away from the actual in-cluster state, and future runs of Pulumi won't pick up changes as expected.
Example
pulumi up
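For context, a minimal hypothetical program of the kind involved here (the actual customer program is not part of this issue). Bumping the image tag and interrupting pulumi up mid-rollout reproduces the scenario described above:

```typescript
import * as k8s from "@pulumi/kubernetes";

const appLabels = { app: "drift-demo" };

// Changing this tag and interrupting `pulumi up` while the rollout is still
// in progress is the scenario described above: the cluster picks up the new
// image, but Pulumi's state may still record the old one.
const image = "nginx:1.25.3";

new k8s.apps.v1.Deployment("drift-demo", {
    spec: {
        replicas: 1,
        selector: { matchLabels: appLabels },
        template: {
            metadata: { labels: appLabels },
            spec: {
                containers: [{ name: "nginx", image }],
            },
        },
    },
});
```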
Output of pulumi about
Unknown (filing on behalf of a customer)
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).