-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: New Healthy
condition for claims and XRs
#5643
Comments
CC @dalton-hill-0 @TerjeLafton @jbw976 - Just an FYI as this is tangentially related to designs you've all been working on. |
Hi @negz Very good feature. It will help a lot to a fast investigation on issues. We can see a bit of the Health status via crossplane beta trace cli command, but its not clear when you get status only of one resource. |
I think this would be very helpful, struggling with this atm. |
If anyone from the community wants to give this a shot, I'd be supportive. I think it would require a little experience with Crossplane though. |
Would composition functions be able to set/override the message for the |
@dalton-hill-0 I'm not sure. When would a function want to do that? For an XR I imagine the messages would be one of the following when the XR wasn't
With sync issues trumping composed resource issues if both were true (since the XR not being synced means it might not have the latest composed resource status). A function could indirectly influence |
Executive summary
I propose we add a new
Healthy
status condition to claims and XRs. This condition would indicate whether the resource and all of its child resources were healthy - i.e. weren't currently encountering any errors.This would make it possible to quickly determine whether a claim isn't ready because it's hitting an error, or because it just needs more time to become ready.
Importantly, the new condition wouldn't include the specific error(s) child resources were encountering. It would only indicate that there were or weren't errors.
What problem are you facing?
Claims, XRs, and MRs have the following status conditions:
Ready
- Indicates that Crossplane believes the resource is ready to useSynced
- Indicates that Crossplane successfully synced the resource on the last reconcileReady
doesn't mean "Crossplane is successfully reconciling this resource". It means "the thing this resource represents appears to be ready to use". For example if a claim represents a database, theReady
condition indicates that Crossplane believes the database is ready to use. It's ready for some application to connect to it and store data.Ready
is cumulative. A claim is onlyReady
when its underlying XR isReady
. The XR is onlyReady
when all its composed resources areReady
.Synced
does mean "Crossplane is successfully reconciling this resource". It isn't cumulative, though. It only means "this controller reconciled its most recent desired state successfully". For example an XR could reconcile its desired state by creating three misconfigured composed resources that would never becomeSynced
orReady
. The XR is stillSynced
though, because its controller did its job. It successfully synced its desired state by creating composed resources.I think these two conditions make sense. For example:
Synced
because Crossplane has successfully applied its desired state, but not yetReady
to use.Ready
to use (as far as Crossplane knows), but no longerSynced
because Crossplane is hitting an error.There's an information gap though. Today if you look at a newly created claim you'll likely see:
Synced: True
- The claim successfully created and is syncing with an XRReady: False
- The conceptual resource the claim represents (e.g. a database) isn't ready to use yetIt's possible the XR controller, and all composed resource controllers are working fine but some MR isn't ready to use yet - e.g. the RDS API is reporting that a composed RDS instance isn't ready to use. It's also possible that one of the controllers is stuck due to an error.
Today there's no way to distinguish between these two cases without inspecting the next layer down.
Folks have been feeling the pain of this gap for some time, but I've been reluctant to address it because the proposed solution has always been "take the errors from lower level controllers and propagate them to the higher level controllers" - for example #3262.
@vfarcic explains why that isn't a great solution in https://www.youtube.com/watch?v=xAl3TAfFE_M. Claims, XRs, and MRs are all different layers of abstraction, with different intended audiences. An error that makes sense at one layer probably won't make sense at a higher layer of abstraction.
Consider for example an XR error due to an invalid Composition. I don't think it makes sense to copy that error verbatim to the claim. Assuming the claim author isn't writing their own Composition, the error isn't actionable for them.
One way to address this is to build a presentation layer that allows Composition authors to choose which errors were surfaced to the claim and how. That's what we're working on in #5402.
The downside of the presentation layer approach is that the information gap still exists by default. If the Composition author doesn't actively "present" errors to the claim, there's still no way to distinguish between a claim that's not ready due to an error, or a claim that's not ready because the cloud provider is still working on creating one of the composed resources.
How could Crossplane help solve your problem?
I propose we add a new
Healthy
status condition to claims and XRs.A claim would be
Healthy
only when:Synced
Synced
andHealthy
An XR would be
Healthy
only when:Synced
Synced
, and if applicable (e.g. the composed resource is an XR)Healthy
The
Healthy
condition would in effect be a cumulative version of theSynced
condition. I'm proposing a new condition instead of proposing that we change the meaning of theSynced
condition because changing the meaning would be a significant behavior change.Importantly, I'm not proposing that we copy error messages from child to parent resources. For example if a claim wasn't
Healthy
I'd expect the status condition message to say something generic like "N/M composed resources are not synced" or just "Some resources encountered errors".Under this proposal a claim status might look like this:
An XR status might look like this:
The text was updated successfully, but these errors were encountered: