WaitForPodsReady: Store last requeued count and time #2175

Open · 3 tasks done
tenzen-y opened this issue May 9, 2024 · 16 comments
Labels: kind/feature (Categorizes issue or PR as related to a new feature.)
@tenzen-y
Member

tenzen-y commented May 9, 2024

What would you like to be added:
Depends on #2174.

Since #2063, the workload controller resets the .status.requeueState.requeueAt if the requeueAt exceeds the current time.
We will also reset the .status.requeueState.count to fix the bug reported in #2174.

So, I would like to propose a dedicated API that is not involved in scheduling, like this:

```yaml
[...]
status:
  lastRequeueState:
    requeueAt: $TIME
    count: 3
[...]
```
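
For illustration, a minimal Go sketch of how the proposed field could look, modeled on Kueue's existing RequeueState type. The LastRequeueState field is the proposal here and does not exist yet:

```go
// A sketch of the proposed API shape. RequeueState mirrors the existing
// Kueue type; LastRequeueState is the proposed, not-yet-existing field.
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// RequeueState holds the live requeue bookkeeping that the workload
// controller resets.
type RequeueState struct {
	// count records how many times the workload has been requeued.
	Count *int32 `json:"count,omitempty"`
	// requeueAt records when the workload will be requeued again.
	RequeueAt *metav1.Time `json:"requeueAt,omitempty"`
}

// WorkloadStatus fragment: lastRequeueState (proposed) would preserve the
// final count and time after requeueState is reset, and would never be
// consumed by scheduling.
type WorkloadStatus struct {
	RequeueState     *RequeueState `json:"requeueState,omitempty"`
	LastRequeueState *RequeueState `json:"lastRequeueState,omitempty"`
}
```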

Why is this needed:
As initially designed, the requeueState is responsible both for storing the last requeue time and count, and for notifying users of them.
However, to avoid the race condition, we dropped (or will drop) that functionality from the .status.requeueState API in the Workload.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@tenzen-y added the kind/feature label on May 9, 2024
@tenzen-y
Member Author

tenzen-y commented May 9, 2024

cc: @alculquicondor @mimowo

@alculquicondor
Contributor

Since #2063, the workload controller resets the .status.requeueState.requeueAt if the requeueAt exceeds the current time.

What about just reverting that?

Would it be enough?

@alculquicondor
Contributor

I would be worried about adding more fields that could just cause confusion.

But maybe it isn't too bad.

WDYT @mimowo?

@tenzen-y
Member Author

@alculquicondor Sorry for the confusion. Actually, we need to reset the .status.requeueState.count field to avoid the race condition. So, could you check #2174 first? Thanks.

@mimowo
Contributor

mimowo commented May 10, 2024

I would be worried about adding more fields that could just cause confusion.

But maybe it isn't too bad.

WDYT @mimowo?

I think we indeed need to reset the .status.requeueState.count field, because currently when an admin re-activates the workload, it gets re-evicted immediately. So the admin needs to first clear .status.requeueState.count, then activate with spec.active=true. This is an extra step and leaves room for error.

Now, if we clear .status.requeueState.count it will work, but as discussed offline with @tenzen-y, the admins may want to know how many retries it took before deactivating. For that, the minimal approach would be to just record the count in the message for PodsReadyTimeout, say changing Exceeded the PodsReady timeout %s to The PodsReady timeout %s was exceeded %v times in a row. I would hope this message is enough. WDYT @tenzen-y?
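
A minimal sketch of that message construction, assuming a hypothetical helper; only the format string follows the suggestion above:

```go
package main

import (
	"fmt"
	"time"
)

// podsReadyTimeoutMessage is a hypothetical helper illustrating the
// suggested change: embed the retry count in the eviction message so
// admins can see it without a new API field.
func podsReadyTimeoutMessage(timeout time.Duration, count int32) string {
	return fmt.Sprintf("The PodsReady timeout %s was exceeded %d times in a row", timeout, count)
}

func main() {
	fmt.Println(podsReadyTimeoutMessage(5*time.Minute, 3))
	// Output: The PodsReady timeout 5m0s was exceeded 3 times in a row
}
```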

If the message is not enough, because some automation wants to consume the information about the number of retries, then I think the proposal with status.lastRequeueState makes sense. We do something similar for pods with lastTerminationState.

@alculquicondor
Contributor

I like the idea of just updating the condition message. The users just need a quick signal to understand what happened.

@tenzen-y
Member Author

Now, if we clear .status.requeueState.count it will work, but as discussed offline with @tenzen-y, the admins may want to know how many retries it took before deactivating. For that, the minimal approach would be to just record the count in the message for PodsReadyTimeout, say changing Exceeded the PodsReady timeout %s to The PodsReady timeout %s was exceeded %v times in a row. I would hope this message is enough. WDYT @tenzen-y?

My motivation is to provide a machine-readable state, so I would prefer to have lastRequeueState.
But, as I mentioned here, when we deactivate the Workload, adding an Evicted condition instead of modifying the .spec.active field would allow us to avoid the race condition and avoid adding a new .status.lastRequeueState field.

@alculquicondor @mimowo Regarding this idea, WDYT?

@mimowo
Contributor

mimowo commented May 10, 2024

My motivation is to provide a machine-readable state.

Do you have a concrete use case where this information would be parsed by automation? If not, then we can introduce some form of keeping the information structured (like lastRequeueState) later, when users request it.

@alculquicondor @mimowo Regarding #2174 (comment) idea, WDYT?

I'm not sure. Wouldn't we clear the .status.requeueState.count in that case?

@mimowo
Contributor

mimowo commented May 10, 2024

Actually, if we need this information structured, I would lean towards using lastRequeueState, by analogy to the lastTerminationState API. It seems cleaner to have a dedicated API for this purpose than to overload requeueState with two responsibilities: driving the mechanism and serving as the structured reason for deactivation.

@alculquicondor
Contributor

+1 to not reusing it, but I still want to know why automation would need all of these details, as opposed to just a reason in the Evicted condition.

@tenzen-y
Member Author

Do you have a concrete use case where this information would be parsed by automation? If not, then it can introduce some form of keeping the information structured (like lastRequeueState) when requested by users.

In the platform engineering context, the admins (SWE/Ops/SRE) often develop and provide common platforms across the company to users (Researchers/DS/ML Engineers).
In that case, we wouldn't give users permission to operate Kubernetes directly, since the users often don't know Kubernetes, and we also need to enforce security policy (similar to the IAM concept).

So, we often provide an in-house CLI and Console wrapper that allow users to operate Jobs (Create/List/Delete).
In that internal platform, we would surface requeueState.count and requeueState.requeueAt to users via the in-house API, CLI, and Console.

Therefore, I would like to provide a machine-readable API via the Workload resource.
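
For illustration, a sketch of how such an in-house tool might read those fields, assuming Kueue's v1beta1 Go API and a controller-runtime client; the CLI itself is hypothetical:

```go
package main

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// printRequeueState is a hypothetical excerpt from an in-house CLI: fetch a
// Workload and report its requeue state to a user who has no direct
// Kubernetes access.
func printRequeueState(ctx context.Context, c client.Client, key client.ObjectKey) error {
	var wl kueue.Workload
	if err := c.Get(ctx, key, &wl); err != nil {
		return err
	}
	if rs := wl.Status.RequeueState; rs != nil {
		if rs.Count != nil {
			fmt.Printf("requeue count: %d\n", *rs.Count)
		}
		if rs.RequeueAt != nil {
			fmt.Printf("requeue at: %s\n", rs.RequeueAt.Time)
		}
	}
	return nil
}
```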

@alculquicondor
Contributor

In that case, let's go with the dedicated API

@tenzen-y
Member Author

In that case, let's go with the dedicated API

As you pointed out here, it seems that we cannot avoid resetting the requeueState...

So, if @alculquicondor and @mimowo are ok, I would like to add a dedicated API (.status.lastRequeueState).

@tenzen-y
Member Author

/assign

@alculquicondor
Contributor

In addition to the lastRequeueState, I was wondering if it's worth also adding an Evicted condition right before deactivating the Workload?
That could hold a proper reason.
Otherwise, if we just deactivate, this would create its own Evicted condition with the reason Deactivated.
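
A rough sketch of that ordering, assuming apimachinery's condition helpers; the dedicated reason string is illustrative only:

```go
package main

import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// evictAndDeactivate sets an Evicted condition with a dedicated reason
// before flipping spec.active, so the deactivation is not recorded under
// the generic Deactivated reason.
func evictAndDeactivate(wl *kueue.Workload) {
	apimeta.SetStatusCondition(&wl.Status.Conditions, metav1.Condition{
		Type:    kueue.WorkloadEvicted,
		Status:  metav1.ConditionTrue,
		Reason:  "PodsReadyTimeoutExceeded", // illustrative dedicated reason
		Message: "The PodsReady timeout was exceeded too many times in a row",
	})
	wl.Spec.Active = ptr.To(false)
}
```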

@tenzen-y
Member Author

In addition to the lastRequeueState, I was wondering if it's worth also adding an Evicted condition right before deactivating the Workload? That could hold a proper reason. Otherwise, if we just deactivate, this would create its own Evicted condition with a reason Deactivated

Yeah, I also think it would be worth having a dedicated reason.
