WaitForPodsReady: Store last requeued count and time #2175

Open · 3 tasks done
tenzen-y opened this issue May 9, 2024 · 16 comments
Labels: kind/feature (Categorizes issue or PR as related to a new feature.)
@tenzen-y
Member

tenzen-y commented May 9, 2024

What would you like to be added:
Depends on #2174.

Since #2063, the workload controller resets the .status.requeueState.requeueAt if the requeueAt exceeds the current time.
We will also reset the .status.requeueState.count to fix the bug reported in #2174.

So, I would like to propose a dedicated API that is not involved in scheduling, like this:

```yaml
[...]
status:
  lastRequeueState:
    requeueAt: $TIME
    count: 3
[...]
```
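
For illustration, a minimal Go sketch of how the proposed field could look, modeled on Kueue's existing RequeueState type. The LastRequeueState field is the proposal here and does not exist yet:

```go
// A sketch of the proposed API shape. RequeueState mirrors the existing
// Kueue type; LastRequeueState is the proposed, not-yet-existing field.
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// RequeueState holds the live requeue bookkeeping that the workload
// controller resets.
type RequeueState struct {
	// count records how many times the workload has been requeued.
	Count *int32 `json:"count,omitempty"`
	// requeueAt records when the workload will be requeued again.
	RequeueAt *metav1.Time `json:"requeueAt,omitempty"`
}

// WorkloadStatus fragment: lastRequeueState (proposed) would preserve the
// final count and time after requeueState is reset, and would never be
// consumed by scheduling.
type WorkloadStatus struct {
	RequeueState     *RequeueState `json:"requeueState,omitempty"`
	LastRequeueState *RequeueState `json:"lastRequeueState,omitempty"`
}
```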

Why is this needed:
As initially designed, the requeueState is responsible both for storing the last requeue time and count, and for notifying users of them.
However, to avoid the race condition, we dropped (or will drop) that functionality from the .status.requeueState API in the Workload.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@tenzen-y added the kind/feature label on May 9, 2024
@tenzen-y
Member Author

tenzen-y commented May 9, 2024

cc: @alculquicondor @mimowo

@alculquicondor
Contributor

Since #2063, the workload controller resets the .status.requeueState.requeueAt if the requeueAt exceeds the current time.

What about just reverting that?

Would it be enough?

@alculquicondor
Contributor

I would be worried about adding more fields that could just cause confusion.

But maybe it isn't too bad.

WDYT @mimowo?

@tenzen-y
Member Author

@alculquicondor Sorry for the confusion. Actually, we need to reset the .status.requeueState.count field to avoid the race condition. So, could you check #2174 first? Thanks.

@mimowo
Contributor

mimowo commented May 10, 2024

I would be worried about adding more fields that could just cause confusion.

But maybe it isn't too bad.

WDYT @mimowo?

I think we indeed need to reset the .status.requeueState.count field, because currently when an admin re-activates the workload, it gets re-evicted immediately. So the admin needs to first clear .status.requeueState.count, then activate with spec.active=true. This is an extra step and leaves room for error.

Now, if we clear .status.requeueState.count it will work, but as discussed offline with @tenzen-y, the admins may want to know how many retries it took before deactivating. For that, the minimal approach would be to just record the count in the message for PodsReadyTimeout, say changing Exceeded the PodsReady timeout %s to The PodsReady timeout %s was exceeded %v times in a row. I would hope this message is enough. WDYT @tenzen-y?
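
A minimal sketch of that message construction, assuming a hypothetical helper; only the format string follows the suggestion above:

```go
package main

import (
	"fmt"
	"time"
)

// podsReadyTimeoutMessage is a hypothetical helper illustrating the
// suggested change: embed the retry count in the eviction message so
// admins can see it without a new API field.
func podsReadyTimeoutMessage(timeout time.Duration, count int32) string {
	return fmt.Sprintf("The PodsReady timeout %s was exceeded %d times in a row", timeout, count)
}

func main() {
	fmt.Println(podsReadyTimeoutMessage(5*time.Minute, 3))
	// Output: The PodsReady timeout 5m0s was exceeded 3 times in a row
}
```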

If the message is not enough, because some automation wants to consume the information about the number of retries, then I think the proposal with status.lastRequeueState makes sense. We do something similar for pods with lastTerminationState.

@alculquicondor
Contributor

I like the idea of just updating the condition message. The users just need a quick signal to understand what happened.

@tenzen-y
Member Author

Now, if we clear .status.requeueState.count it will work, but as discussed offline with @tenzen-y, the admins may want to know how many retries it took before deactivating. For that, the minimal approach would be to just record the count in the message for PodsReadyTimeout, say changing Exceeded the PodsReady timeout %s to The PodsReady timeout %s was exceeded %v times in a row. I would hope this message is enough. WDYT @tenzen-y?

My motivation is to provide a machine-readable state, so I would prefer to have lastRequeueState.
But, as I mentioned here, when we deactivate the Workload, adding an Evicted condition instead of modifying the .spec.active field would allow us to avoid the race condition and avoid adding a new .status.lastRequeueState field.

@alculquicondor @mimowo Regarding this idea, WDYT?

@mimowo
Contributor

mimowo commented May 10, 2024

My motivation is to provide a machine-readable state.

Do you have a concrete use case where this information would be parsed by automation? If not, then we can introduce some form of keeping the information structured (like lastRequeueState) later, when users request it.

@alculquicondor @mimowo Regarding #2174 (comment) idea, WDYT?

I'm not sure. Wouldn't we clear the .status.requeueState.count in that case?

@mimowo
Contributor

mimowo commented May 10, 2024

Actually, if we need this information structured, I would lean towards using lastRequeueState, by analogy to the lastTerminationState API. It seems cleaner to have a dedicated API for this purpose than to overload requeueState with two responsibilities: driving the mechanism and serving as the structured reason for deactivation.

@alculquicondor
Contributor

+1 to not reusing it, but I still want to know why automation would need all of these details, as opposed to just a reason in the Evicted condition.

@tenzen-y
Member Author

Do you have a concrete use case where this information would be parsed by automation? If not, then it can introduce some form of keeping the information structured (like lastRequeueState) when requested by users.

In the platform engineering context, the admins (SWE/Ops/SRE) often develop and provide common platforms across the company to users (Researchers/DS/ML Engineers).
In that case, we wouldn't give users permission to operate Kubernetes directly, since the users often don't know Kubernetes, and we also need to enforce security policy (similar to the IAM concept).

So, we often provide an in-house CLI and Console wrapper that allow users to operate Jobs (Create/List/Delete).
In that internal platform, we would surface requeueState.count and requeueState.requeueAt to users via the in-house API, CLI, and Console.

Therefore, I would like to provide a machine-readable API via the Workload resource.
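
For illustration, a sketch of how such an in-house tool might read those fields, assuming Kueue's v1beta1 Go API and a controller-runtime client; the CLI itself is hypothetical:

```go
package main

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// printRequeueState is a hypothetical excerpt from an in-house CLI: fetch a
// Workload and report its requeue state to a user who has no direct
// Kubernetes access.
func printRequeueState(ctx context.Context, c client.Client, key client.ObjectKey) error {
	var wl kueue.Workload
	if err := c.Get(ctx, key, &wl); err != nil {
		return err
	}
	if rs := wl.Status.RequeueState; rs != nil {
		if rs.Count != nil {
			fmt.Printf("requeue count: %d\n", *rs.Count)
		}
		if rs.RequeueAt != nil {
			fmt.Printf("requeue at: %s\n", rs.RequeueAt.Time)
		}
	}
	return nil
}
```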

@alculquicondor
Contributor

In that case, let's go with the dedicated API

@tenzen-y
Member Author

In that case, let's go with the dedicated API

As you pointed out here, it seems that we cannot avoid resetting the requeueState...

So, if @alculquicondor and @mimowo are ok, I would like to add a dedicated API (.status.lastRequeueState).

@tenzen-y
Member Author

/assign

@alculquicondor
Contributor

In addition to the lastRequeueState, I was wondering if it's worth also adding an Evicted condition right before deactivating the Workload?
That could hold a proper reason.
Otherwise, if we just deactivate, this would create its own Evicted condition with the reason Deactivated.
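
A rough sketch of that ordering, assuming apimachinery's condition helpers; the dedicated reason string is illustrative only:

```go
package main

import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// evictAndDeactivate sets an Evicted condition with a dedicated reason
// before flipping spec.active, so the deactivation is not recorded under
// the generic Deactivated reason.
func evictAndDeactivate(wl *kueue.Workload) {
	apimeta.SetStatusCondition(&wl.Status.Conditions, metav1.Condition{
		Type:    kueue.WorkloadEvicted,
		Status:  metav1.ConditionTrue,
		Reason:  "PodsReadyTimeoutExceeded", // illustrative dedicated reason
		Message: "The PodsReady timeout was exceeded too many times in a row",
	})
	wl.Spec.Active = ptr.To(false)
}
```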

@tenzen-y
Member Author

In addition to the lastRequeueState, I was wondering if it's worth also adding an Evicted condition right before deactivating the Workload? That could hold a proper reason. Otherwise, if we just deactivate, this would create its own Evicted condition with a reason Deactivated

Yeah, I also think it would be worth having a dedicated reason.
