Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix scheduler version skew handling for in-place pod resize #118843

Closed
wants to merge 1 commit into from

Conversation

wlp1153468871
Copy link
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

If PodStatus.Resize field is not empty, scheduler uses max(ResourcesAllocated, Requests) even if the feature-gate is disabled.

Which issue(s) this PR fixes:

Fixes #117767

Special notes for your reviewer:

Does this PR introduce a user-facing change?

 NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources#version-skew-strategy

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 24, 2023
@wlp1153468871 wlp1153468871 force-pushed the wlp0624 branch 2 times, most recently from ac1080f to 3687c64 Compare June 24, 2023 07:39
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 24, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wlp1153468871
Once this PR has been reviewed and has the lgtm label, please assign deads2k, sanposhiho for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -720,7 +720,7 @@ func TestPodResourceRequests(t *testing.T) {
{
description: "resized, infeasible",
expectedRequests: v1.ResourceList{
v1.ResourceCPU: resource.MustParse("2"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we changing these resource size in tests?🤔️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the chapter Version Skew Strategy of the KEP-1287, when kube-scheduler computes the pod resource requests, it should use max(AllocatedResources, Requests) if PodStatus.Resize != "".

In this test case, the PodStatus.Resize is equal to Infeasible and the AllocatedResources is less than Requests, so the expected result is Requests. I'm not sure that this change is reasonable when the PodStatus.Resize is equal to Infeasible. If I misunderstood, please correct me. Thanks.

FYI:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, sgtm

@wlp1153468871
Copy link
Member Author

@vinaykul Can you help me review a PR? Thank you

@alexzielenski
Copy link
Contributor

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 27, 2023
@alexzielenski
Copy link
Contributor

untagged mistakenly

/sig api-machinery
/triage accepted
/cc @deads2k

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 27, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 8, 2023
@wlp1153468871
Copy link
Member Author

@mikedanese @sttts Can you help me review a PR? Thank you

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 12, 2023
@vinaykul
Copy link
Contributor

@wlp1153468871 Thank you very much for picking this up! Could we please fix both kubelet and scheduler fix in a single PR so that it fully resolves the issue?

We should add an e2e test for this.. maybe we can extend test/e2e/node/pod_resize.go
In order to exercise InPlacePodVerticalScaling feature-gate enabled in apiserver but disabled in kubelet and scheduler, one way might be to clone this CI job into two other CI jobs (maybe with 72 hour cadence) and set this FG to false using --feature-gates flag passed via KUBELET_TEST_ARGS and SCHEDULER_TEST_ARGS.

@bobbypage @BenTheElder Does that above suggestion look reasonable? (Will setting the --feature-gates flag override for kubelet & scheduler? If not, would the inverse work?)

I'm also wondering if an e2e skew testing CI job already exists. It sounds a boilerplate version skew test job for our version skew support matrix is something many alpha and beta features could use, and in general to detect skew regressions. I do see references to --check-version-skew flag in test-infra repo but my git grep shows them all set to false.

@wlp1153468871
Copy link
Member Author

@wlp1153468871 Thank you very much for picking this up! Could we please fix both kubelet and scheduler fix in a single PR so that it fully resolves the issue?

We should add an e2e test for this.. maybe we can extend test/e2e/node/pod_resize.go In order to exercise InPlacePodVerticalScaling feature-gate enabled in apiserver but disabled in kubelet and scheduler, one way might be to clone this CI job into two other CI jobs (maybe with 72 hour cadence) and set this FG to false using --feature-gates flag passed via KUBELET_TEST_ARGS and SCHEDULER_TEST_ARGS.

@bobbypage @BenTheElder Does that above suggestion look reasonable? (Will setting the --feature-gates flag override for kubelet & scheduler? If not, would the inverse work?)

I'm also wondering if an e2e skew testing CI job already exists. It sounds a boilerplate version skew test job for our version skew support matrix is something many alpha and beta features could use, and in general to detect skew regressions. I do see references to --check-version-skew flag in test-infra repo but my git grep shows them all set to false.

Thank you very much for your reply. However, I have noticed that pbialon has already claimed the task related to kubelet. I would be grateful to hear your thoughts on this matter, @bobbypage @BenTheElder

@wlp1153468871 wlp1153468871 reopened this Jul 18, 2023
@k8s-ci-robot
Copy link
Contributor

@wlp1153468871: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-integration 727d8ce link true /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vinaykul
Copy link
Contributor

Thank you very much for your reply. However, I have noticed that pbialon has already claimed the task related to kubelet. I would be grateful to hear your thoughts on this matter, @bobbypage @BenTheElder

That's ok, we can have two PRs. The core code change is quite simple in this but the idea is for new contributors to get a broader view of the upgrade scenario considerations and how it can be exercised in CI so it two contributors get to learn, twice as better.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2023
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 4, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FG:InPlacePodVerticalScaling] Implement version skew handling for in-place pod resize
7 participants