scheduler-perf: create any object from YAML #122419
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 41bfa9f to 76fbf73
/retest
Force-pushed from 76fbf73 to 616621f
Force-pushed from 616621f to 1dd2d01
/hold
To avoid automatic merge on LGTM.
/assign
// The type of that context is still context.Context because replacing
// it with TContext breaks tests which use `WithCancel`.
//
// TODO(pohly): change all of that code together with changing the return type.
I have a follow-up commit that I intend to submit in a separate PR which does that across the code base:
$ git log -p -n1 20a10317bbb1f81580f532d85aa94c46539233c2 | diffstat
pkg/controller/disruption/disruption_test.go | 12
pkg/kubelet/config/config_test.go | 60 ---
test/integration/apiserver/apiserver_test.go | 39 --
test/integration/apiserver/certreload/certreload_test.go | 20 -
test/integration/apiserver/flowcontrol/concurrency_test.go | 9
test/integration/apiserver/flowcontrol/concurrency_util_test.go | 11
test/integration/apiserver/max_json_patch_operations_test.go | 11
test/integration/apiserver/max_request_body_bytes_test.go | 46 +-
test/integration/apiserver/openapi/openapi_enum_test.go | 8
test/integration/apiserver/openapi/openapiv3_test.go | 14
test/integration/apiserver/podlogs/podlogs_test.go | 11
test/integration/apiserver/watchcache_test.go | 34 -
test/integration/auth/accessreview_test.go | 26 -
test/integration/auth/auth_test.go | 74 +---
test/integration/auth/bootstraptoken_test.go | 8
test/integration/auth/dynamic_client_test.go | 9
test/integration/auth/rbac_test.go | 48 +-
test/integration/auth/selfsubjectreview_test.go | 22 -
test/integration/auth/svcaccttoken_test.go | 78 ++--
test/integration/controlplane/synthetic_controlplane_test.go | 7
test/integration/daemonset/daemonset_test.go | 18 -
test/integration/defaulttolerationseconds/defaulttolerationseconds_test.go | 10
test/integration/disruption/disruption_test.go | 92 ++---
test/integration/dualstack/dualstack_endpoints_test.go | 30 -
test/integration/dualstack/dualstack_test.go | 180 ++++------
test/integration/evictions/evictions_test.go | 88 ++--
test/integration/examples/webhook_test.go | 12
test/integration/framework/test_server.go | 1
test/integration/network/services_test.go | 60 +--
test/integration/scheduler_perf/scheduler_test.go | 2
test/integration/serviceaccount/service_account_test.go | 56 +--
test/integration/servicecidr/allocator_test.go | 20 -
test/integration/util/util.go | 9
test/utils/ktesting/ktesting.go | 8
34 files changed, 447 insertions(+), 686 deletions(-)
Overall the code becomes shorter. The changes are like this one:
diff --git a/pkg/controller/disruption/disruption_test.go b/pkg/controller/disruption/disruption_test.go
index 3e9fb4e4fd3..223ac8dc3d8 100644
--- a/pkg/controller/disruption/disruption_test.go
+++ b/pkg/controller/disruption/disruption_test.go
@@ -1538,12 +1538,10 @@ func TestStalePodDisruption(t *testing.T) {
}
for name, tc := range cases {
t.Run(name, func(t *testing.T) {
- _, ctx := ktesting.NewTestContext(t)
- ctx, cancel := context.WithCancel(ctx)
- defer cancel()
- dc, _ := newFakeDisruptionControllerWithTime(ctx, now)
- go dc.Run(ctx)
- if _, err := dc.coreClient.CoreV1().Pods(tc.pod.Namespace).Create(ctx, tc.pod, metav1.CreateOptions{}); err != nil {
+ tCtx := ktesting.Init(t)
+ dc, _ := newFakeDisruptionControllerWithTime(tCtx, now)
+ go dc.Run(tCtx)
+ if _, err := dc.coreClient.CoreV1().Pods(tc.pod.Namespace).Create(tCtx, tc.pod, metav1.CreateOptions{}); err != nil {
t.Fatalf("Failed to create pod: %v", err)
}
dc.clock.Sleep(tc.timePassed)
@@ -1552,7 +1550,7 @@ func TestStalePodDisruption(t *testing.T) {
}
diff := ""
if err := wait.Poll(100*time.Millisecond, wait.ForeverTestTimeout, func() (bool, error) {
- pod, err := dc.kubeClient.CoreV1().Pods(tc.pod.Namespace).Get(ctx, tc.pod.Name, metav1.GetOptions{})
+ pod, err := dc.kubeClient.CoreV1().Pods(tc.pod.Namespace).Get(tCtx, tc.pod.Name, metav1.GetOptions{})
if err != nil {
t.Fatalf("Failed getting updated pod: %v", err)
}
The decision to use tCtx instead of ctx becomes part of our "coding style". It would work to not rename the variable, but then it's less clear when a ctx is a context.Context and when it is a ktesting.TContext. I prefer using tCtx.
Didn't review the last commit yet.
// createResourceClassOpType customizes createOp for creating a ResourceClass.
👍
create := func() error {
	mapping, err := restMapper.RESTMapping(gk, gv.Version)
	if err != nil {
		// Cached mapping might be stale, refresh on next try.
Can we directly return here? Because RESTMapping already does the same thing.
We have to call restMapper.Reset() before returning. Otherwise restMapper will never refresh itself.
Calling Reset() here is harmless, but doesn't RESTMapping already cover your scenario? It calls Reset when it fails to find the resource. (Admittedly, I haven't used this function much...)
func (d *DeferredDiscoveryRESTMapper) RESTMapping(gk schema.GroupKind, versions ...string) (m *meta.RESTMapping, err error) {
del, err := d.getDelegate()
if err != nil {
return nil, err
}
m, err = del.RESTMapping(gk, versions...)
if err != nil && !d.cl.Fresh() {
d.Reset()
m, err = d.RESTMapping(gk, versions...)
}
return
}
Note the !d.cl.Fresh(): this depends on the discovery.CachedDiscoveryInterface implementation detecting "stale" data. If I remember correctly, it doesn't do that just because some result is not found, which is why the Reset in the code above was needed.
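The interplay can be illustrated with a small self-contained sketch. The types below are toy stand-ins modeled on DeferredDiscoveryRESTMapper, not the real client-go implementation: the built-in retry only fires when the cache already reports itself as stale, so a caller that knows the data may be outdated must call Reset() itself.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy stand-in for the cached discovery client (illustrative names only).
type cache struct {
	fresh bool
	data  map[string]string // kind -> resource
}

func (c *cache) Fresh() bool { return c.fresh }
func (c *cache) Reset()      { c.fresh = false } // mark stale; next lookup refetches

type mapper struct{ cl *cache }

var errNotFound = errors.New("no match for kind")

func (m *mapper) lookup(kind string) (string, error) {
	if r, ok := m.cl.data[kind]; ok {
		return r, nil
	}
	return "", errNotFound
}

// RESTMapping mimics DeferredDiscoveryRESTMapper: it retries after a refresh,
// but only when the cache already reports itself as stale (!Fresh()).
func (m *mapper) RESTMapping(kind string) (string, error) {
	r, err := m.lookup(kind)
	if err != nil && !m.cl.Fresh() {
		m.refresh()
		r, err = m.lookup(kind)
	}
	return r, err
}

func (m *mapper) refresh() {
	// Simulate refetching discovery data from the server.
	m.cl.data["ResourceClass"] = "resourceclasses"
	m.cl.fresh = true
}

func main() {
	m := &mapper{cl: &cache{fresh: true, data: map[string]string{}}}
	// Cache claims to be fresh but is stale: the built-in retry does not fire.
	_, err := m.RESTMapping("ResourceClass")
	fmt.Println("first try:", err)
	// A caller-side Reset marks the cache stale...
	m.cl.Reset()
	// ...so the next RESTMapping call refreshes and retries.
	r, err := m.RESTMapping("ResourceClass")
	fmt.Println("after Reset:", r, err)
}
```

This is why the comment thread concludes that the explicit restMapper.Reset() in the loop is needed: without it, a "fresh" cache that simply lacks the new resource would fail forever.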
createNamespacesOpcode     operationCode = "createNamespaces"
createPodsOpcode           operationCode = "createPods"
createPodSetsOpcode        operationCode = "createPodSets"
createResourceClaimsOpcode operationCode = "createResourceClaims"
Can we remove this as well? Or can't it be created from YAML?
This needs additional logic which goes beyond "create one object from YAML", therefore this has to stay.
@@ -674,6 +674,11 @@ func RunBenchmarkPerfScheduling(b *testing.B, outOfTreePluginRegistry frameworkr
if !enabled(*perfSchedulingLabelFilter, append(tc.Labels, w.Labels...)...) {
	b.Skipf("disabled by label filter %q", *perfSchedulingLabelFilter)
}
tCtx := ktesting.Init(b)
if !*useTestingLog {
	// Use the normal global logger instead of the per-test logger initialized by Init.
Can we move this into Init? Users won't care much about the implementation.
No, because if we don't use per-test logging, then it has to be done here by reverting the default behavior of ktesting.Init, which is per-test logging.
I probably misunderstood the question: yes, this could be done in ktesting.Init if it had options. So far, this is the only usage where per-test output wasn't desirable, but perhaps there'll be others. I'll add a WithPerTestOutput(enabled bool) and use that here.
I prefer not to put a seldom-used option into the main ktesting package. Therefore I created "ktesting/initoption" and placed PerTestOutput there. Because it's a package specifically for functional options, there is no need for the With prefix. That is used instead for the functions which create a new context.
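The functional-options style being discussed can be sketched in a few lines. The names below (Init, InitOption, PerTestOutput, the config struct) are illustrative stand-ins modeled on the description above, not the real ktesting API:

```go
package main

import "fmt"

// config holds the settings that Init applies; illustrative only.
type config struct{ perTestOutput bool }

// InitOption mirrors the idea of a dedicated options package: plain option
// constructors without a "With" prefix.
type InitOption func(*config)

// PerTestOutput enables or disables the per-test logger.
func PerTestOutput(enabled bool) InitOption {
	return func(c *config) { c.perTestOutput = enabled }
}

// Init applies options on top of the defaults; per-test output is the default.
func Init(opts ...InitOption) config {
	c := config{perTestOutput: true}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Println(Init().perTestOutput)                     // default behavior
	fmt.Println(Init(PerTestOutput(false)).perTestOutput) // the benchmark case
}
```

The benefit of this pattern is that rarely used knobs stay out of the main package's surface while Init keeps a stable signature.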
@kerthcet: can you perhaps take another look? I just pushed.
ctx, cancel := context.WithTimeout(ctx, 30*time.Minute)
b.Cleanup(cancel)
timeout := 30 * time.Minute
tCtx = ktesting.WithTimeout(tCtx, timeout, fmt.Sprintf("timed out after the %s per-test timeout", timeout))
tCtx = ktesting.WithTimeout(tCtx, timeout, fmt.Sprintf("timed out after the %s per-test timeout", timeout))
tCtx = ktesting.WithTimeout(tCtx, timeout, fmt.Sprintf("timed out after the %s per-test", timeout))
This would lead to a message saying "timed out after the 30s per-test".
I think we should keep "per-test timeout" in the message.
case <-ctx.Done():
	tb.Fatalf("op %d: %v", opIndex, ctx.Err())
case <-tCtx.Done():
	tCtx.Fatalf("op %d: %v", opIndex, context.Cause(tCtx))
Q: why not implement the same method?
The same method as what? Sorry, I don't follow.
Sorry, I mixed it up with tCtx; forget this.
// No alpha APIs (overrides api/all=true in https://github.com/kubernetes/kubernetes/blob/d647d19f6aef811bace300eec96a67644ff303d4/staging/src/k8s.io/apiextensions-apiserver/pkg/cmd/server/testing/testserver.go#L136),
// except for DRA API group when needed.
runtimeConfig := []string{"api/alpha=false"}
if enabledFeatures[features.DynamicResourceAllocation] {
Specifying the features is ugly, but we can improve that later since it's not related to this PR.
util.StartFakePVController(ctx, client, informerFactory)
runGC := util.CreateGCController(ctx, tb, *cfg, informerFactory)
runNS := util.CreateNamespaceController(ctx, tb, *cfg, informerFactory)
_, informerFactory := util.StartScheduler(tCtx, tCtx.Client(), cfg, config, outOfTreePluginRegistry)
We have several duplicated parameters here.
Yes, there's definitely room for future refactoring/simplification of test/integration/util to get rid of redundant parameters like this one here. I'm gradually going to do more of those, but wanted to start small(ish) initially.
go runGC()
go runNS()
go runResourceClaimController()

return informerFactory, client, dynClient
return informerFactory, tCtx
I'm wondering whether we could avoid returning the ctx here, but that definitely needs some refactoring of TContext.
mustSetupCluster gets a TContext without Kubernetes clients and creates a new TContext which has them, so returning that here is okay.
TContext instances are intentionally immutable, i.e. you cannot change an instance by changing some of its attributes. That is the same design that logr.Logger and context.Context use. I think it makes sense and has some nice properties (no race conditions although there is no locking, prevents callees from changing something that is used by their caller, "composing" contexts by wrapping a base context).
Makes a lot of sense to me.
Force-pushed from 3a86200 to 74c8366
The new TContext interface combines a normal context and the testing interface, then adds some helper methods. The context gets canceled when the test is done, but that can also be requested earlier via Cancel. The intended usage is to pass a single `tCtx ktesting.TContext` parameter around in all helper functions that get called by a unit or integration test. Logging is also more useful: Log[f] and Fatal[f] output is prefixed with "[FATAL] ERROR: " to make it stand out more from regular log output. If this approach turns out to be useful, it could be extended further (for example, with a per-test timeout) and might get moved to a staging repository to enable usage of it in other staging repositories. To allow other implementations besides testing.T and testing.B, a custom ktesting.TB interface gets defined with the methods expected from the actual implementation. One such implementation can be ginkgo.GinkgoT().
ktesting.TContext combines several different interfaces. This makes the code simpler because fewer parameters need to be passed around. An intentional side effect is that the apiextensions client interface becomes available, which makes it possible to use CRDs. This will be needed for future DRA tests. Support for CRDs depends on starting the apiserver via k8s.io/kubernetes/cmd/kube-apiserver/app/testing because only that enables the CRD extensions. As discussed on Slack, the long-term goal is to replace the in-tree StartTestServer with the one in staging, so this is going in the right direction.
With a dynamic client and a rest mapper it is possible to load arbitrary YAML files and create the objects defined by them. This is simpler than adding specific Go code for each supported type. Because the version now matters, the incorrect versions in the DRA YAMLs were found and fixed.
Force-pushed from 74c8366 to da0c9a9
Only one question, otherwise LGTM for scheduler.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kerthcet, pohly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Sorry, I thought I'd have time to review this during my holiday, but things always get out of control.
now := time.Now()

cancelCtx, cancel := context.WithCancelCause(ctx)
after := time.NewTimer(timeout)
Should we use the sooner of the two deadlines (the one below), rather than the given timeout?
There's no need for that, because if the underlying ctx has a deadline sooner than the new timeout, then the resulting context will get canceled via the underlying ctx; wrapping it doesn't remove that. Below we just need to be more careful because the Deadline implementation is the one provided by the new context.
if deadlineOK {
	if deadline, ok := deadlineTB.Deadline(); ok {
		timeLeft := time.Until(deadline)
		timeLeft -= CleanupGracePeriod
What will happen if timeLeft is negative here? And why not add the CleanupGracePeriod here? That way, on timeout we'd have an additional grace period for cleanup.
What will happen if timeLeft is negative here?
withTimeout and its underlying time.NewTimer allow a negative time left; the timer simply triggers immediately.
Why not add the cleanupGracePeriod here? That way, on timeout we'd have an additional grace period for cleanup.
We need to stop the normal test code CleanupGracePeriod before the final deadline (hence the subtraction here). Then, if the test code reacts immediately to the context cancellation, there will be CleanupGracePeriod left for the registered cleanup callbacks.
cancelCtx, cancel := context.WithCancelCause(ctx)
after := time.NewTimer(timeout)
stopCtx, stop := context.WithCancel(ctx) // Only used internally, doesn't need a cause.
Any special reason to have another cancel context here? cancelCtx seems to be enough. I'm okay leaving it; I just want to know whether there's a reason I'm missing.
Without stopCtx, the anonymous function below would block only on after.C. It could also block on the parent ctx.Done(), but both of these have the same problem: they don't trigger when the test is complete, which would leave the goroutine running after the test is done.
We need the anonymous function for the situation where the timeout triggers while the test is still running.
For scheduling, we have some refactoring work left, e.g. removing the duplicated parameters; otherwise, LGTM. For ktesting, I just have several questions. Would you like someone else to take another look, or shall we just go forward since it's only used in scheduler-perf right now? @pohly
@kerthcet: your review of ktesting was very thorough, thanks! Assuming that I have answered your question sufficiently, I suggest that we simply move ahead and merge it without further reviews. Your formal lgtm will be sufficient.
/lgtm
LGTM label has been added. Git tree hash: 61d0dba393d9824bb7d220e9ff8a6ae0a42d0274
Let's merge it.
/hold cancel
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
With a dynamic client and a rest mapper it is possible to load arbitrary YAML
files and create the object defined by it. This is simpler than adding specific
Go code for each supported type.
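The create-from-manifest flow can be sketched with the standard library alone. Everything here is an illustrative stand-in: JSON replaces YAML (both decode to the same map shape), a plain map replaces the RESTMapper, and instead of calling a dynamic client the function just reports the request path it would use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// manifest is a hypothetical object of the kind this PR creates from YAML.
const manifest = `{
  "apiVersion": "resource.k8s.io/v1alpha2",
  "kind": "ResourceClaim",
  "metadata": {"name": "claim-0", "namespace": "test"}
}`

// restMapping maps "apiVersion, Kind" to the REST resource name, standing in
// for what a RESTMapper derives from discovery data.
var restMapping = map[string]string{
	"resource.k8s.io/v1alpha2, ResourceClaim": "resourceclaims",
}

func createFromManifest(doc string) (string, error) {
	var obj map[string]any
	if err := json.Unmarshal([]byte(doc), &obj); err != nil {
		return "", err
	}
	apiVersion, _ := obj["apiVersion"].(string)
	kind, _ := obj["kind"].(string)
	resource, ok := restMapping[apiVersion+", "+kind]
	if !ok {
		return "", fmt.Errorf("no mapping for %s %s", apiVersion, kind)
	}
	meta, _ := obj["metadata"].(map[string]any)
	name, _ := meta["name"].(string)
	ns, _ := meta["namespace"].(string)
	// A real implementation would issue the create via the dynamic client;
	// here we just build the request path it would target.
	path := strings.Join([]string{"/apis", apiVersion, "namespaces", ns, resource, name}, "/")
	return path, nil
}

func main() {
	path, err := createFromManifest(manifest)
	fmt.Println(path, err)
}
```

This also shows why the apiVersion now matters: the mapping lookup is keyed on it, which is how the incorrect versions in the DRA YAMLs surfaced.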
Special notes for your reviewer:
This is a more flexible alternative to #122391.
Because the version now matters, the incorrect versions in the DRA YAMLs were found and fixed.
Does this PR introduce a user-facing change?
/assign @kerthcet