Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add CustomResourceMonitor CRD (WIP) #2049

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

CatherineF-dev
Copy link
Contributor

@CatherineF-dev CatherineF-dev commented Apr 17, 2023

What this PR does / why we need it:
What: A more user-friendly UX to collect custom resource metrics.
Why:

  1. Using CustomResourceState monitoring feature is not user-friendly. Current ways on configuring CustomResourceState monitoring are:
  • --custom-resource-state-config "inline yaml (see example)" or
  • --custom-resource-state-config-file /path/to/config.yaml
  1. Current CustomResourceState monitoring feature doesn't support multiple configuration files.
    For example, for a company with 10 teams, each team wants to collect CustomResourceState metrics for their owned Custom Resources.

See more in design doc.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality) no

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2050 #1948

Design doc is ready to review while codes are proof-of-concepts.
Tested: 1. apply examples/cr.yaml 2. apply examples/cr2.yaml 3. delete two yamls.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 17, 2023
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 17, 2023
@CatherineF-dev
Copy link
Contributor Author

fyi: @chrischdi

@CatherineF-dev CatherineF-dev changed the title Add Custom Resource Metrics CRD Add CustomResourceMonitor CRD Apr 17, 2023
@CatherineF-dev CatherineF-dev changed the title Add CustomResourceMonitor CRD [WIP] Add CustomResourceMonitor CRD Apr 17, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2023
@CatherineF-dev CatherineF-dev force-pushed the custom-resource-crd branch 4 times, most recently from 0169068 to 88613ae Compare April 17, 2023 21:53
@rexagod
Copy link
Member

rexagod commented Apr 18, 2023

Hey, thank you for working on this. Just a question right off the top of my mind, would this be better suited to be in a KSM operator? I remember a discussion started my @mrueg in the room highlighting the need for this feature a while back but we eventually held off on adding it for the sole reason of setting up an operator first, IIRC.

@chrischdi
Copy link
Member

Hey, thank you for working on this. Just a question right off the top of my mind, would this be better suited to be in a KSM operator? I remember a discussion started my @mrueg in the room highlighting the need for this feature a while back but we eventually held off on adding it for the sole reason of setting up an operator first, IIRC.

Let's move that discussion over to the issue itself #1948, happy to collect the facts and opinions there to properly document a decision :-)

Thanks @CatherineF-dev , I'll try to figure out time next week after kubecon to take a look at it 👍 🥳

@CatherineF-dev
Copy link
Contributor Author

CatherineF-dev commented Apr 18, 2023

Hey, thank you for working on this. Just a question right off the top of my mind, would this be better suited to be in a KSM operator? I remember a discussion started my @mrueg in the room highlighting the need for this feature a while back but we eventually held off on adding it for the sole reason of setting up an operator first, IIRC.

  1. The reason why for this PR is that coupled custom resources metrics targets and kube-state-metrics deployment causes managing custom resource configuration not easy. And then I found a CRD can handle this issue and also solve Read custom resource metrics configuration from a Custom Resource Definition #1948.

Currently, one of these flags is added into kube-state-metrics deployment.

  • --custom-resource-state-config "inline yaml (see example)" or
  • --custom-resource-state-config-file /path/to/config.yaml

For example, for a company with two teams (monitoring platform team and application team), application team needs to change kube-state-metrics deployment managed by monitoring platform team.

  1. This isn't make KSM as KSM operator fully. It won't add CRD for non custom resource metrics. I agree that CRD makes control KSM memory harder, since new metrics might be added overtime. A flag --custom_resources_ksm_cr_watched_labels is added to handle this case.

  2. KSM already watches pods/endpoints and reconciles around autosharding. This PR only adds one additional watch on CustomResourceMonitor CR and update custom metrics store dynamically.


```yaml
# example cr
apiVersion: customresource.ksm.io/v1alpha1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find *.k8s.io needs additional approvals. Planned to use customresource.ksm.k8s.io before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer *.k8s.io group for the CR given we exist under the same organization here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2023
@CatherineF-dev
Copy link
Contributor Author

Hi @dgrisonnet, could you help review this design doc when you're available? Thanks!

Related issues #2050 #1948

@dgrisonnet
Copy link
Member

Sure, I'll try to catch up on the various discussion asap

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/assign

@CatherineF-dev CatherineF-dev force-pushed the custom-resource-crd branch 2 times, most recently from ad5da8a to b41f7d0 Compare June 25, 2023 14:31
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2023
@CatherineF-dev CatherineF-dev changed the title [WIP] Add CustomResourceMonitor CRD feat: add CustomResourceMonitor CRD (WIP) Jun 25, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2023
@dgrisonnet
Copy link
Member

/triage accepted
/assign

Apart from existing two flags (`--custom-resource-state-config ` and `--custom-resource-state-config-file`), these three flags will be added:
* `--custom_resource_monitor_enabled`: whether watch CustomResourceMonitor CRs or not.
* `--custom_resource_monitor_labels`: only watch CustomResourceMonitor with label=x. It's used to avoid double custom metrics collection when multiple kube-state-metrics are installed.
* `--custom_resource_monitor_namespaces`: only watch CustomResourceMonitor under namespaces=x.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider a denylist for namespaces as well?
In case someone has a "platform KSM" (only selected namespaces) and a "product KSM" (everything else)?

@CatherineF-dev CatherineF-dev force-pushed the custom-resource-crd branch 2 times, most recently from e30e223 to c2f9fee Compare July 9, 2023 19:49
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2023
@guettli
Copy link

guettli commented Sep 4, 2023

@CatherineF-dev I think it would be createa to have a CustomResourceMonitor CRD. Are you still working on this PR?

If you think this PR is not needed any more, because you found an alternative solution, please let me know. I guess we have the same goal.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2024
@rexagod
Copy link
Member

rexagod commented Jan 21, 2024

@CatherineF-dev Are you working on this? I see how this could be a large-enough issue entailing multiple cases to warrant more hands, and if so, feel free to bring this up in the next SIG call and we can have more folks contributing to your fork, if that sounds good!

@CatherineF-dev
Copy link
Contributor Author

CatherineF-dev commented Jan 22, 2024

Yes, from previous discussion, I remember we plan to split KSM and KSM Custom Metrics into 2 binaries first. Then add CRD feature.

So I haven't resolved conflicts in this PR. Waiting 2 binaries.

(This prototype was working in my k8s cluster)

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2024
@rexagod
Copy link
Member

rexagod commented Feb 25, 2024

I believe we can get this in regardless, and include this change whenever the split happens. Getting this one in would resolve a lot of issues, in addition to deserving its own minor release.

I'm +1 for making this happen pre-split, since we don't plan on doing that in the immediate future.

@rexagod
Copy link
Member

rexagod commented Feb 25, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 15, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decouple custom resource config with kube-state-metrics deployment
10 participants