KEP: NodeFeatureGroup API (CRD) #1423

Closed
ArangoGutierrez opened this issue Oct 20, 2023 · 28 comments · Fixed by #1487
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster.

Comments

@ArangoGutierrez
Contributor

ArangoGutierrez commented Oct 20, 2023

Summary

The Kubernetes cluster object doesn't expose all available features in a programmatic way.

When working in a multi-cluster environment (for example, kcp or HyperShift), the central control plane cannot access all the available features on each cluster, making it hard to make scheduling and management decisions.

Node-Feature-Discovery does a good job of exposing a per-node feature inventory, but querying each cluster on a per-node basis can be a network-intensive task. Various use cases have been identified where having a cluster inventory would facilitate operations at the cluster management level.

This KEP proposes that NFD expose an inventory of available features in the cluster via a new API (CRD).

Goals

  • make the information about all clusters easy to query via a centralised API
  • expose cluster-wide features currently not reported by NFD in a per-node scenario, e.g. network config

Non-Goals

  • change existing behaviour at the node level
  • become a multi-cluster management tool; this API only exposes NFD-discovered features via a single API (CRD)

Proposal

User Stories

Story 1

As a platform engineer, I want to know the available features on each cluster registered on my network so that I can make optimal, platform-specific scheduling decisions.

Story 2

As a system administrator, I want a single API to know the available features of each cluster on the network.

resource allocations.

CRD API

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodeFeatureGroup resource holds the features discovered for all nodes in a
// cluster.
// +kubebuilder:object:root=true
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// +genclient
type NodeFeatureGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec defines the feature groups and cluster-wide features of the cluster.
	Spec NodeFeatureGroupSpec `json:"spec"`
}

// NodeFeatureGroupSpec describes a NodeFeatureGroup object.
type NodeFeatureGroupSpec struct {
	// FeatureGroup is a set of objects grouped by specific features.
	// +optional
	FeatureGroup []FeatureGroupSpec `json:"featureGroup,omitempty"`
	// ClusterFeatures is the set of cluster-wide features that are not reported at the node level.
	// +optional
	ClusterFeatures []ClusterFeatures `json:"clusterFeatures,omitempty"`
}
@ArangoGutierrez ArangoGutierrez added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 20, 2023
@ArangoGutierrez
Contributor Author

/assign

@zanetworker

zanetworker commented Oct 20, 2023

@ArangoGutierrez Do you mind clarifying what you mean by "make the information about all clusters easy to query via a centralised API"?

I am trying to understand what it is you are looking for. There are multiple multi-cluster management tools already available. Is this a feature request, or do you have a specific product/project in mind that you want to extend?

@ArangoGutierrez
Contributor Author

@ArangoGutierrez Do you mind clarifying what you mean by "make the information about all clusters easy to query via a centralised API"?

I am trying to understand what it is you are looking for. There are multiple multi-cluster management tools already available. Is this a feature request, or do you have a specific product/project in mind that you want to extend?

Sure, as I said

Node-Feature-Discovery does a good job of exposing a per-node feature inventory, but querying each cluster on a per-node basis can be a network-intensive task.

When I refer to a single API, I mean that instead of having to query every node for its NFD labels, the new ClusterFeature CRD will act as an aggregator. Application developers can have a controller watch for events on a single CRD and be informed when the cluster gets a new node, along with the features of that node. This is a new CRD exposed by NFD to group discovered features; it is by no means yet another multi-cluster management tool.
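
The aggregation idea can be sketched roughly as follows: take the labels of every node and group node names by NFD feature label. This is only an illustration using the Go standard library, not the NFD implementation. The `aggregate` function, the node names, and the sample labels are assumptions; the only thing taken from NFD itself is the `feature.node.kubernetes.io/` label namespace.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nfdPrefix is the label namespace NFD uses for node feature labels.
const nfdPrefix = "feature.node.kubernetes.io/"

// aggregate groups node names by NFD label, keyed as "feature=value".
// Labels outside the NFD namespace are ignored. The function name and the
// key format are illustrative assumptions.
func aggregate(nodeLabels map[string]map[string]string) map[string][]string {
	groups := make(map[string][]string)
	for node, labels := range nodeLabels {
		for k, v := range labels {
			if !strings.HasPrefix(k, nfdPrefix) {
				continue
			}
			key := strings.TrimPrefix(k, nfdPrefix) + "=" + v
			groups[key] = append(groups[key], node)
		}
	}
	// Sort node names so the result is deterministic.
	for key := range groups {
		sort.Strings(groups[key])
	}
	return groups
}

func main() {
	// Sample data only; real labels depend on the hardware and NFD config.
	nodes := map[string]map[string]string{
		"node-a": {nfdPrefix + "cpu-cpuid.AVX2": "true", "kubernetes.io/os": "linux"},
		"node-b": {nfdPrefix + "cpu-cpuid.AVX2": "true", nfdPrefix + "pci-0300_10de.present": "true"},
	}
	groups := aggregate(nodes)

	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s -> %v\n", k, groups[k])
	}
}
```

A controller watching the single aggregated object would then only need to compare two such maps to learn that a node with a given feature joined or left the cluster.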

@berenss

berenss commented Oct 20, 2023

would there be a way that this work intersects with the open-cluster-management project? https://open-cluster-management.io/

@ArangoGutierrez
Contributor Author

would there be a way that this work intersects with the open-cluster-management project? https://open-cluster-management.io/

Hey!
No, it won't. NFD is not a cluster management tool; our aim is to provide an easy, programmatic way to expose all features via CRDs, labels, and annotations, so that developers and users can act on them.
The ClusterFeature CRD will basically be a cluster-wide inventory of available resources. The resources NFD discovers are in addition to those advertised to the kubelet. We want to be able to host a cluster-wide inventory of specific features, such as GPU type or CPU features like TDX or SMP, which are not exposed by default Kubernetes tools.

@berenss

berenss commented Oct 20, 2023

right! let me get more specific,
within o-c-m project, the placement would intersect quite nicely with NFD to allow workloads to land upon nodes with specific features. in other words, the scaffolding is already there for NFD to become a first class provider into o-c-m's placement. perhaps I need to work this connection from the other side and introduce o-c-m to NFD
https://open-cluster-management.io/scenarios/distribute-workload-with-placement/

@berenss

berenss commented Oct 20, 2023

I also got a bit of a heads-up from an engineer that works in o-c-m, and he shared some additional insights

It sounds kind of similar to the Cluster Inventory project that @qiujian16 has been pushing for. It was presented in the sig mc for a few rounds and finally got the go ahead from the sig mc chairs.
The repo: https://github.com/kubernetes-sigs/cluster-inventory-api
@qiujian16's KEP in his personal repo for now: https://github.com/qiujian16/k8s-enhancements/tree/cluster-inventory/keps/sig-multicluster/cluster-inventory
SIG-MC kep draft presentation: https://docs.google.com/document/d/1sUWbe81BTclQ4Uax3flnCoKtEWngH-JA9MyCqljJCBM/

@ArangoGutierrez
Contributor Author

right! let me get more specific, within o-c-m project, the placement would intersect quite nicely with NFD to allow workloads to land upon nodes with specific features. in other words, the scaffolding is already there for NFD to become a first class provider into o-c-m's placement. perhaps I need to work this connection from the other side and introduce o-c-m to NFD https://open-cluster-management.io/scenarios/distribute-workload-with-placement/

Hey! We would love to help introduce o-c-m to NFD.
cc @marquiz. We are always looking to help with NFD as much as we can.

@qiujian16

To my understanding, this is to collect features in a cluster and have a singleton API in this cluster that summarizes all the features from the nodes? This is a bit different from cluster-inventory-api, since the latter requires a cluster management control plane. However, I think there is another project from sig-mc that seems similar (https://github.com/kubernetes-sigs/about-api), which introduces a ClusterProperty API to expose arbitrary properties of the cluster.

@marquiz
Contributor

marquiz commented Oct 23, 2023

A lot of action in this space. A few random thoughts from an uneducated person:

  • The O-C-M looks cool. Could be nice in providing a centralized place to store/query cluster info in the hub cluster
  • The about-api looks very sketchy, with basically only one property (string value) per API object. For our purposes we'd need to enhance/extend the API.
  • The cluster-inventory API could be used with O-C-M(?)

@alculquicondor

I was tagged in slack about this :)
@mwielgus do you think there could be any implications/simplifications here for Multi cluster Kueue?

@qiujian16

qiujian16 commented Oct 24, 2023

A lot of action in this space. A few random thoughts from an uneducated person:

  • The cluster-inventory API could be used with O-C-M(?)

yes, that is the plan.

@vsoch

vsoch commented Oct 25, 2023

@ArangoGutierrez I think I understand the change now, and I agree this would be great for Fluence. Do you want any help?

@Sharpz7

Sharpz7 commented Oct 26, 2023

Is this KEP actively being worked on?

If not, happy to get the ball rolling by creating a first draft.

@ArangoGutierrez
Contributor Author

yes, this week we are just getting attention from the community, before working on it

@yevgeny-shnaidman

@ArangoGutierrez Does this KEP concern only the creation of a ClusterFeature CR in the cluster, or will it also try to integrate with cluster management by extending, for example, ClusterClaim?

@embik
Member

embik commented Oct 29, 2023

Hey! 👋🏻 I'm curious about what cluster features would be exposed exactly. kcp (which is mentioned in the initial description and was notified of this) specifically does not deal with compute workloads directly, so aggregation of NFD-discovered features would not relate to it. Because of that, I'm mostly interested in this part:

expose cluster-wide features currently not reported by NFD in a per-node scenario, e.g. network config

Are there any clear ideas what examples there could be beyond network config?

@GeoEducator

GeoEducator commented Oct 30, 2023

All, me here to lead, what is this intersection NFD y'all are talking about, I assume it could reduce k8 costs

What do I need to study to contribute

@ArangoGutierrez
Contributor Author

Non-Goals

  • change existing behaviour at the node level
  • To be a MultiCluster management tool, this API is to expose NFD discovered features via a single API (CRD)

Hi @yevgeny-shnaidman , it is a Non-goal to walk into cluster management territory

@ArangoGutierrez
Contributor Author

Hey! 👋🏻 I'm curious about what cluster features would be exposed exactly. kcp (which is mentioned in the initial description and was notified of this) specifically does not deal with compute workloads directly, so aggregation of NFD-discovered features would not relate to it. Because of that, I'm mostly interested in this part:

expose cluster-wide features currently not reported by NFD in a per-node scenario, e.g. network config

Are there any clear ideas what examples there could be beyond network config?

Hey @embik! We are gathering requests from multiple places.
NFD is a per-node feature discovery solution. To help address the needs of multi-cluster environments, this ClusterFeature/ClusterInventory API should also disclose things that exist at the cluster level. So far we have heard many requests for cluster-wide network config, features, and capabilities, and in the longer term potentially topology (for MPI users).
There are other ideas, such as storage and cluster health.
If you have an idea, please feel free to share!

@yevgeny-shnaidman

@ArangoGutierrez Are we going to add something like NodeFeatureRules for the new CRD? I am guessing it could be useful for determining whether a cluster supports GPU loads, etc.

@ArangoGutierrez
Contributor Author

@ArangoGutierrez Are we going to add something like NodeFeatureRules for the new CRD? I am guessing it could be useful for determining whether a cluster supports GPU loads, etc.

ClusterFeatureRules could be an addition. Initially we aim for a CRD like NodeFeature but at the cluster level; later on we could add ways of modifying it, such as the rules you mention.
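
As a rough sketch of what such a rule might evaluate, the snippet below marks a cluster as matching when at least a given number of nodes carry a particular label value. Nothing here comes from NFD or from the KEP: the `ClusterRule` type, its fields, and the `Matches` method are hypothetical, and the PCI label is just a sample of the kind of label NFD can publish.

```go
package main

import "fmt"

// ClusterRule is a HYPOTHETICAL rule: the cluster matches when at least
// MinNodes nodes carry Label with the given Value.
type ClusterRule struct {
	Name     string
	Label    string
	Value    string
	MinNodes int
}

// Matches counts the nodes whose labels satisfy the rule and compares the
// count with MinNodes. A MinNodes of zero always matches.
func (r ClusterRule) Matches(nodeLabels map[string]map[string]string) bool {
	count := 0
	for _, labels := range nodeLabels {
		if v, ok := labels[r.Label]; ok && v == r.Value {
			count++
		}
	}
	return count >= r.MinNodes
}

func main() {
	// Sample data only; real labels depend on the hardware and NFD config.
	const gpuLabel = "feature.node.kubernetes.io/pci-0300_10de.present"
	nodes := map[string]map[string]string{
		"node-a": {gpuLabel: "true"},
		"node-b": {},
	}
	rule := ClusterRule{Name: "gpu-capable", Label: gpuLabel, Value: "true", MinNodes: 1}
	fmt.Printf("%s: %v\n", rule.Name, rule.Matches(nodes))
}
```

Whether rules would be expressed this way, or reuse the matching syntax of NodeFeatureRule, is an open design question in this thread.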

@marquiz
Contributor

marquiz commented Oct 31, 2023

Yes, maybe this could be a further addition/enhancement if some rule-based aggregation of features would be needed

@RainbowMango
Member

I'm trying to understand what available features this ClusterFeature provides, an example would be great in addition to the API.

@ArangoGutierrez
Contributor Author

I'm trying to understand what available features this ClusterFeature provides, an example would be great in addition to the API.

Hey, sure! You can find all the feature sources NFD discovers and advertises on a per-node basis here -> https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/features.html#table-of-contents

@ArangoGutierrez
Contributor Author

For all those interested I have filed #1487

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 29, 2024
@ArangoGutierrez
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 4, 2024
@ArangoGutierrez ArangoGutierrez changed the title KEP: ClusterFeature API (CRD) KEP: NodeFeatureGroup API (CRD) Mar 26, 2024