
Ensure that pods are scheduled to nodes that meet preferred conditions, while still satisfying the scheduler's Filter plugins. #124844

Open
fanhaouu opened this issue May 13, 2024 · 6 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@fanhaouu

fanhaouu commented May 13, 2024

What would you like to be added?

/sig scheduling
/kind feature

Add a new plugin extension point that checks nodes, and modify the scheduling filter logic to prioritize nodes that satisfy the preferred check conditions by placing them at the beginning of the node array, so that the scheduler considers them first during each scheduling attempt.
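
As a very rough sketch, the new extension point could look something like the following (the interface name, method, and signature are purely illustrative and only modeled on the existing framework plugin interfaces; this is not an actual API):

// Hypothetical extension point; the name and signature are illustrative only.
type CheckPreferredPlugin interface {
    framework.Plugin
    // CheckPreferred reports whether the node satisfies the pod's preferred
    // (soft) constraints, so the scheduler can try such nodes first when
    // running the Filter plugins.
    CheckPreferred(ctx context.Context, pod *v1.Pod, nodeInfo *framework.NodeInfo) bool
}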

If the community feels this requirement is necessary, I will complete the corresponding KEP and code implementation work.

The current solution within our company is as follows, but I believe adding a "check preferred" extension point would be better:
1. Allow users to add an annotation with the key "xxx.k8s.io/preferred-plugin" to a pod. The value of this annotation can be either "NodeAffinity" or "TaintToleration".

2. During scheduling, determine which preferred feature to use based on the annotation value.

NodeAffinity:

checkPreferred = func(node *v1.Node, pod *v1.Pod) bool {
    affinity := pod.Spec.Affinity
    if affinity != nil && affinity.NodeAffinity != nil && affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution != nil {
        // Parse the pod's preferred node-affinity terms.
        terms, err := corev1nodeaffinity.NewPreferredSchedulingTerms(affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution)
        if err != nil {
            klog.ErrorS(err, "Failed to parse pod's node affinity", "pod", klog.KObj(pod))
            return false
        }
        // The node is "preferred" if at least one preferred term matches it.
        if terms != nil && terms.Score(node) > 0 {
            return true
        }
    }
    return false
}

TaintToleration:

checkPreferred = func(node *v1.Node, pod *v1.Pod) bool {
    // Collect only the pod's tolerations with effect PreferNoSchedule.
    var filterTolerations []v1.Toleration
    for _, toleration := range pod.Spec.Tolerations {
        if toleration.Effect != v1.TaintEffectPreferNoSchedule {
            continue
        }
        filterTolerations = append(filterTolerations, toleration)
    }
    if len(node.Spec.Taints) != 0 && len(filterTolerations) != 0 {
        for _, taint := range node.Spec.Taints {
            // Check only taints that have effect PreferNoSchedule.
            if taint.Effect != v1.TaintEffectPreferNoSchedule {
                continue
            }
            // The node is "preferred" if the pod tolerates at least one of
            // its PreferNoSchedule taints.
            if v1helper.TolerationsTolerateTaint(filterTolerations, &taint) {
                return true
            }
        }
    }
    return false
}

3. Divide the nodes into two groups, "passChecked" and "noPassChecked", based on whether they satisfy the preferred check (see the sketch after this list).

4. To keep scheduling probabilities equal across nodes, randomly shuffle both the "passChecked" and "noPassChecked" groups.

5. Rebuild the nodes array by concatenating the two groups, with the "passChecked" nodes placed before the "noPassChecked" nodes.

6. Call the "findNodesThatPassFilters" method to search for feasible nodes in the new nodes array.

7. If "passChecked" is empty, adjust the value of "nextStartNodeIndex"; otherwise, leave it unchanged.
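
A minimal sketch of steps 3 to 5, assuming the candidate nodes are available as a plain []*v1.Node named nodes and using math/rand for the shuffles (the real code would operate on the framework's NodeInfo slices):

passChecked := make([]*v1.Node, 0, len(nodes))
noPassChecked := make([]*v1.Node, 0, len(nodes))
for _, node := range nodes {
    // Step 3: split nodes by whether they satisfy the preferred check.
    if checkPreferred(node, pod) {
        passChecked = append(passChecked, node)
    } else {
        noPassChecked = append(noPassChecked, node)
    }
}
// Step 4: shuffle within each group so every node keeps an equal chance.
rand.Shuffle(len(passChecked), func(i, j int) {
    passChecked[i], passChecked[j] = passChecked[j], passChecked[i]
})
rand.Shuffle(len(noPassChecked), func(i, j int) {
    noPassChecked[i], noPassChecked[j] = noPassChecked[j], noPassChecked[i]
})
// Step 5: preferred nodes first, then the rest; this ordered slice is what
// findNodesThatPassFilters walks in step 6.
orderedNodes := append(passChecked, noPassChecked...)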

Why is this needed?

Currently, for performance reasons, the kube-scheduler follows this scheduling logic:
1. It starts filtering feasible nodes from nextStartNodeIndex and stops once a specific number of nodes that pass all Filter plugins have been found (100 by default), as sketched below.

2. It then runs the Score plugins to assign scores to these feasible nodes.

3. Finally, it selects the node with the highest score for scheduling.
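
In simplified form, ignoring the parallel workers and the percentage-of-nodes calculation, the search described in step 1 behaves roughly like this (passesAllFilterPlugins is a placeholder for the Filter phase, not a real function):

feasible := make([]*v1.Node, 0, numNodesToFind)
for i := 0; i < len(allNodes) && len(feasible) < numNodesToFind; i++ {
    // Start from nextStartNodeIndex and wrap around the node list.
    node := allNodes[(nextStartNodeIndex+i)%len(allNodes)]
    if passesAllFilterPlugins(pod, node) {
        feasible = append(feasible, node)
    }
}
// Only this subset is scored; the highest-scoring node among it wins.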

However, because each scheduling attempt only evaluates a subset of the nodes and there are multiple Score plugins, pods often do not end up on the nodes users expect.

If we could add a new extension point to check nodes, we could prioritize scheduling pods onto the desired nodes.

@fanhaouu fanhaouu added the kind/feature Categorizes issue or PR as related to a new feature. label May 13, 2024
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 13, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AxeZhan
Member

AxeZhan commented May 13, 2024

I assume your goal is to (try to) make sure a pod with a preferred node affinity and taint tolerations is scheduled to a node that matches the node affinity and also has the tolerated taint?
Any specific use case for this behavior?

@fanhaouu
Author

I assume your goal is to (try to) make sure a pod with a preferred node affinity and taint tolerations is scheduled to a node that matches the node affinity and also has the tolerated taint? Any specific use case for this behavior?

As long as resources are available, I want pods to be scheduled onto specific nodes as much as possible. However, the many score plugins enabled in the cluster, along with the weights predefined by SREs, make it hard for users to adjust them dynamically. Meanwhile, for performance reasons, the scheduler only traverses and evaluates a subset of the nodes. This often leads to suboptimal scheduling results.

@AxeZhan
Member

AxeZhan commented May 14, 2024

I get that this is trying to get an ideal scoring result. But since the scheduler never guarantees that a pod will be scheduled to the node with the highest score, I'm still confused about why this is needed (if you really want to match the node affinity, why not use requiredDuringScheduling?).

Anyway, I think you can write a short doc and put it on the agenda of sig-scheduling (https://github.com/kubernetes/community/tree/master/sig-scheduling). Folks can then discuss it during the meeting.

@fanhaouu
Author

I get that this is trying to get an ideal scoring result. But since the scheduler never guarantees that a pod will be scheduled to the node with the highest score, I'm still confused about why this is needed (if you really want to match the node affinity, why not use requiredDuringScheduling?).

Anyway, I think you can write a short doc and put it on the agenda of sig-scheduling (https://github.com/kubernetes/community/tree/master/sig-scheduling). Folks can then discuss it during the meeting.

Okay, thank you. I understand your confusion. My main goal is to ensure that pods are always scheduled to fully preferred nodes first, rather than to nodes that only partially satisfy the preferences, while still meeting resource requirements.

@likakuli
Contributor

/cc
