How to process a predictor error response? #684

Open
yinsenyan opened this issue May 11, 2023 · 3 comments
Labels
kind/feature New feature or request

Comments

@yinsenyan
Contributor

yinsenyan commented May 11, 2023

What would you like to be added:

If a predict HTTP request fails, the predictor plugin currently returns an error and cancels scheduling, as here:
https://github.com/clusternet/clusternet/blob/main/pkg/scheduler/framework/plugins/predictor/predictor.go#L128
As a result, a predictor failure in a single cluster causes scheduling of the whole subscription to fail, which is inappropriate.

Why is this needed:

We need a better way to handle this. Two options:

  1. If the predict request fails, return 0 replicas.
  2. Drop the cluster from the available list when its predictor is not healthy.

With method 1, a cluster with 0 replicas still remains among the binding clusters and cannot be removed. It would either have to be removed during the merge process, or handled some other way.

With method 2, the cluster is dropped from the available cluster list as soon as the predict call fails for one feed, even if the subscription has many feeds. That is a drastic approach when there are only a few child clusters.
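To make the difference between the two methods concrete, here is a minimal Go sketch. It is not taken from the clusternet code base: `predictFunc`, `fakePredict`, `zeroOnFailure`, and `dropOnFailure` are hypothetical names, and the real predictor works on feeds and richer types rather than a plain replica count.

```go
package main

import (
	"errors"
	"fmt"
)

// predictFunc is a hypothetical stand-in for one predictor HTTP call.
// It returns the maximum replicas a cluster reports it can accept.
type predictFunc func(cluster string) (int32, error)

// errPredictorDown simulates a failed predict request.
var errPredictorDown = errors.New("predictor unreachable")

// fakePredict fails for "cluster-b" and returns 5 replicas otherwise.
func fakePredict(cluster string) (int32, error) {
	if cluster == "cluster-b" {
		return 0, errPredictorDown
	}
	return 5, nil
}

// zeroOnFailure sketches method 1: a failed predict yields 0 replicas,
// and the cluster stays in the result.
func zeroOnFailure(clusters []string, predict predictFunc) map[string]int32 {
	out := make(map[string]int32, len(clusters))
	for _, c := range clusters {
		replicas, err := predict(c)
		if err != nil {
			replicas = 0
		}
		out[c] = replicas
	}
	return out
}

// dropOnFailure sketches method 2: a failed predict removes the cluster
// from the result entirely.
func dropOnFailure(clusters []string, predict predictFunc) map[string]int32 {
	out := make(map[string]int32, len(clusters))
	for _, c := range clusters {
		replicas, err := predict(c)
		if err != nil {
			continue
		}
		out[c] = replicas
	}
	return out
}

func main() {
	clusters := []string{"cluster-a", "cluster-b"}
	fmt.Println("method 1:", zeroOnFailure(clusters, fakePredict))
	fmt.Println("method 2:", dropOnFailure(clusters, fakePredict))
}
```

In neither variant does a single failing predictor abort the whole loop, which is the behavior this issue asks to change. With method 1 the failing cluster is still present with 0 replicas; with method 2 it is gone.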

@yinsenyan yinsenyan added the kind/feature New feature or request label May 11, 2023
@yinsenyan
Contributor Author

  1. Add a post-predict extension point to handle clusters whose predictor is unhealthy.

@yinsenyan
Contributor Author

@dixudx @Garrybest

@dixudx
Member

dixudx commented May 15, 2023

  1. If the predict request fails, return 0 replicas.
    With method 1, a cluster with 0 replicas still remains among the binding clusters and cannot be removed. It would either have to be removed during the merge process, or handled some other way.

I'd prefer method 1, returning 0 replicas, which is friendly to the current scheduling framework and implementations.

By adding a new flag to the struct ClusterScore to indicate such unhealthy-predictor cases, all clusters with 0 replicas could easily be pruned in the function RunPredictPlugins.
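A rough Go sketch of this idea follows. The `clusterScore` type here is a simplified, hypothetical stand-in and not the real ClusterScore struct; the `PredictorUnhealthy` field name and the `pruneUnhealthy` helper are assumptions made for illustration, and where exactly the pruning would sit inside RunPredictPlugins is left open.

```go
package main

import "fmt"

// clusterScore is a hypothetical, simplified stand-in for the scheduler's
// ClusterScore struct. All field names are assumptions.
type clusterScore struct {
	Name               string
	MaxReplicas        int32
	PredictorUnhealthy bool // assumed new flag, set when the predict call failed
}

// pruneUnhealthy returns the entries whose predictor was healthy.
// Assumption: a cluster that legitimately reports 0 replicas (flag unset)
// is kept, so only failed predictors are pruned.
func pruneUnhealthy(scores []clusterScore) []clusterScore {
	kept := make([]clusterScore, 0, len(scores))
	for _, s := range scores {
		if s.PredictorUnhealthy {
			continue
		}
		kept = append(kept, s)
	}
	return kept
}

func main() {
	scores := []clusterScore{
		{Name: "cluster-a", MaxReplicas: 5},
		{Name: "cluster-b", MaxReplicas: 0, PredictorUnhealthy: true},
		{Name: "cluster-c", MaxReplicas: 0},
	}
	for _, s := range pruneUnhealthy(scores) {
		fmt.Printf("%s: %d\n", s.Name, s.MaxReplicas)
	}
}
```

The flag is what lets the pruning step tell "predictor failed" apart from "predictor answered 0", which a bare replica count of 0 cannot express.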

  1. Add a post-predict extension point to handle clusters whose predictor is unhealthy.

If so, it would be better to taint the cluster.
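As a purely illustrative sketch of how a post-predict step could taint such clusters, the Go code below uses local stand-in types only. The `postPredictPlugin` interface, the `predictResult` and `taint` types, and the taint key are all hypothetical; no such extension point is confirmed to exist in the scheduler framework, and a real implementation would presumably use the Kubernetes taint type on the cluster object.

```go
package main

import (
	"errors"
	"fmt"
)

// taint is a hypothetical, simplified stand-in for a Kubernetes-style taint.
type taint struct {
	Key    string
	Effect string
}

// predictResult is a hypothetical record of one cluster's predict outcome.
type predictResult struct {
	Cluster string
	Err     error
}

// postPredictPlugin is an assumed shape for the proposed extension point.
type postPredictPlugin interface {
	PostPredict(results []predictResult) map[string][]taint
}

// errUnreachable simulates a failed predict request.
var errUnreachable = errors.New("predictor unreachable")

// taintUnhealthy proposes a taint for every cluster whose predict call failed.
type taintUnhealthy struct{}

func (taintUnhealthy) PostPredict(results []predictResult) map[string][]taint {
	taints := make(map[string][]taint)
	for _, r := range results {
		if r.Err == nil {
			continue
		}
		taints[r.Cluster] = append(taints[r.Cluster], taint{
			Key:    "example.io/predictor-unhealthy", // made-up key for illustration
			Effect: "NoSchedule",
		})
	}
	return taints
}

func main() {
	var p postPredictPlugin = taintUnhealthy{}
	results := []predictResult{
		{Cluster: "cluster-a"},
		{Cluster: "cluster-b", Err: errUnreachable},
	}
	fmt.Println(p.PostPredict(results))
}
```

A taint would keep the unhealthy cluster out of later scheduling rounds as well, instead of only filtering it from the current one.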
