fix volcano podgroup update issue #2079
base: master
b4457b7 to c4b4547
Thank you for this fix @ckyuto!
Could you please rebase it?
Pull Request Test Coverage Report for Build 8936440813
💛 - Coveralls
9397c3d to ada7d06
c8d7650 to 25715d2
@andreyvelich Can you help review?
@ckyuto Could you eliminate irrelevant commits?
f3c56ef to 88347fa
Signed-off-by: Weiyu Yen <[email protected]>
@andreyvelich @tenzen-y I think there's a simple way to fix this. Can I get a review again?
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Tomcli
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Hi @tenzen-y @andreyvelich Could you help trigger the CI test for this PR? We can add a simple unit test if needed. Thanks
Sure, we should review and evaluate this PR during the code freeze.
if q := volcanoPodGroup.Spec.Queue; len(q) > 0 {
	queue = q
}
@ckyuto This override looks fine, but it might surprise users, since they cannot observe it anywhere.
So, could you implement validation webhooks here as well?
Thanks @tenzen-y. I'm not sure what I should validate here. Can you elaborate?
I think this should be the expected behavior for most use cases. This change prevents the queue from being overridden by the change in this PR, which was originally intended only to sync minMember with the number of replicas but accidentally also syncs the queue name.
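The behavior described above can be sketched in isolation. This is a minimal illustration, not the controller's actual code: `PodGroupSpec` and `syncPodGroupSpec` are hypothetical stand-ins, and only the `if q := ...; len(q) > 0` check mirrors the diff under review. It assumes minMember follows the replica count while a queue already set on the PodGroup is kept.

```go
package main

import "fmt"

// PodGroupSpec is a hypothetical, trimmed-down stand-in for the
// Volcano PodGroup spec; only the two fields discussed here are kept.
type PodGroupSpec struct {
	MinMember int32
	Queue     string
}

// syncPodGroupSpec (hypothetical helper) computes the spec to write back.
// MinMember always follows the replica count. The queue from the job's
// scheduling policy is used only when the PodGroup has no queue yet.
func syncPodGroupSpec(existing PodGroupSpec, replicas int32, policyQueue string) PodGroupSpec {
	queue := policyQueue
	// Same check as in the diff: keep a queue that is already set.
	if q := existing.Queue; len(q) > 0 {
		queue = q
	}
	return PodGroupSpec{MinMember: replicas, Queue: queue}
}

func main() {
	// A queue injected out of band (for example, per organization).
	existing := PodGroupSpec{MinMember: 2, Queue: "org-a"}
	updated := syncPodGroupSpec(existing, 4, "default")
	fmt.Printf("%+v\n", updated) // prints {MinMember:4 Queue:org-a}
}
```

With an empty existing queue, the same call falls back to the policy queue, so a newly created PodGroup would still pick up the value from the job.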
@ckyuto You can prevent updating on queue name here:
func (w Webhook) ValidateUpdate(ctx context.Context, _, newObj runtime.Object) (admission.Warnings, error) {
Also, could you implement the same validation in the PyTorchJob, TFJob, and XGBoost as well?
@ckyuto Ah, I guess that we can use CEL validation here instead of implementing validations like this.
Could you try it?
type SchedulingPolicy struct {
	MinAvailable *int32 `json:"minAvailable,omitempty"`
	// +kubebuilder:validation:XValidation:rule="self == oldSelf", message="field is immutable"
	Queue string `json:"queue,omitempty"`
	[...]
training-operator/pkg/apis/kubeflow.org/v1/common_types.go
Lines 226 to 234 in 7339880
// SchedulingPolicy encapsulates various scheduling policies of the distributed training
// job, for example `minAvailable` for gang-scheduling.
type SchedulingPolicy struct {
	MinAvailable *int32 `json:"minAvailable,omitempty"`
	Queue string `json:"queue,omitempty"`
	MinResources *map[v1.ResourceName]resource.Quantity `json:"minResources,omitempty"`
	PriorityClass string `json:"priorityClass,omitempty"`
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}
@tenzen-y No, none of the above will work. These validations apply to the values of Kubeflow jobs, and those values won't change after the job is created. The queue update happens on the queue field of the Volcano PodGroup.
We should block updates to .runPolicy.schedulingPolicy.queue, since the field is propagated to the PodGroup resource. We should not override or ignore a user-defined field without notifying users.
I didn't mean that we should introduce webhook validation instead of the controller logic; I meant that we should implement the validation as well.
If we don't implement the webhook validation, users cannot find out why an updated .runPolicy.schedulingPolicy.queue was not propagated to the PodGroup resource.
I'm still waiting for a response from @ckyuto.
Hi, I am traveling. I'll push a commit next Monday to address the comments. Can you wait?
On May 24, 2024, at 3:32 PM, Yuki Iwai ***@***.***> wrote:
@tenzen-y commented on this pull request.
In pkg/controller.v1/common/job.go:
+ if q := volcanoPodGroup.Spec.Queue; len(q) > 0 {
+ 	queue = q
+ }
I'm still waiting for response from @ckyuto.
Sure, safe travels 👍
What this PR does / why we need it:
This fixes an issue caused by this PR: the minMember may be updated when the number of replicas changes. However, that change also accidentally changes the queue value. It syncs the queue value in the PodGroup with the value in runPolicy.SchedulingPolicy.Queue, which is not applicable to all use cases.
In our use case, we inject the queue value according to which org the user belongs to. That change overrides the value we set in the queue. The queue value should not be updated once it is set.
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Fixes #
Checklist: