Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix volcano podgroup update issue #2079

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ckyuto
Copy link

@ckyuto ckyuto commented Apr 22, 2024

What this PR does / why we need it:
This is the fix cause by this PR, the minMember may be updated when the number of replica is changed. However, this also accidentally change the queue value. It also sync up the queue value in the podGroup with the value in runPolicy.SchedulingPolicy.Queue, which is not always applicable to all use cases.

In our use cases we'll inject the queue value according to which org this user belongs to. This change will override the value we set in the queue. The queue value should not be updated once the it is set.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this fix @ckyuto!
Please can you rebase it ?

@coveralls
Copy link

coveralls commented Apr 26, 2024

Pull Request Test Coverage Report for Build 8936440813

Details

  • 0 of 3 (0.0%) changed or added relevant lines in 1 file are covered.
  • 3 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.02%) to 35.398%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controller.v1/common/job.go 0 3 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mpi/mpijob.go 1 91.06%
pkg/controller.v1/mpi/mpijob_controller.go 2 81.05%
Totals Coverage Status
Change from base Build 8932297222: 0.02%
Covered Lines: 4377
Relevant Lines: 12365

💛 - Coveralls

@ckyuto
Copy link
Author

ckyuto commented May 1, 2024

@andreyvelich Can you help review?

@tenzen-y
Copy link
Member

tenzen-y commented May 2, 2024

@ckyuto Could you eliminate irrelevant commits?

@ckyuto
Copy link
Author

ckyuto commented May 2, 2024

tenzen-y
Thanks for the comment. Removed.

@ckyuto ckyuto force-pushed the wyen/fix_pg_update branch 2 times, most recently from f3c56ef to 88347fa Compare May 2, 2024 09:40
@google-oss-prow google-oss-prow bot added size/XS and removed size/M labels May 3, 2024
@ckyuto
Copy link
Author

ckyuto commented May 6, 2024

@andreyvelich @tenzen-y I think there's a simple way to fix this. Can I get a review again?

Copy link
Member

@Tomcli Tomcli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Tomcli
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Tomcli
Copy link
Member

Tomcli commented May 7, 2024

Hi @tenzen-y @andreyvelich Could you help trigger the CI test for this PR? We can add a simple unit test if needed. Thanks

@tenzen-y
Copy link
Member

tenzen-y commented May 8, 2024

Hi @tenzen-y @andreyvelich Could you help trigger the CI test for this PR? We can add a simple unit test if needed. Thanks

Sure, we should review and evaluate this PR during the code freeze.
/assign

Comment on lines +288 to +290
if q := volcanoPodGroup.Spec.Queue; len(q) > 0 {
queue = q
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ckyuto This overriding looks fine, but this might be surprising for users since users can not observe this overriding anywhere.
So, could you implement validation webhooks here as well?

https://github.com/kubeflow/training-operator/tree/73398806a64899a36263860c1d34bdc2dd1ec291/pkg/webhooks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tenzen-y . I'm not sure what should I validate here. Can you elaborate?
I think this should be the expected behavior for most use cases. This change prevent the queue being overridden by the change of this PR, which originally just intended to sync minMember of queue name to replica but accidentally sync the queue name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ckyuto You can prevent updating on queue name here:

func (w Webhook) ValidateUpdate(ctx context.Context, _, newObj runtime.Object) (admission.Warnings, error) {

Also, could you implement the same validation in the PyTorchJob, TFJob, and XGBoost as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ckyuto Ah, I guess that we can use CEL validation here instead of implementing validations like this.
Could you try it?

type SchedulingPolicy struct { 
 	MinAvailable           *int32                                 `json:"minAvailable,omitempty"` 
 	// +kubebuilder:validation:XValidation:rule="self == oldSelf", message="field is immutable"
 	Queue                  string                                 `json:"queue,omitempty"` 
[...]

// SchedulingPolicy encapsulates various scheduling policies of the distributed training
// job, for example `minAvailable` for gang-scheduling.
type SchedulingPolicy struct {
MinAvailable *int32 `json:"minAvailable,omitempty"`
Queue string `json:"queue,omitempty"`
MinResources *map[v1.ResourceName]resource.Quantity `json:"minResources,omitempty"`
PriorityClass string `json:"priorityClass,omitempty"`
ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}

Copy link
Author

@ckyuto ckyuto May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y No, none of above will work. These objects are validation for the value of kubeflow jobs, and these value won't be changed after the job is created. The queue value update happens on the queue in volcano podGroup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should block updating the .runPolicy.schedulingPolicy.queue since the field is propagated to the PodGroup resource. We should not override or ignore the defined field without notification for the users.

I didn't want to say that we should introduce webhook validation instead of controller logic.
I wanted to say that we should implement the validation as well.

If we don't implement the webhook validations, users can not find the reason why updated .runPolicy.schedulingPolicy.queue was not propagated to the PodGroup resource.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ckyuto @tenzen-y Do we have any updates here ?
Do we need to include this fix as part of 1.8 release ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still waiting for response from @ckyuto.

@ckyuto
Copy link
Author

ckyuto commented May 25, 2024 via email

@tenzen-y
Copy link
Member

Hi, I am in a travel. I’ll have a commit next Mon to address the comments. Can you wait?Sent from my iPhoneOn May 24, 2024, at 3:32 PM, Yuki Iwai @.***> wrote: @tenzen-y commented on this pull request. In pkg/controller.v1/common/job.go:

  • if q := volcanoPodGroup.Spec.Queue; len(q) > 0 {
  • queue = q + } I'm still waiting for response from @ckyuto. —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

Sure, safe travel 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants