Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Kueue Job Scheduler #822

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

Bobbins228
Copy link

@Bobbins228 Bobbins228 commented Feb 14, 2024

Added a Kubernetes Kueue batch job scheduler based on the Kubernetes Scheduler
Note: variable local_kueue="local-kueue-name" is required in the scheduler args for the queue-name label and for priority add kueue_priority_class="kueue-priority-class-name" to the sheduler args

Manual Testing

Set up a Kubernetes/Openshift cluster

  • Install Kueue
  • Clone this branch locally and create a torchx whl file with python3 setup.py bdist_wheel
  • Install the whl file with pip install dist/torchx-0.7.0.dev0-py3-none-any.whl
  • Authenticate to your K8s cluster
  • Run a sample job with torchx run --scheduler kueue_job --scheduler_args namespace=default,local_kueue="default-kueue",image_repo="user/alpine" utils.echo --image alpine:latest --msg hello - should return something like kueue_job://torchx_user/1234
  • Get the jobs status with torchx status kueue_job://torchx_user/1234
  • All jobs created with this scheduler should gain the Suspended/JobResumed status
  • Users can add custom annotations in the scheduler args following this scheme "annotations": {"key":"value"}

Integration test

  • Start the minikube setup script with bash setup_minikube_kueue.sh
  • Run the test file with python scripts/kueue_test.py --container_repo localhost:5000/torchx
  • For dry_run run python scripts/kueue_test.py --container_repo localhost:5000/torchx --dryrun

Test plan:

Created Unit tests based on Kubernetes Unit tests.

@facebook-github-bot
Copy link
Contributor

Hi @Bobbins228!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
@ccharest93
Copy link
Contributor

What does kueue do that volcano doesn't? Is there a reason to support more than one batch scheduler on kubernetes?

@Bobbins228
Copy link
Author

Hi @ccharest93,
We have decided to add a Kueue Batch Job scheduler as Kueue is backed by Kubernetes Batch Sig and this gives users more options in terms of choosing a Kubernetes Scheduler.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 1, 2024
Copy link
Contributor

@KPostOffice KPostOffice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quick comment

scripts/component_integration_tests.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
Copy link
Contributor

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting to see this -- does this also work with the native K8s batch API?

torchx/schedulers/__init__.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
torchx/specs/api.py Outdated Show resolved Hide resolved
scripts/component_integration_tests.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_job_scheduler.py Outdated Show resolved Hide resolved
scripts/setup_minikube_kueue.sh Show resolved Hide resolved
torchx/util/role_to_pod.py Outdated Show resolved Hide resolved
torchx/schedulers/kueue_scheduler.py Outdated Show resolved Hide resolved
@d4l3k
Copy link
Contributor

d4l3k commented Mar 26, 2024

Looks like there's also valid pyre/lint issues

@d4l3k
Copy link
Contributor

d4l3k commented Mar 27, 2024

@Bobbins228 there's been a bunch of cleanup on tests in on the main branch -- if you rebase/merge all tests should be passing (assuming no new breaking changes)

@xujyan
Copy link

xujyan commented Mar 28, 2024

@Bobbins228 did you try multi-node ddp with it? I think you'd need a JobSet for that which sets up a headless k8s service for workers to reach the master. See example here https://github.com/kubernetes-sigs/jobset/blob/main/examples/pytorch/resnet-cifar10/resnet.yaml

Another way to approach this is perhaps to not be tightly coupled with a specific operator but rather just output a k8s standard Job (or even Pod) spec with the container image and allow it to be piped into other tools that can transform it into whatever that's suitable. But AFAICT something like a --output-format yaml isn't supported in torchx and --dryrun aims to provide human readable instead of tool parsable output, so I don't know if this is realistic?

@d4l3k
Copy link
Contributor

d4l3k commented Mar 28, 2024

@xujyan @Bobbins228 looks like the ddp job on Kueue is having issues

@xujyan
Copy link

xujyan commented Mar 29, 2024

I would expect that. Like I suggested above you'd need to produce a JobSet instead of a Job, and as the example here shows, point to the master at

              - name: MASTER_ADDR
                value: "pytorch-workers-0-0.pytorch"

where the value is of the format <jobset_name>-<job_name>-0-0.<jobset_name>

@xujyan
Copy link

xujyan commented Mar 29, 2024

And you'd need to install jobset controller in the test env too, see README

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants