Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure batch scheduler implementation #688

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

kurman
Copy link
Contributor

@kurman kurman commented Feb 9, 2023

  1. Modify commands to initialize scheduler with options that can be defined in config. Generally most of the schedulers can operate using scheduler options, however in some cases for multi-tenant setup operations needs to be pre-configured. For example, Azure Batch allows multiple Batch accounts
  2. Initial implementation of the Azure batch scheduler. Provides support for scheduling, stopping, listing and describing jobs. Azure Batch has a good support HPC support and training jobs should work out of the box.

Major things that needs to be addressed:

  • Support for patching using Docker (currently it can run existing image)
  • Creating autoscaling pools that maps to resource requirements

Testing

  • Unit test
  • python -m torchx.cli.main run --scheduler azure_batch --scheduler_args image_repo=ghcr.io/pytorch/torchx utils.echo --image alpine:latest --msg hello

Screenshot 2023-02-08 at 7 15 27 PM

Screenshot 2023-02-08 at 7 15 43 PM

Screenshot 2023-02-08 at 7 15 51 PM

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 9, 2023
@codecov
Copy link

codecov bot commented Feb 9, 2023

Codecov Report

Merging #688 (1227cf6) into main (ef01789) will decrease coverage by 0.50%.
The diff coverage is 73.10%.

@@            Coverage Diff             @@
##             main     #688      +/-   ##
==========================================
- Coverage   92.47%   91.98%   -0.50%     
==========================================
  Files          82       83       +1     
  Lines        5664     5801     +137     
==========================================
+ Hits         5238     5336      +98     
- Misses        426      465      +39     
Impacted Files Coverage Δ
torchx/schedulers/__init__.py 95.23% <ø> (ø)
torchx/schedulers/azure_batch_scheduler.py 67.50% <67.50%> (ø)
torchx/cli/argparse_util.py 100.00% <100.00%> (ø)
torchx/cli/cmd_cancel.py 100.00% <100.00%> (ø)
torchx/cli/cmd_describe.py 100.00% <100.00%> (ø)
torchx/cli/cmd_list.py 100.00% <100.00%> (ø)
torchx/cli/cmd_log.py 95.83% <100.00%> (+0.08%) ⬆️
torchx/cli/cmd_run.py 88.88% <100.00%> (+0.09%) ⬆️
torchx/cli/cmd_runopts.py 90.90% <100.00%> (+0.90%) ⬆️
torchx/cli/cmd_status.py 96.77% <100.00%> (+0.22%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Comment on lines +143 to +146
session_name: str,
batch_account_name: Optional[str],
batch_account_key: Optional[str],
batch_account_url: Optional[str],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do these have to be taken in the constructor arguments (versus via the runoptions)? Looks like they are used when building the client in _build_client() and the client is create on-demand in the schedule API (rather than it being created once and being cached somewhere).

It looks like the real underlying reason why you added these to the constructor arguments is b/c the runcfg is not passed to the other APIs (e.g. list, describe, etc). Why not make the runcfg available to the other APIs instead of splitting scheduler configs between constructor args and runoptions?

self.batch_account_url = batch_account_url

def _build_client(self) -> BatchServiceClient:
credentials = SharedKeyCredentials(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, for AWS, the boto3 client grabs the default credentials (as configured in ~/.aws/credentials or from instance profile (if on an EC2 instance). Does a similar concept not exist in Azure? If it does we should follow the same strategy to deal with credentials.

For example, if a user using aws_batch scheduler wanted to change to a different credential (either to access a different account or to switch to a different IAM role), they would change the creds provider in the ~/.aws configuration rather than having to pass this information to torchx. This is much clearer to the users since they can use the vanilla AWS credentials chain rather than having to pass parameters to torchx.

@kiukchung
Copy link
Collaborator

@kurman any plans to actually commit this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants