Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable cluster and workload creation on A3+. #145

Merged
merged 15 commits into from
May 23, 2024
Merged

Conversation

yangyuwei
Copy link
Collaborator

Fixes / Features

  • remove nccl plugin installation in workload creation on A3
  • enable cluster and workload creation on A3+

Testing / Documentation

Manual tests passed for cluster creation and workload creation on A3+.

  • [ y ] Tests pass
  • [ y ] Appropriate changes to documentation are included in the PR

Copy link
Collaborator

@Obliviour Obliviour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall! Just some small requests

xpk.py Outdated

Args:
args: user provided arguments for running the command.

Returns:
0 if successful and 1 otherwise.
"""
for i in range(1, 5):
device_type = args.tpu_type if args.tpu_type else args.device_type
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pass the SystemCharacteristics intoset_up_cluster_network_for_gpu instead of checking the tpu/device_type args? Hoping to avoid duplication of this arg check in the code and propagate SystemCharacteristics instead

xpk.py Outdated
# pylint: disable=line-too-long
'https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-tcpx-installer.yaml'
)
device_type = args.tpu_type if args.tpu_type else args.device_type
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit to pass in SystemCharacteristics

xpk.py Outdated
Returns:
str: yaml containing gpu volume
"""
gpu_volumn = ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit volume

xpk.py Outdated
str: yaml containing gpu volume
"""
gpu_volumn = ""
if args.device_type == h100_device_type:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can pass systemcharacteristics instead and use system.device_type

xpk.py Outdated
Comment on lines 5786 to 5789
gpu_volume=get_gpu_volume(args),
gpu_rxdm_image=get_gpu_rxdm_image(args),
gpu_rxdm_cmd=get_gpu_rxdm_cmd(args),
gpu_tcp_volume=get_gpu_tcp_volume(args)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can pass system instead of args here? Mainly to have consistency in how we parse the device usage. This will be useful for future refactors

@yangyuwei
Copy link
Collaborator Author

LGTM overall! Just some small requests

Thanks for the review! I've updated the code based on your comments in the latest commit (091edf6). PTAL.

Copy link
Collaborator

@Obliviour Obliviour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small comment! LGTM

xpk.py Outdated
@@ -2764,16 +2755,17 @@ def delete_cluster_subnets(args) -> int:
if return_code > 0:
xpk_print('Listing all subnets failed!')
return return_code
for index in range(1, 5):

for index in range(1, 9):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be 1,5 for a3 scenario and 1,9 for a3+?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Based on offline discussion, I updated get_all_subnets_programmatic() to get more accurate results for subnets, so that we don't need to hardcode the number of subnets here. I also cleaned up the code based on pylint checks. PTAL.

Copy link
Collaborator

@Obliviour Obliviour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Yuwei!

@yangyuwei yangyuwei merged commit 9224dda into main May 23, 2024
6 checks passed
@yangyuwei yangyuwei deleted the yangyuwei-xpk-a3plus branch May 23, 2024 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants