-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable cluster and workload creation on A3+. #145
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall! Just some small requests
xpk.py
Outdated
|
||
Args: | ||
args: user provided arguments for running the command. | ||
|
||
Returns: | ||
0 if successful and 1 otherwise. | ||
""" | ||
for i in range(1, 5): | ||
device_type = args.tpu_type if args.tpu_type else args.device_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we pass the SystemCharacteristics intoset_up_cluster_network_for_gpu
instead of checking the tpu/device_type args? Hoping to avoid duplication of this arg check in the code and propagate SystemCharacteristics instead
xpk.py
Outdated
# pylint: disable=line-too-long | ||
'https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-tcpx-installer.yaml' | ||
) | ||
device_type = args.tpu_type if args.tpu_type else args.device_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit to pass in SystemCharacteristics
xpk.py
Outdated
Returns: | ||
str: yaml containing gpu volume | ||
""" | ||
gpu_volumn = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit volume
xpk.py
Outdated
str: yaml containing gpu volume | ||
""" | ||
gpu_volumn = "" | ||
if args.device_type == h100_device_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can pass systemcharacteristics instead and use system.device_type
xpk.py
Outdated
gpu_volume=get_gpu_volume(args), | ||
gpu_rxdm_image=get_gpu_rxdm_image(args), | ||
gpu_rxdm_cmd=get_gpu_rxdm_cmd(args), | ||
gpu_tcp_volume=get_gpu_tcp_volume(args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can pass system instead of args here? Mainly to have consistency in how we parse the device usage. This will be useful for future refactors
Thanks for the review! I've updated the code based on your comments in the latest commit (091edf6). PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one small comment! LGTM
xpk.py
Outdated
@@ -2764,16 +2755,17 @@ def delete_cluster_subnets(args) -> int: | |||
if return_code > 0: | |||
xpk_print('Listing all subnets failed!') | |||
return return_code | |||
for index in range(1, 5): | |||
|
|||
for index in range(1, 9): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be 1,5 for a3 scenario and 1,9 for a3+?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Based on offline discussion, I updated get_all_subnets_programmatic() to get more accurate results for subnets, so that we don't need to hardcode the number of subnets here. I also cleaned up the code based on pylint checks. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Yuwei!
Fixes / Features
Testing / Documentation
Manual tests passed for cluster creation and workload creation on A3+.