-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pathways in headless mode. #141
Conversation
e9901cb
to
5c80894
Compare
c3fc697
to
b092cef
Compare
5697a3d
to
bfe96d5
Compare
|
||
### "workload create-pathways" Required arguments, specific to Pathways | ||
workload_create_pathways_parser_required_arguments.add_argument( | ||
'--tpu-type', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm a bit confused, why does pathways_parser for workload create need --tpu-type specifically when it is part of workload_create arguments as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
workload create supports provision of one out of tpu-type or device-type in a mutually exclusive group (GPUs and CPUs are supported).
workload create-pathways can only support tpu-type. Pathways users have been using this argument.
Device related args are hence not added in the shared parsers command!
ed2c349
to
4f9de10
Compare
6210901
to
74c1958
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, sorry for the delay.
xpk.py
Outdated
and _VERTEX_TENSORBOARD_FEATURE_FLAG | ||
and args.use_vertex_tensorboard | ||
): | ||
# Profiling is handled differently in Pathways workloads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would that impact viewing pathways workers generated profiles in vertex tensorboard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for noticing this.
I made a design decision to disable this feature for Pathways workload creation using xpk workload create-pathways
, at least for now.
I captured some details around this decision in https://b.corp.google.com/issues/342189448#comment3 .
TL;DR -- While XPK allows creation of the tensorboard, the workload itself also needs to be modified to upload the logs to tensorboard. Tensorboard is available only in certain regions - https://github.com/google/xpk/blob/main/README.md#create-vertex-ai-tensorboard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm tensorboard is a very nice for users to inspect their workloads. I am guessing it can be used nicely in pathways so i am curious to see how we integrate pathways and tensorboard in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I follow the reasoning. I would suspect maxtext already has the functionality to do the upload. What specifically is different in the Pathways setup that you see as a concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing specific to Pathways, this feature is available through Maxtext + XPK.
My only concern is that this feature is not extensible to other JAX workloads, that might be used with Pathways, without the specific changes mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the feature back in, with the caveat that users would need to modify their workload (similar to Maxtext) to upload the logs, in order to be able to use them in Pathways + XPK.
74c1958
to
fa00f64
Compare
f2a72f0
to
02c25e3
Compare
xpk.py
Outdated
and _VERTEX_TENSORBOARD_FEATURE_FLAG | ||
and args.use_vertex_tensorboard | ||
): | ||
# Profiling is handled differently in Pathways workloads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm tensorboard is a very nice for users to inspect their workloads. I am guessing it can be used nicely in pathways so i am curious to see how we integrate pathways and tensorboard in the future
…load create YAMLs.
…ds, accompanying changes.
…for workload create-pathways.
7ed2735
to
57dfeed
Compare
57dfeed
to
342002e
Compare
and args.deploy_stacktrace_sidecar | ||
): | ||
tpu_stacktrace_terminate_command = ( | ||
'touch /shared-volume/stacktrace_signal; ' | ||
) | ||
|
||
xpk_return_user_exit_code = '' | ||
if args.restart_on_user_code_failure: | ||
if not args.use_pathways and args.restart_on_user_code_failure: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need many of these if not args.use_pathways
? It the argument doesn't exist in the create_pathways subfunction, then that prevents this code path right?
I feel like having several if not args.use_pathways will be hard to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is necessary to have the condition to not go through this code path. Without the argument in the create_pathways subfunction, the code path is not prevented, but arg not found error is observed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds cleaner to use the arguments set in subfunctions to determine which code paths can be run instead of having to do if checks in the code? What do you think, and let me know if there is some error that prevents this from working!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the rationale, I tried to keep it as simple and clean as I could. However, some args are not shared (present in create and not in create_pathways). When a user runs create_pathways and XPK goes through these code paths, the args are "not defined" and hence not found.
Example -
"AttributeError: 'Namespace' object has no attribute 'deploy_stacktrace_sidecar'"
Hence, I meticulously abstracted the code paths for 4 features that are not needed for Pathways. LMK if you see a better way?
@@ -5121,7 +5125,7 @@ def get_pathways_worker_args(args) -> str: | |||
""" | |||
yaml = """- --alsologtostderr | |||
- --pathways_server_port=38677 | |||
- --pathways_resource_manager={args.workload}-rm-0-0.{args.workload}:38677 | |||
- --pathways_resource_manager={args.workload}-rm-0-0.{args.workload}.default.svc.{args.cluster}-domain.:38677 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this and all the rest of the changes to the DNS addresses would break existing users who are using clusters that do not use cloud DNS. We should probably share the steps to do that with all users or hide this change behind some check for cloud DNS.
Fixes / Features
Testing / Documentation
Testing details.