Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pathways in headless mode. #141

Merged
merged 12 commits into from
May 23, 2024
Merged

Pathways in headless mode. #141

merged 12 commits into from
May 23, 2024

Conversation

RoshaniN
Copy link
Collaborator

@RoshaniN RoshaniN commented May 2, 2024

Fixes / Features

  • Enable Cloud DNS on Pathways clusters.
  • Add user job conditionally based on whether headless mode is defined or not.
  • Print out proxy address for Pathways in headless mode. Users would connect to this address from Vertex AI notebook / GCE VMs, to execute their JAX workloads.
  • Organized XPK subcommands for Pathways.
    • Add cluster create-pathways subcommand. Deprecate --enable-pathways.
    • Add workload create-pathways subcommand. Deprecate --use-pathways.
  • Update build tests and nightly tests, reordered a few tests.
  • Restricting some features to create as they are not suitable for create-pathways, at the moment. (eg: autoprovisioning, debug dump to gcs, stack trace sidecar and restart on user failures.)

Testing / Documentation

Testing details.

  • [ y ] Tests pass
  • [ y ] Appropriate changes to documentation are included in the PR

@RoshaniN RoshaniN requested a review from Obliviour as a code owner May 2, 2024 20:49
@RoshaniN RoshaniN marked this pull request as draft May 2, 2024 20:49
@RoshaniN RoshaniN force-pushed the roshanin_pathways_headless branch 4 times, most recently from e9901cb to 5c80894 Compare May 7, 2024 17:56
@RoshaniN RoshaniN requested a review from shaurya89 May 7, 2024 17:57
@RoshaniN RoshaniN force-pushed the roshanin_pathways_headless branch 11 times, most recently from c3fc697 to b092cef Compare May 14, 2024 20:26
@RoshaniN RoshaniN force-pushed the roshanin_pathways_headless branch 4 times, most recently from 5697a3d to bfe96d5 Compare May 15, 2024 18:56
xpk.py Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
xpk.py Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved

### "workload create-pathways" Required arguments, specific to Pathways
workload_create_pathways_parser_required_arguments.add_argument(
'--tpu-type',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm a bit confused, why does pathways_parser for workload create need --tpu-type specifically when it is part of workload_create arguments as well

Copy link
Collaborator Author

@RoshaniN RoshaniN May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

workload create supports provision of one out of tpu-type or device-type in a mutually exclusive group (GPUs and CPUs are supported).
workload create-pathways can only support tpu-type. Pathways users have been using this argument.

Device related args are hence not added in the shared parsers command!

https://github.com/google/xpk/blob/main/xpk.py#L7021-L7041

@RoshaniN RoshaniN force-pushed the roshanin_pathways_headless branch 4 times, most recently from ed2c349 to 4f9de10 Compare May 16, 2024 21:07
@RoshaniN RoshaniN requested a review from Obliviour May 16, 2024 21:08
@RoshaniN RoshaniN changed the title [DRAFT] Pathways in headless mode. Pathways in headless mode. May 16, 2024
@RoshaniN RoshaniN marked this pull request as ready for review May 16, 2024 22:26
@RoshaniN RoshaniN force-pushed the roshanin_pathways_headless branch 2 times, most recently from 6210901 to 74c1958 Compare May 17, 2024 20:37
Copy link
Collaborator

@shaurya89 shaurya89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, sorry for the delay.

xpk.py Outdated
and _VERTEX_TENSORBOARD_FEATURE_FLAG
and args.use_vertex_tensorboard
):
# Profiling is handled differently in Pathways workloads
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would that impact viewing pathways workers generated profiles in vertex tensorboard?

Copy link
Collaborator Author

@RoshaniN RoshaniN May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing this.
I made a design decision to disable this feature for Pathways workload creation using xpk workload create-pathways , at least for now.
I captured some details around this decision in https://b.corp.google.com/issues/342189448#comment3 .

TL;DR -- While XPK allows creation of the tensorboard, the workload itself also needs to be modified to upload the logs to tensorboard. Tensorboard is available only in certain regions - https://github.com/google/xpk/blob/main/README.md#create-vertex-ai-tensorboard

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm tensorboard is a very nice for users to inspect their workloads. I am guessing it can be used nicely in pathways so i am curious to see how we integrate pathways and tensorboard in the future

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I follow the reasoning. I would suspect maxtext already has the functionality to do the upload. What specifically is different in the Pathways setup that you see as a concern?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing specific to Pathways, this feature is available through Maxtext + XPK.

My only concern is that this feature is not extensible to other JAX workloads, that might be used with Pathways, without the specific changes mentioned above.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the feature back in, with the caveat that users would need to modify their workload (similar to Maxtext) to upload the logs, in order to be able to use them in Pathways + XPK.

xpk.py Show resolved Hide resolved
xpk.py Show resolved Hide resolved
@RoshaniN RoshaniN force-pushed the roshanin_pathways_headless branch from 74c1958 to fa00f64 Compare May 22, 2024 18:58
@RoshaniN RoshaniN requested a review from shaurya89 May 22, 2024 18:59
@RoshaniN RoshaniN force-pushed the roshanin_pathways_headless branch 2 times, most recently from f2a72f0 to 02c25e3 Compare May 22, 2024 20:16
xpk.py Show resolved Hide resolved
xpk.py Show resolved Hide resolved
xpk.py Outdated
and _VERTEX_TENSORBOARD_FEATURE_FLAG
and args.use_vertex_tensorboard
):
# Profiling is handled differently in Pathways workloads
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm tensorboard is a very nice for users to inspect their workloads. I am guessing it can be used nicely in pathways so i am curious to see how we integrate pathways and tensorboard in the future

@RoshaniN RoshaniN force-pushed the roshanin_pathways_headless branch 2 times, most recently from 7ed2735 to 57dfeed Compare May 22, 2024 21:43
@RoshaniN RoshaniN force-pushed the roshanin_pathways_headless branch from 57dfeed to 342002e Compare May 22, 2024 21:55
and args.deploy_stacktrace_sidecar
):
tpu_stacktrace_terminate_command = (
'touch /shared-volume/stacktrace_signal; '
)

xpk_return_user_exit_code = ''
if args.restart_on_user_code_failure:
if not args.use_pathways and args.restart_on_user_code_failure:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need many of these if not args.use_pathways? It the argument doesn't exist in the create_pathways subfunction, then that prevents this code path right?

I feel like having several if not args.use_pathways will be hard to maintain.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary to have the condition to not go through this code path. Without the argument in the create_pathways subfunction, the code path is not prevented, but arg not found error is observed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds cleaner to use the arguments set in subfunctions to determine which code paths can be run instead of having to do if checks in the code? What do you think, and let me know if there is some error that prevents this from working!

Copy link
Collaborator Author

@RoshaniN RoshaniN May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the rationale, I tried to keep it as simple and clean as I could. However, some args are not shared (present in create and not in create_pathways). When a user runs create_pathways and XPK goes through these code paths, the args are "not defined" and hence not found.
Example -
"AttributeError: 'Namespace' object has no attribute 'deploy_stacktrace_sidecar'"

Hence, I meticulously abstracted the code paths for 4 features that are not needed for Pathways. LMK if you see a better way?

http://b/342189448

@RoshaniN RoshaniN merged commit 01ac4d4 into main May 23, 2024
6 checks passed
@RoshaniN RoshaniN deleted the roshanin_pathways_headless branch May 23, 2024 19:09
@@ -5121,7 +5125,7 @@ def get_pathways_worker_args(args) -> str:
"""
yaml = """- --alsologtostderr
- --pathways_server_port=38677
- --pathways_resource_manager={args.workload}-rm-0-0.{args.workload}:38677
- --pathways_resource_manager={args.workload}-rm-0-0.{args.workload}.default.svc.{args.cluster}-domain.:38677
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this and all the rest of the changes to the DNS addresses would break existing users who are using clusters that do not use cloud DNS. We should probably share the steps to do that with all users or hide this change behind some check for cloud DNS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants