Support parallel for operations, like data parallel training, model parallel training etc #3102

typhoonzero · 2023-02-07T02:34:36Z

What changes were proposed in this pull request?

Support ParallelFor pipeline features for each operation. Set parallel_count > 2 to start parallel operations like distributed training, distributed data processing etc. Below are features/limitations:

Suport kfp + Argo currently.
Automatically set TF_CONFIG for Tensorflow and MASTER_ADDR, MASTER_PORT for Pytorch. Yet in some cases, workers rank >=1 should wait for rank0 to start. This can be achieved by waiting rank0's TCP server port by user.
No Parameter Server style distritubed training support, since it's less popular now.
Components/Operations before and after parallelfor operation is supported

How was this pull request tested?

Unit tests are included in test_bootstrapper.py to ensure argument parallel_count is working.

TODO:

support airflow, kfp tekton
support parallel operation output file (fetch output only from rank0)
add examples to do tensorflow/pytorch/accelerate distributed training

elyra/kfp/bootstrapper.py

elyra/pipeline/kfp/processor_kfp.py

Signed-off-by: typhoonzero <[email protected]>

…_parallel_for_operations

Signed-off-by: typhoonzero <[email protected]>

typhoonzero · 2023-05-11T03:13:57Z

@akchinSTC Can you please checkout this feature

lresende · 2023-05-11T17:59:31Z

@akchinSTC Can you please checkout this feature

What is the current status of this PR? I still see the work in progress tag..

typhoonzero · 2023-05-12T05:51:44Z

@akchinSTC Can you please checkout this feature

What is the current status of this PR? I still see the work in progress tag..

This PR is ready for review now.
I removed "WIP" from the title, yet the tag seems still there.

Work under the TODO list will move on after this feature is merged.

github-advanced-security bot found potential problems Feb 7, 2023

View reviewed changes

elyra/kfp/bootstrapper.py Fixed Show fixed Hide fixed

akchinSTC added component:pipeline-editor pipeline editor component:pipeline-runtime issues related to pipeline runtimes e.g. kubeflow pipelines status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. labels Feb 7, 2023

github-advanced-security bot found potential problems May 4, 2023

View reviewed changes

elyra/pipeline/kfp/processor_kfp.py Fixed Show fixed Hide fixed

Support parallelfor operations(re-push)

ef1b41f

Signed-off-by: typhoonzero <[email protected]>

typhoonzero force-pushed the support_parallel_for_operations branch from b14c41c to ef1b41f Compare May 5, 2023 10:49

typhoonzero added 2 commits May 8, 2023 16:38

Merge branch 'main' of https://github.com/elyra-ai/elyra into support…

c32a1be

…_parallel_for_operations

fix tests

5cb02ab

Signed-off-by: typhoonzero <[email protected]>

typhoonzero force-pushed the support_parallel_for_operations branch from 7d371cd to 5cb02ab Compare May 9, 2023 11:01

fix format

8fe6526

Signed-off-by: typhoonzero <[email protected]>

typhoonzero force-pushed the support_parallel_for_operations branch from c6653bb to cb47ff4 Compare May 10, 2023 02:17

fix integration tests

ee6eb87

Signed-off-by: typhoonzero <[email protected]>

typhoonzero force-pushed the support_parallel_for_operations branch from cb47ff4 to ee6eb87 Compare May 10, 2023 09:22

typhoonzero changed the title ~~[WIP]: Support parallel for operations~~ Support parallel for operations, like data parallel training, model parallel training etc May 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support parallel for operations, like data parallel training, model parallel training etc #3102

Support parallel for operations, like data parallel training, model parallel training etc #3102

typhoonzero commented Feb 7, 2023 •

edited

typhoonzero commented May 11, 2023

lresende commented May 11, 2023

typhoonzero commented May 12, 2023

Support parallel for operations, like data parallel training, model parallel training etc #3102

Are you sure you want to change the base?

Support parallel for operations, like data parallel training, model parallel training etc #3102

Conversation

typhoonzero commented Feb 7, 2023 • edited

What changes were proposed in this pull request?

How was this pull request tested?

TODO:

typhoonzero commented May 11, 2023

lresende commented May 11, 2023

typhoonzero commented May 12, 2023

typhoonzero commented Feb 7, 2023 •

edited