Eager fixes for distributed execution #558

tammam1998 · 2022-05-20T14:44:47Z

Add eager after remote_read commands
Fix an issue causing conflicting node ids on the worker. DFGNode.next_id starts from 0 in the worker which causes us to override some nodes when adding eagers. A better long-term solution is to make node id generation local to the graph.

Signed-off-by: Tammam Mustafa <[email protected]>

github-actions · 2022-05-20T14:53:19Z

OS:ubuntu-20.04
Fri May 20 14:53:18 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

github-actions · 2022-05-20T14:54:26Z

OS:ubuntu-20.04
Fri May 20 14:54:26 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

github-actions · 2022-05-20T14:54:27Z

OS:ubuntu-18.04
Fri May 20 14:54:27 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

github-actions · 2022-05-20T14:54:38Z

OS:ubuntu-18.04
Fri May 20 14:54:38 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

github-actions · 2022-05-20T14:57:13Z

OS = Debian 10
CPU = Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Ram = 15752
Hash = fd53591
Kernel= Linux 4.15.0-167-generic x86_64

benchmark	tests	passed	failed	untested	unresolved	unsupported	not_in_use
posix	494	375	41	31	6	40	1
intro	2	2	0	0	0	0	0
interface	35	35	0	0	0	0	0
compiler	54	54	0	0	0	0	0
aggregator	109	109	0	0	0	0	0

Signed-off-by: Tammam Mustafa <[email protected]>

github-actions · 2022-05-20T15:10:43Z

OS = Debian 10
CPU = Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Ram = 15752
Hash = 494a436
Kernel= Linux 4.15.0-167-generic x86_64

benchmark	tests	passed	failed	untested	unresolved	unsupported	not_in_use
posix	494	375	41	31	6	40	1
intro	2	2	0	0	0	0	0
interface	35	35	0	0	0	0	0
compiler	54	54	0	0	0	0	0
aggregator	109	109	0	0	0	0	0

github-actions · 2022-05-20T15:14:02Z

OS:ubuntu-18.04
Fri May 20 15:14:01 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

github-actions · 2022-05-20T15:14:17Z

OS:ubuntu-20.04
Fri May 20 15:14:17 UTC 2022
intro: 2/2 tests passed.
interface: 35/35 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

github-actions · 2022-05-20T15:22:32Z

OS = Debian 10
CPU = Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Ram = 15752
Hash = 3334b76
Kernel= Linux 4.15.0-167-generic x86_64

benchmark	tests	passed	failed	untested	unresolved	unsupported	not_in_use
posix	494	375	41	31	6	40	1
intro	2	2	0	0	0	0	0
interface	35	35	0	0	0	0	0
compiler	54	54	0	0	0	0	0
aggregator	109	109	0	0	0	0	0

angelhof

I am a little concerned about the changes in the graph traversal. Where they necessary or simple improvements?

angelhof · 2022-05-24T14:45:56Z

compiler/dspash/ir_helper.py

+ # Set DFGNode next id to not clash with already existing ids
+ # TODO: ideally we should get the next_id from the graph object
+ # to avoid conflicts across parallel processes
+ DFGNode.next_id = max(DFGNode.next_id , max(graph.nodes.keys()) + 1)


Shouldn't we do that in general? Outside of this if statement?

angelhof · 2022-05-24T14:46:23Z

compiler/dspash/ir_helper.py

@@ -261,6 +265,9 @@ def assign_workers_to_subgraphs(subgraphs:List[IR], file_id_gen: FileIdGen, inpu
 # sometimes a command can have both a file resource and an ephemeral resources (example: spell oneliner)
 continue

+ # for worker, graph in worker_subgraph_pairs:


I would delete this

angelhof · 2022-05-24T14:48:35Z

compiler/pash_runtime.py

@@ -721,14 +722,20 @@ def add_eager_nodes(graph, use_dgsh_tee):
 intermediateFileIdGen = FileIdGen(0, runtime_config['eager_intermediate_prefix'])

 ## Get the next nodes
- workset = [node for source_node_id in source_node_ids for node in graph.get_next_nodes(source_node_id)]


I would be tempted to think that there was a reason why this was here. Maybe we don't want to start from all source_node_ids? Is there a particular reason you deleted this? The reason why I am asking is because regressions here cannot be caught by any test (the compiler won't fail most likely but maybe it won't optimize that much).

angelhof · 2022-05-24T14:49:02Z

compiler/pash_runtime.py

 if (not curr_id in visited):
 visited.add(curr_id)
 next_node_ids = graph.get_next_nodes(curr_id)
+
+ # Skip if this is the last node
+ if not next_node_ids:


Similarly, is that necessary or an optimization?

add eager after remote_read and fix conflicting node ids issue

494a436

Signed-off-by: Tammam Mustafa <[email protected]>

tammam1998 force-pushed the dspash-eagers-fix branch from fd53591 to 494a436 Compare May 20, 2022 14:46

tammam1998 requested a review from angelhof May 20, 2022 14:47

fix small mistake

3334b76

Signed-off-by: Tammam Mustafa <[email protected]>

angelhof requested changes May 24, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eager fixes for distributed execution #558

Eager fixes for distributed execution #558

tammam1998 commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

angelhof left a comment

angelhof May 24, 2022

angelhof May 24, 2022

angelhof May 24, 2022

angelhof May 24, 2022

Eager fixes for distributed execution #558

Are you sure you want to change the base?

Eager fixes for distributed execution #558

Conversation

tammam1998 commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

github-actions bot commented May 20, 2022

angelhof left a comment

Choose a reason for hiding this comment

angelhof May 24, 2022

Choose a reason for hiding this comment

angelhof May 24, 2022

Choose a reason for hiding this comment

angelhof May 24, 2022

Choose a reason for hiding this comment

angelhof May 24, 2022

Choose a reason for hiding this comment