Add Platform test to show Hadoop and Tez partitioning difference #59

piyushnarang · 2017-02-11T01:59:58Z

I noticed this on one of our test jobs that we use to compare the performance of MR and Tez.
I've built a unit test that reproduces a subset of the graph in which Cascading on Hadoop combines more nodes and so reduces the amount of data streamed between nodes / steps.

The job starts with two vertices, V0 and V1, which together read around 3,025,369,753 tuples (10-odd TB). These are merged and grouped in vertex V2, and the result is passed to vertex V3, which performs some aggregations (Everys) and reduces the data to around 1 TB.

On Hadoop, V0 and V1 run in the job's mappers, while V2 and V3 are combined and run in the reducers. We then write out this 1 TB or so of data, which is picked up by the downstream steps.
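
To put rough numbers on this, here is a toy Python model of how much data crosses a boundary under each grouping. It is only a sketch and not Cascading planner code: `boundary_traffic`, the group lists, and the linear "V0+V1" simplification are hypothetical, the sizes are the approximate figures quoted above, and the Tez grouping assumes V2 and V3 stay separate vertices, which the description implies but does not spell out.

```python
# Approximate output of each stage in TB, taken from the description above.
# Assumption: the merge + group in V2 does not shrink the data; only V3 does.
OUTPUT_TB = {"V0+V1": 10.0, "V2": 10.0, "V3": 1.0}


def boundary_traffic(groups, output_tb):
    """Return the TB emitted by the last stage of each group.

    Each group is a list of stages that run together in one processing
    unit, so only the last stage's output crosses a boundary.
    """
    return [output_tb[group[-1]] for group in groups]


# Hadoop: V0, V1 on the mappers; V2 + V3 combined on the reducers.
hadoop_groups = [["V0+V1"], ["V2", "V3"]]

# Tez (assumed): V2 and V3 remain separate vertices.
tez_groups_assumed = [["V0+V1"], ["V2"], ["V3"]]

hadoop_total = sum(boundary_traffic(hadoop_groups, OUTPUT_TB))
tez_total = sum(boundary_traffic(tez_groups_assumed, OUTPUT_TB))

print(f"Hadoop boundaries: {hadoop_total} TB")
print(f"Tez boundaries (assumed): {tez_total} TB")
print(f"Extra data streamed: {tez_total - hadoop_total} TB")
```

Under these assumptions, the only difference between the two groupings is the V2 to V3 edge, which carries the full pre-aggregation data when the two vertices are not combined.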

Wondering if we should have a rule to collapse these aggregations into the step doing the groupBy?

cwensel · 2023-06-14T14:36:37Z

Leaving this open in the hope I have time to look into it, even though it's likely no longer a concern.

Add Platform test to show Hadoop and Tez partitioning difference

8168099

piyushnarang mentioned this pull request Sep 28, 2017

Add generic TypedPipe optimization rules twitter/scalding#1724

Merged
