avoid shuffle? #388

tooptoop4 · 2020-10-14T14:06:09Z

I read CSV input and then write Parquet with partitionBy, and it takes a long time. Are there any settings you recommend? Maybe https://issues.apache.org/jira/browse/SPARK-24940?

lyogev · 2020-10-15T11:15:58Z

If you're reading a single CSV and then performing partitionBy, the first part of the job will run as a single partition.
You can either repartition using the hints in the link, or create a multi-config job: first write the CSV to Parquet with repartitioning (using the repartition flag in the Parquet output), then read that Parquet and run your business logic on top of it.
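
For reference, here is a minimal PySpark sketch of the "repartition, then write with partitionBy" idea outside of any job config. It is not the project's own implementation: the helper name `write_partitioned_parquet`, the column `event_date`, the paths, and the partition count are all hypothetical. Note that repartitioning adds a shuffle rather than avoiding one; what it should buy you is more parallelism for the write.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported only for type hints
    from pyspark.sql import DataFrame


def write_partitioned_parquet(
    df: "DataFrame",
    partition_col: str,
    out_path: str,
    num_partitions: int = 200,  # hypothetical default, tune for your data
) -> None:
    """Spread rows across tasks by partition_col, then write partitioned Parquet."""
    (
        df.repartition(num_partitions, partition_col)  # explicit shuffle
        .write
        .mode("overwrite")
        .partitionBy(partition_col)  # one output directory per distinct value
        .parquet(out_path)
    )


def example_job() -> None:
    """Hypothetical end-to-end usage; needs a real Spark installation to run."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
    df = spark.read.option("header", "true").csv("input.csv")  # assumed path
    write_partitioned_parquet(df, "event_date", "out/events", num_partitions=8)
    spark.stop()
```

The SQL hint route from SPARK-24940 would look roughly like `SELECT /*+ REPARTITION(8) */ * FROM my_table` in a query step, where `my_table` is a placeholder name.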
