LOcal REplay framework: dump and replay GpuProjectExec runtime [databricks] #10825

Open

wants to merge 1 commit into base: branch-24.08

Conversation

res-life
Collaborator

@res-life res-life commented May 16, 2024

Closes #10862
Collecting nsys/ncu files is time-consuming when running on customer data, because customer data is usually huge.
Actually we only need the small data segment that runs in a JNI/cuDF kernel.
This PR aims to dump/replay the runtime (data and meta) when a Project Exec's execution time on a batch exceeds the threshold time.

This is a feature to dump and replay Project Exec runtime (by columnar batch) for performance debugging.

  • Project exec
    store/restore the GpuTieredProject case class
    store/restore ColumnarBatch data
    replay GpuProjectExec
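For reference, the dump side of the feature is driven by a handful of configs discussed later in this thread. The sketch below uses the post-review `.debug.` names; the `dumpDir` and `threshold.timeMS` names under `.debug.` are assumed from the rename discussion, so treat this as a sketch, not the final API:

```
spark.conf.set("spark.rapids.sql.debug.replay.exec.types", "project")
spark.conf.set("spark.rapids.sql.debug.replay.exec.dumpDir", "file:/tmp/replay-exec")
spark.conf.set("spark.rapids.sql.debug.replay.exec.threshold.timeMS", 1000)
spark.conf.set("spark.rapids.sql.debug.replay.exec.batch.limit", 1)
```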

@res-life res-life changed the title [Do not review] [WIP] Add store and replay exec env feature [Do not review] [WIP] Add store and replay exec runtime feature May 16, 2024
@res-life res-life force-pushed the replay-exec branch 3 times, most recently from d704928 to 2702c0a Compare May 17, 2024 12:36
@res-life res-life changed the title [Do not review] [WIP] Add store and replay exec runtime feature [WIP] LOcal REplay framework: dump and replay GpuProjectExec runtime May 20, 2024
store files in. Remote paths are supported, e.g. `hdfs://url:9000/path/to/save`

```
spark.conf.set("spark.rapids.sql.test.replay.exec.threshold.timeMS", 100)
```
Collaborator

Nit: Hmm, let's start with 1000 ms as a default value.

Collaborator Author

Done.

Only dump the column batches when the execution time exceeds the threshold time.
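A minimal sketch of that gating logic, with hypothetical names (the real code lives in the GpuProjectExec changes in this PR): time the batch's processing and dump only when the elapsed time reaches the threshold.

```scala
// Sketch: process a batch, and hand it to `dump` only when processing
// took at least `thresholdMs` milliseconds. All names are hypothetical.
def processAndMaybeDump[T](thresholdMs: Long, dump: T => Unit)(batch: T)(process: T => T): T = {
  val start = System.nanoTime()
  val result = process(batch)               // the actual Project work
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  if (elapsedMs >= thresholdMs) dump(batch) // only slow batches are persisted
  result
}
```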

```
spark.conf.set("spark.rapids.sql.test.replay.exec.maxBatchNum", 1)
```
Collaborator

nit: it would be more suitable to call it spark.rapids.sql.test.replay.batch.limit, and to mention whether it is per executor or per task.

Collaborator Author

Done.

@@ -2188,6 +2188,40 @@ val SHUFFLE_COMPRESSION_LZ4_CHUNK_SIZE = conf("spark.rapids.shuffle.compression.
.integerConf
.createWithDefault(1024)

/**
* refer to dev doc: `replay-exec.md`
* only supports "project", will support "agg" later
*/
Collaborator

Nit: TODO. any ticket to link here?

override def otherCopyArgs: Seq[AnyRef] =
Seq[AnyRef](useTieredProject.asInstanceOf[java.lang.Boolean])

Collaborator

nit: are the extra blank lines necessary?

Collaborator Author

Done.

override def output: Seq[Attribute] = projectList.map(_.toAttribute)

override lazy val additionalMetrics: Map[String, GpuMetric] = Map(
OP_TIME -> createNanoTimingMetric(MODERATE_LEVEL, DESCRIPTION_OP_TIME))

override def internalDoExecuteColumnar() : RDD[ColumnarBatch] = {
override def internalDoExecuteColumnar(): RDD[ColumnarBatch] = {
Collaborator

ditto

Collaborator Author

Keeping it, because it follows the code formatting convention.

val numOutputRows = gpuLongMetric(NUM_OUTPUT_ROWS)
val numOutputBatches = gpuLongMetric(NUM_OUTPUT_BATCHES)
val opTime = gpuLongMetric(OP_TIME)
val boundProjectList = GpuBindReferences.bindGpuReferencesTiered(projectList, child.output,
useTieredProject)

val rdd = child.executeColumnar()

// This is for test purpose; dump project list
replayDumper.foreach(d => d.dumpMeta[GpuTieredProject]("GpuTieredProject", boundProjectList))
Collaborator

hide this behind dumpForReplay = true?

Collaborator Author

Renamed to replayDumperOpt, so it shows it's an Option.
Now using replayDumperOpt to distinguish dumpForReplay = true from dumpForReplay = false.
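The Option-based pattern can be sketched like this (ReplayDumper here is a stand-in, not the PR's actual class): when dumping is disabled the value is None and foreach is a no-op, so no separate boolean flag is needed at the call sites.

```scala
// Stand-in for the PR's dumper; only the Option pattern matters here.
final case class ReplayDumper(dir: String) {
  def dumpMeta(name: String): String = s"$dir/$name.meta" // pretend-write, returns the path
}

def makeDumperOpt(dumpEnabled: Boolean, dir: String): Option[ReplayDumper] =
  if (dumpEnabled) Some(ReplayDumper(dir)) else None

// Call sites then read naturally, mirroring `replayDumperOpt.foreach(...)`:
val replayDumperOpt = makeDumperOpt(dumpEnabled = true, dir = "/tmp/replay-exec")
replayDumperOpt.foreach(d => d.dumpMeta("GpuTieredProject"))
```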

f => f.getName.startsWith(s"${projectHash}_cb_data_") &&
f.getName.endsWith(".parquet"))
if (parquets == null || parquets.isEmpty) {
logError(s"Project Exec replayer: there is no cb_data_xxx.parquet file in $replayDir")
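The file filter in the snippet above can be isolated as a plain predicate (a sketch; the file-name scheme is taken from the snippet itself):

```scala
// A replay directory holds per-Exec files named `<hash>_cb_data_<n>.parquet`;
// this predicate selects the batch files belonging to one Project Exec.
def isCbDataFile(projectHash: Int, fileName: String): Boolean =
  fileName.startsWith(s"${projectHash}_cb_data_") && fileName.endsWith(".parquet")
```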
Collaborator

nit: add a usage method and replace all those logError calls with it?

Collaborator Author

Done.

@res-life res-life changed the title [WIP] LOcal REplay framework: dump and replay GpuProjectExec runtime [WIP] LOcal REplay framework: dump and replay GpuProjectExec runtime [databricks] May 21, 2024
@res-life
Collaborator Author

build

@res-life res-life force-pushed the replay-exec branch 2 times, most recently from dee4dfc to fee9e38 Compare May 22, 2024 03:25
@res-life res-life marked this pull request as ready for review May 22, 2024 04:25
@res-life res-life changed the title [WIP] LOcal REplay framework: dump and replay GpuProjectExec runtime [databricks] LOcal REplay framework: dump and replay GpuProjectExec runtime [databricks] May 22, 2024
@res-life
Collaborator Author

Tested dump/replay with a remote directory (e.g. hdfs://path/to/dir) successfully.

@res-life
Collaborator Author

build

Collaborator

@tgravescs tgravescs left a comment

I just mostly looked at docs, not full code, but here are some comments

docs/dev/replay-exec.md
Set the following configurations to enable this feature:

```
spark.conf.set("spark.rapids.sql.test.replay.exec.type", "project")
```
Collaborator

why is there a ".test." in the name of this config? Isn't this for debugging? "test" makes it sound like it is not to be used in real workloads and is only for testing purposes.

Collaborator Author

Now updated to .debug.

```
spark.conf.set("spark.rapids.sql.test.replay.exec.type", "project")
```
Default `type` value is empty, which means do not dump.
Set this `type` to `project` if you want to dump Project Exec runtime data. Currently only supports
Collaborator

can you set it to multiple execs?

Collaborator Author

Updated to exec.types.
Doc: Define the Exec types for dumping, separated by commas, e.g.: project,aggregate,sort.

`project` and empty.

```
spark.conf.set("spark.rapids.sql.test.replay.exec.dumpDir", "file:/tmp")
```
Collaborator

same comment here, shouldn't have .test in the name of config

Collaborator Author

Done.

```
spark.conf.set("spark.rapids.sql.test.replay.exec.dumpDir", "file:/tmp")
```
Default value is `file:/tmp`.
Specifies the dump directory, e.g.: `file:/tmp/my-debug-path`.
Collaborator

can this be a distributed filesystem? You should specify very specifically what is supported. I think we do this in other places in the docs, e.g. for dumping parquet files.

Collaborator Author

Yes, it can be a distributed/remote path. We use the Hadoop FileSystem API to open/write a file stream. If the user specifies the corresponding config to get access to the remote file system, then our code can handle it.
Refer to https://github.com/NVIDIA/spark-rapids/blob/branch-24.06/docs/dev/get-json-object-dump-tool.md

'spark.rapids.sql.expression.GetJsonObject.debugPath': '/tmp/DEBUG_JSON_DUMP/'
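The local-versus-remote dispatch described in that answer boils down to the path's URI scheme. A small self-contained sketch (the real code goes through org.apache.hadoop.fs.FileSystem, which does this resolution internally):

```scala
import java.net.URI

// A path with no scheme is treated as local; otherwise the scheme
// (hdfs, s3a, file, ...) selects the Hadoop FileSystem implementation.
def schemeOf(path: String): String =
  Option(new URI(path).getScheme).getOrElse("file")
```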

<path_to_saved_replay_dir> is the replay directory.
<hash_code_of_project_exec> is the hash code for a specific Project Exec. Dumping may generate
Collaborator

this just replays one exec then? What does the replay do specifically?

What about other things you may want to collect, like number of workers, executor sizes, etc., to try to get the same environment? That seems like good data to have to try to reproduce the same issue.

Collaborator Author

Typically, this tool is used to run a single batch (one that fits in memory) to debug the underlying cuDF/JNI kernel, so it's not related to the number of workers, executor sizes, etc.

Collaborator

That's fine, but a user of this tool likely wants that other information too, and when they replay they likely want a setup as close to the customer's as possible, so making a comment about that would help, I think.

* refer to dev doc: `replay-exec.md`
* only supports "project" now
*/
val TEST_REPLAY_EXEC_TYPE =
Collaborator

same comment as above; I would rather see these called DEBUG instead of test.

Collaborator Author

Done.

Collaborator

please change the define from TEST to DEBUG_ as well, or just remove it.
TEST_REPLAY_EXEC_TYPE -> REPLAY_EXEC_TYPE or DEBUG_REPLAY_EXEC_TYPE

@res-life
Collaborator Author

build

Collaborator

@tgravescs tgravescs left a comment

A couple of high-level questions. How does this scale to other execs? I assume you have to modify every single exec separately?

Also, does this actually work for other execs - hash aggregate, join? What about when multiple batches come in and the exec generates multiple batches? What about if the exec carries some state across batches? Wondering if you have thought about all the different cases.

```
mvn clean install -DskipTests -pl dist -am -DallowConventionalDistJar=true -Dbuildver=330
```
Note: You should specify `-DallowConventionalDistJar`; this option will make sure to generate a
Collaborator

I would rather see this first say that you must build with this option, and then below that have an example and point to our build docs.

Collaborator Author

Done.

```
spark.conf.set("spark.rapids.sql.debug.replay.exec.types", "project")
```
Default `types` value is empty which means do not dump.
Define the Exec types for dumping, separated by comma, e.g.: `project,aggregate,sort`.
Collaborator

can you specify different types of joins and other execs - like broadcast join vs hash join?

Collaborator Author

Agg and sort are future features. This PR only handles project.
Will change the doc to:

Default `types` value is empty, which means do not dump.
Define the Exec types for dumping, separated by commas, e.g.: `project`.
Note: currently only `project` is supported, so there is no need to use a comma.
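Parsing such a comma-separated `types` value is straightforward; a hedged sketch, not the PR's actual parser:

```scala
// Split the config value on commas, trim, lowercase, and drop empties,
// so "" means "dump nothing" and "project" means Set("project").
def parseExecTypes(conf: String): Set[String] =
  conf.split(",").map(_.trim.toLowerCase).filter(_.nonEmpty).toSet
```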

```
spark.conf.set("spark.rapids.sql.debug.replay.exec.batch.limit", 1)
```
This config defines the maximum number of column batches to dump.
Collaborator

what does this mean exactly? If it exceeds the threshold while processing multiple batches, does it dump all of them? If it exceeds the threshold on one batch, do we dump that batch and the ones after it?

I assume this is column batches per exec?

Collaborator Author

Yes, it's a limit per Exec instance.
Once the number of dumped batches exceeds the limit, further batches are skipped and not dumped.
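That per-Exec-instance limit can be sketched with a simple counter (hypothetical class; the PR's implementation may differ):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Each Exec instance keeps its own counter; once `limit` batches have
// been dumped, tryDump() refuses further batches.
final class BatchDumpLimiter(limit: Int) {
  private val dumpedSoFar = new AtomicInteger(0)
  def tryDump(): Boolean = dumpedSoFar.incrementAndGet() <= limit
}
```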

After the job is done, the dump path will contain files like:
```
/tmp/replay-exec:
- xxx_GpuTieredProject.meta // this is the serialized GpuTieredProject case class
```
Collaborator

if I have a huge job that has, say, 10000 projects, is this directory going to become too big? I assume the job keeps running after dumping this out, correct? Is there a way to limit the total number you get?

Collaborator Author

if I have a huge job that has say 10000 projects is this directory going to become too big?

Yes, it will be big. We have the config threshold.timeMS to reduce the number of batches to dump.
Typical usage is: run the query in the customer env, then check the eventlog to find a threshold time; then run in the customer env again with dump enabled and with this threshold time. Usually we only need one batch to reproduce the slowness, and to reproduce it with NSYS.

I assume the job keeps running after dumping this out, correct?

Yes, it keeps running; if needed, we can terminate the process when the first dump is done.

Is there a way to limit the total number you get?

No. In the future we will collect batches for multiple Execs. We will not know in advance how many slow batches there are.


@sameerz sameerz added the tools label May 28, 2024
@res-life res-life changed the base branch from branch-24.06 to branch-24.08 May 29, 2024 00:59
@res-life
Collaborator Author

A couple of high level questions. How does this scale to other execs? I assume you have to modify every single exec separately?

Yes. This PR only dumps Project, which is a simple Exec.

Also does this actually work for other execs - hash aggregate, join.

It does not work for hash aggregate or join.

What about when multiple batches come in and the exec generates multiple batches?

It is covered in this PR. For the agg Exec, it does handle multiple batches.

What about if the exec carries some state across batches? Wondering if you have thought about all the different cases?

Allen and I looked into Hash Agg and found what you mentioned.
Currently, it's not easy to cover all the Execs.
This PR is a start, dumping from the project exec.

Maybe a follow-up can provide a convenient method to handle all the Execs.

@res-life
Collaborator Author

build

@res-life res-life marked this pull request as draft June 3, 2024 08:08
@res-life
Collaborator Author

res-life commented Jun 3, 2024

Changed to draft, because there were errors when testing on Dataproc.

@res-life res-life force-pushed the replay-exec branch 2 times, most recently from 06aeb96 to 5742c4b Compare June 4, 2024 10:14
@res-life
Collaborator Author

res-life commented Jun 4, 2024

build

@res-life res-life marked this pull request as ready for review June 5, 2024 02:34

Successfully merging this pull request may close these issues.

[FEA] Support ProjectExec in LoRe framework