[Spark 4.0] Account for `PartitionedFileUtil.splitFiles` signature change. #10857

mythrocks · 2024-05-21T21:59:40Z

In Apache Spark 4.0, the signature of `PartitionedFileUtil.splitFiles` was changed to remove unused parameters (apache/spark@eabea643c74). This breaks the Spark RAPIDS plugin build against Spark 4.0.

This commit introduces a shim to account for the signature change.

Fixes #10299.
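
For readers unfamiliar with the shim approach, here is a minimal sketch of the idea. It does not use the real Spark or plugin classes: `Session`, `FileInfo`, `Split`, and both `...PartitionedFileUtil` objects are hypothetical stand-ins, and the parameter lists are simplified assumptions rather than the actual Spark signatures. In the plugin, the per-version shims would live in version-specific source directories under one name; the sketch uses two object names only so that both fit in one file.

```scala
// Hypothetical stand-ins for Spark types, so the sketch compiles without Spark.
final case class Session(name: String)
final case class FileInfo(path: String, length: Long)
final case class Split(path: String, start: Long, length: Long)

// Stand-in for the newer utility: the unused session parameter is gone.
object NewPartitionedFileUtil {
  def splitFiles(file: FileInfo, isSplitable: Boolean, maxSplitBytes: Long): Seq[Split] =
    if (isSplitable) {
      // Cut the file into chunks of at most maxSplitBytes.
      (0L until file.length by maxSplitBytes).map { offset =>
        Split(file.path, offset, math.min(maxSplitBytes, file.length - offset))
      }
    } else {
      Seq(Split(file.path, 0L, file.length))
    }
}

// Stand-in for the older utility: accepts a session that it never uses.
object LegacyPartitionedFileUtil {
  def splitFiles(session: Session, file: FileInfo, isSplitable: Boolean,
      maxSplitBytes: Long): Seq[Split] =
    NewPartitionedFileUtil.splitFiles(file, isSplitable, maxSplitBytes)
}

// Callers depend only on this stable interface (hypothetical name).
trait SplitFilesShim {
  def splitFiles(session: Session, file: FileInfo, isSplitable: Boolean,
      maxSplitBytes: Long): Seq[Split]
}

// Shim compiled for the older Spark versions: forwards every argument.
object PreSpark40Shim extends SplitFilesShim {
  override def splitFiles(session: Session, file: FileInfo, isSplitable: Boolean,
      maxSplitBytes: Long): Seq[Split] =
    LegacyPartitionedFileUtil.splitFiles(session, file, isSplitable, maxSplitBytes)
}

// Shim compiled for Spark 4.0: drops the argument the new signature no longer takes.
object Spark40Shim extends SplitFilesShim {
  override def splitFiles(session: Session, file: FileInfo, isSplitable: Boolean,
      maxSplitBytes: Long): Seq[Split] =
    NewPartitionedFileUtil.splitFiles(file, isSplitable, maxSplitBytes)
}
```

With this layout, call sites keep one signature and only the shim body differs per Spark version.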

Signed-off-by: MithunR <[email protected]>

mythrocks · 2024-05-22T01:43:14Z

Build

...n/spark341db/scala/org/apache/spark/sql/execution/rapids/shims/PartitionedFileUtilShim.scala

mythrocks · 2024-05-22T06:07:54Z

Build

...lugin/src/main/spark341db/scala/com/nvidia/spark/rapids/shims/PartitionedFileUtilsShim.scala

razajafri

LGTM. Just update the copyrights on one file

mythrocks · 2024-05-28T18:37:07Z

~~(Working on the style fixes.)~~

Edit: Fixed.

mythrocks · 2024-05-28T19:39:15Z

Build

sql-plugin/src/main/spark350/scala/com/nvidia/spark/rapids/shims/PartitionedFileUtilsShim.scala

mythrocks · 2024-05-29T17:27:06Z

I've merged up to pull in #10933.

razajafri · 2024-05-29T17:33:29Z

build

razajafri · 2024-05-29T18:00:27Z

CI failed with Databricks build result : FAILURE

razajafri · 2024-05-29T18:00:29Z

build

mythrocks · 2024-05-29T21:41:54Z

Ah, this PR is also failing CI on the following tangential problem:

2024-05-29T18:16:38.0437840Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark330db/scala/org/apache/spark/rapids/execution/GpuSubqueryBroadcastMeta.scala:33: GpuSubqueryBroadcastMeta is already defined as class GpuSubqueryBroadcastMeta
2024-05-29T18:16:38.0438346Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [ERROR] class GpuSubqueryBroadcastMeta(
2024-05-29T18:16:38.0438690Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [ERROR]       ^
2024-05-29T18:16:38.0439076Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [ERROR] one error found
2024-05-29T18:16:38.0439680Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [INFO] ------------------------------------------------------------------------

#10945 should help.

razajafri · 2024-05-30T06:49:21Z

build

mythrocks · 2024-05-30T18:23:54Z

Looks like this will need special handling for Databricks:

/org/apache/spark/sql/execution/rapids/shims/SplitFiles.scala:59: value length is not a member of Nothing
2024-05-30T07:07:13.7388140Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]       }.sortBy(_.length)(implicitly[Ordering[Long]].reverse)
2024-05-30T07:07:13.7389160Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]                  ^
2024-05-30T07:07:13.7391095Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/execution/rapids/shims/SplitFiles.scala:82: value sortBy is not a member of Array[Nothing]
2024-05-30T07:07:13.7393196Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] possible cause: maybe a semicolon is missing before `value sortBy'?
2024-05-30T07:07:13.7394447Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]     }.sortBy(_.length)(implicitly[Ordering[Long]].reverse)
2024-05-30T07:07:13.7395420Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]       ^
2024-05-30T07:07:13.7397343Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/execution/rapids/shims/SplitFiles.scala:40: local method canBeSplit in method splitFiles is never used
2024-05-30T07:07:13.7399511Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]     def canBeSplit(filePath: Path, hadoopConf: Configuration): Boolean = {
2024-05-30T07:07:13.7400581Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]         ^
2024-05-30T07:07:13.7402484Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/execution/rapids/shims/SplitFiles.scala:72: local val isSplitable in value $anonfun is never used
2024-05-30T07:07:13.7404528Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]         val isSplitable = relation.fileFormat.isSplitable(
2024-05-30T07:07:13.7405531Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]             ^

I'll get on this shortly.
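
For context, the `sortBy(_.length)(implicitly[Ordering[Long]].reverse)` expression flagged in the log orders elements by length, largest first. A minimal sketch of that ordering, using a hypothetical `FileSplit` stand-in rather than Spark's own type:

```scala
// Hypothetical stand-in for a file split; only the length matters here.
final case class FileSplit(path: String, length: Long)

object SplitOrdering {
  // Sort by length using the reversed Long ordering, i.e. descending.
  def largestFirst(splits: Seq[FileSplit]): Seq[FileSplit] =
    splits.sortBy(_.length)(implicitly[Ordering[Long]].reverse)
}
```

The expression only type-checks when the element type is known to have a `length: Long` member, which is consistent with the errors above appearing once the element type was inferred as `Nothing`.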

mythrocks · 2024-05-31T00:03:12Z

> Looks like this will need special handling for Databricks...

Barked up the wrong tree for a bit. This was only a missing import. Testing the fix now.

mythrocks · 2024-05-31T00:09:15Z

Build

mythrocks · 2024-05-31T07:20:27Z

This change has been merged. Thank you for the reviews, @razajafri, @NVnavkumar.

mythrocks added the audit_4.0.0 Audit related tasks for 4.0.0 label May 21, 2024

mythrocks self-assigned this May 21, 2024

mythrocks requested a review from razajafri May 22, 2024 01:23

razajafri reviewed May 22, 2024

...n/spark341db/scala/org/apache/spark/sql/execution/rapids/shims/PartitionedFileUtilShim.scala

Common base for PartitionFileUtilsShims.

5ab7e20

Signed-off-by: MithunR <[email protected]>

mythrocks mentioned this pull request May 22, 2024

[BUG] Update Plugin to use the new getPartitionedFile method #10606

Closed

mythrocks changed the title ~~Account for PartitionedFileUtil.splitFiles signature change.~~ [Spark 4.0] Account for PartitionedFileUtil.splitFiles signature change. May 22, 2024

Reusing existing PartitionedFileUtilsShims.

c98efb1

mythrocks requested a review from razajafri May 22, 2024 21:27

mythrocks mentioned this pull request May 22, 2024

[Spark 4.0] Account for CommandUtils.uncacheTableOrView signature change. #10863

Merged

More refactor, for pre-3.5 compile.

0685d9b

razajafri reviewed May 23, 2024

...lugin/src/main/spark341db/scala/com/nvidia/spark/rapids/shims/PartitionedFileUtilsShim.scala

razajafri reviewed May 23, 2024

Updated Copyright date.

62b8eac

Fixed style error.

8ad740c

mythrocks changed the base branch from branch-24.06 to branch-24.08 May 28, 2024 19:32

mythrocks requested a review from razajafri May 28, 2024 19:41

razajafri reviewed May 28, 2024

sql-plugin/src/main/spark350/scala/com/nvidia/spark/rapids/shims/PartitionedFileUtilsShim.scala

mythrocks added 3 commits May 28, 2024 21:58

Re-fixed the copyright year.

3b7c632

Merge remote-tracking branch 'origin/branch-24.08' into spark40-audit…

9936696

…-spark-42821

Merge remote-tracking branch 'origin/branch-24.08' into spark40-audit…

09f0c46

…-spark-42821

mythrocks requested a review from razajafri May 29, 2024 17:27

razajafri previously approved these changes May 29, 2024

razajafri added Spark 4.0+ Spark 4.0+ issues and removed audit_4.0.0 Audit related tasks for 4.0.0 labels May 29, 2024

mythrocks added 2 commits May 30, 2024 16:18

Merge remote-tracking branch 'origin/branch-24.08' into spark40-audit…

375c99e

…-spark-42821

Added missing import.

b2ed2f4

mythrocks dismissed razajafri’s stale review via b2ed2f4 May 30, 2024 23:58

NVnavkumar approved these changes May 31, 2024

mythrocks merged commit 822ad9b into NVIDIA:branch-24.08 May 31, 2024
44 checks passed

mythrocks mentioned this pull request May 31, 2024

[AUDIT][SPARK-42821][SQL] Remove unused parameters in splitFiles methods #10299

Closed
