[Spark] Optimize batching / incremental progress #3089
base: master
Conversation
val filesToProcess = bins.flatMap(_._2)

txn.trackFilesRead(filesToProcess)
This should be the only difference with the existing default behavior. Only the filtered candidates are registered with the transaction, not all matching files.
val filesToProcess = bins.flatMap(_._2)

txn.trackFilesRead(filesToProcess)
txn.trackReadPredicates(partitionPredicate)
I assume the partition predicate should still be registered even if it's just a subset of the partition that's being processed. This is already what's happening anyway with the candidates being filtered.
Yes, this looks correct to me compared with filtering code through OptimisticTransaction
Thanks for the change, I left a few comments and we need more tests for this change.
@@ -61,17 +62,17 @@ abstract class OptimizeTableCommandBase extends RunnableCommand with DeltaComman
   */
  def validateZorderByColumns(
      spark: SparkSession,
      txn: OptimisticTransaction,
Can you delete the obsolete comment above for txn and replace it with snapshot?
* @param txn the [[OptimisticTransaction]] being used to optimize
Updated
@@ -273,34 +280,17 @@ class OptimizeExecutor(

val maxThreads =
This variable can be removed now?
Yep good catch
bins: Seq[(Map[String, String], Seq[AddFile])],
batchSize: Long)
  : Seq[Seq[(Map[String, String], Seq[AddFile])]] = {
val batches = new ArrayBuffer[Seq[(Map[String, String], Seq[AddFile])]]()
Now that it has multiple nested containers, can we add named types for Seq[(Map[String, String], Seq[AddFile])] and (Map[String, String], Seq[AddFile]) to make it more readable? Something like

case class Bin(partitionValue: Map[String, String], files: Seq[AddFile])
case class Batch(bins: Seq[Bin])

or similar.
Yep thought these were getting verbose. Added case classes.
@@ -309,10 +299,10 @@ class OptimizeExecutor(
    optimizeStats.totalConsideredFiles = candidateFiles.size
    optimizeStats.totalFilesSkipped = optimizeStats.totalConsideredFiles - removedFiles.size
    optimizeStats.totalClusterParallelism = sparkSession.sparkContext.defaultParallelism
-   val numTableColumns = txn.snapshot.metadata.schema.size
+   val numTableColumns = snapshot.metadata.schema.size
Shall we rename optimizeStats.numBatches at line 298 to optimizeStats.numBins, since it is not a batch any more with this change? Also, we probably want to add optimizeStats.numBins = jobs.size.
Yeah, that makes sense. I also kept a numBatches to actually represent the number of batches? Or is that bad because it changes the meaning of the existing stat?
- val rows = new OptimizeExecutor(spark, txn, partitionPredicates, Seq(), true, optimizeContext)
-   .optimize()
+ val rows = new OptimizeExecutor(spark, deltaLog.update(), catalogTable, partitionPredicates,
+   Seq(), true, optimizeContext).optimize()
nit: I know this is pre-existing, but can you spell out the arguments for Seq() and true for better readability?
.internal()
.doc(
  """
    |The size of a batch within an OPTIMIZE JOB. After a batch is complete, it's
Suggested change:
- |The size of a batch within an OPTIMIZE JOB. After a batch is complete, it's
+ |The size of a batch within an OPTIMIZE JOB. After a batch is complete, its
}
val batchResults = batchSize match {
  case Some(size) =>
    groupBinsIntoBatches(jobs, size).map(runOptimizeBatch(_, maxFileSize))
Correct me if I am wrong. The files are bin-packed in the following steps:
1. groupFilesIntoBins: bin-packs files according to OptimizeTableStrategy.maxBinSize, respecting partition boundaries.
2. groupBinsIntoBatches: groups multiple bins into one batch; bins from multiple partitions can end up in the same batch.

So each transaction can have data from multiple partitions. The ConflictChecker rejects a transaction when two txns write to the same partition. That won't be a problem for the txns within a single OPTIMIZE, since batches are executed in serial order. For concurrent OPTIMIZE commands, we could consider having each batch include only single-partition data, to minimize the chance of conflicts between concurrent OPTIMIZEs.
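The grouping step described above can be sketched in plain Scala. Note that the AddFile/Bin stand-ins and the greedy threshold logic below are illustrative guesses at the behavior, not the actual Delta implementation:

```scala
import scala.collection.mutable.ArrayBuffer

// Minimal stand-ins for illustration; the real types live in Delta's codebase.
case class AddFile(path: String, size: Long)
case class Bin(partitionValues: Map[String, String], files: Seq[AddFile])

// Greedy packing: whole bins are appended to the current batch until the
// accumulated file size reaches batchSize; a bin is never split across batches.
def groupBinsIntoBatches(bins: Seq[Bin], batchSize: Long): Seq[Seq[Bin]] = {
  val batches = ArrayBuffer[Seq[Bin]]()
  val current = ArrayBuffer[Bin]()
  var currentSize = 0L
  for (bin <- bins) {
    current += bin
    currentSize += bin.files.map(_.size).sum
    if (currentSize >= batchSize) {
      batches += current.toSeq
      current.clear()
      currentSize = 0L
    }
  }
  if (current.nonEmpty) batches += current.toSeq
  batches.toSeq
}
```

Because whole bins are kept intact, a batch can overshoot batchSize by up to one bin's worth of files, which is why partitioned bins from different partitions can land in the same batch (and hence the same transaction).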
Hmm, I didn't look at the ConflictChecker when working on this. I did have in mind an improvement to run multiple batches simultaneously/overlapping, to avoid the tail of execution for each batch. There's no conceptual reason the commits should conflict, since they read specific files and don't change data. If they would conflict, that might just be an improvement to make in the ConflictChecker.
Though since I'm creating each transaction off the same original snapshot, even running them serially they should invoke the same conflict checking against each other, right? And the current simple test doesn't seem to have an issue. I'll see if I encounter anything in further tests.
Actually, I guess the Optimize command does its own conflict checking to resolve that.
assert(files.values.forall(_.length == 1))
// The last 5 commits in the history should be optimize batches, one for each partition
val commits = deltaLog.history.getHistory(None)
assert(commits.filter(_.operation == "OPTIMIZE").length == optimizeCommits)
Can you add a check that the data before and after OPTIMIZE is the same?
Will do
@@ -536,6 +536,39 @@ trait OptimizeCompactionSuiteBase extends QueryTest
  }
}

test("optimize command with batching") {
Can you add more tests to cover:
- OPTIMIZE WHERE for a partitioned table. This makes sure the batching works correctly with the filter.
- Since this change also impacts zorder by and cluster by, we need to add tests for both of them to validate that batching works as expected.
- OPTIMIZE on an empty table. Make sure it doesn't trigger any divide-by-zero errors.
Thanks I'll work on those
Which Delta project/connector is this regarding?
Description
Resolves #3081
Adds support for splitting an optimize run into batches with a new config, spark.databricks.delta.optimize.batchSize. Batches are created by grouping existing bins together until batchSize is reached. The default behavior remains the same; batching is only enabled if batchSize is configured.

This will apply to all optimization paths. I don't see any reason it shouldn't apply to compaction, z-ordering, clustering, auto-compaction, or reorg/DV rewriting if a user configures it.
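Enabling batching might then look something like the following; treat this as a sketch, since the exact value format accepted by the config parser (bytes shown here) is an assumption:

```scala
// Hypothetical usage on an existing SparkSession with Delta configured.
// 100 GB per batch, chosen purely for illustration.
spark.conf.set("spark.databricks.delta.optimize.batchSize",
  (100L * 1024 * 1024 * 1024).toString)
spark.sql("OPTIMIZE delta.`/path/to/table`")
```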
The way transactions are handled within the optimize executor had to be updated. Instead of creating a transaction upfront, we list all the files in the most recent snapshot, and then create transactions for each batch.
This is very important for clustering, as there is no way to manually optimize a partial subset of the table using partition filtering. Without batching, a lot of execution time and storage space could be wasted if something fails before the entire table finishes optimizing.
How was this patch tested?
A simple new UT is added. I can add others as well; I'm just looking for feedback on the approach and suggestions for what other tests to add.
Does this PR introduce any user-facing changes?
Yes, adds new capability to optimization that is disabled by default.