[Feature Request][Spark] Optimize automated batching #3081
Comments
I think this is already a thing in Databricks, so it would be great to know if there are any plans to open-source that before I spend a bunch of time on this! @scottsand-db
Adding a DRY RUN option could also help. If you know there are many files to optimize, you can plan accordingly.
Probably good for a separate issue/follow-on. I don't really have that use case, nor do I want to bloat the current PR.
Feature request
Which Delta project/connector is this regarding?
Spark
Overview
Currently, optimize is an all-or-nothing operation over every file in the table, or over a subset selected by a partition filter. The partition filter lets you do manual batching of subsets of the table, but now that clustering is a thing, clustered tables have no partitions to filter on. We should add batch support inside optimize, so that chunks of optimized files can be committed to the transaction log incrementally.
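For context, this is roughly what scoping looks like with the current public API. A minimal spark-shell sketch; the table name `events` and partition column `date` are made up for illustration:

```scala
import io.delta.tables.DeltaTable

// `spark` is an existing SparkSession (available by default in spark-shell).
val table = DeltaTable.forName(spark, "events")

// Full-table optimize: a single all-or-nothing transaction.
table.optimize().executeCompaction()

// Partition-scoped optimize: allows manual batching, but only for
// partitioned tables; clustered tables have no partition columns.
table.optimize().where("date = '2024-01-01'").executeCompaction()
```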
Motivation
Today you could rewrite an entire petabyte of data, fail on the last file, and have all that work be for naught, wasting a great deal of compute time and storage. With automatic batching, nearly all of the results would be committed along the way, and only the one batch that failed would have to be retried.
Further details
I think this can be fairly straightforward: just group the existing bins into another layer of batches, committing each batch separately (see the sketch below).
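A minimal sketch of that grouping idea, assuming greedy size-based packing with one commit per batch. `AddFile`, `Bin`, `compactBin`, and `commitTransaction` here are simplified placeholders, not the real OptimizeExecutor types:

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-ins for the real OPTIMIZE internals.
case class AddFile(path: String, size: Long)
case class Bin(files: Seq[AddFile]) {
  def totalSize: Long = files.map(_.size).sum
}

/** Greedily pack the existing bins into batches of at most maxBatchBytes. */
def groupIntoBatches(bins: Seq[Bin], maxBatchBytes: Long): Seq[Seq[Bin]] = {
  val batches = ArrayBuffer(ArrayBuffer.empty[Bin])
  var currentBytes = 0L
  for (bin <- bins) {
    if (batches.last.nonEmpty && currentBytes + bin.totalSize > maxBatchBytes) {
      batches += ArrayBuffer.empty[Bin] // start a new batch
      currentBytes = 0L
    }
    batches.last += bin
    currentBytes += bin.totalSize
  }
  batches.map(_.toSeq).toSeq
}

// Placeholder rewrite and commit steps (hypothetical).
def compactBin(bin: Bin): AddFile =
  AddFile(s"compacted-${bin.hashCode}.parquet", bin.totalSize)
def commitTransaction(files: Seq[AddFile]): Unit =
  println(s"committed ${files.size} rewritten files")

/** Compact each batch in its own transaction, so a failure loses only
  * the in-flight batch rather than the whole OPTIMIZE run. */
def optimizeInBatches(bins: Seq[Bin], maxBatchBytes: Long): Unit =
  groupIntoBatches(bins, maxBatchBytes).foreach { batch =>
    commitTransaction(batch.map(compactBin))
  }
```

With that shape, the DRY RUN mode suggested above could simply return the planned batches without ever calling the commit step.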
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?