[Feature Request][Spark] Optimize automated batching #3081
Comments
I think this is already a thing in Databricks, so it would be great to know if there are any plans to open-source that before I spend a bunch of time on this! @scottsand-db
Adding a DRY RUN option could also help. If you know there are many files to optimize, you can plan accordingly.
Probably good for a separate issue/follow-on. I don't really have that use case, nor do I want to bloat the current PR.
Feature request
Which Delta project/connector is this regarding?
Spark
Overview
Currently, optimize is an all-or-nothing operation over every file in the table, or over a subset selected by a partition filter. The partition filter lets you do manual batching of subsets of the table, but now that clustering is a thing, clustered tables have no partitions to filter on. We should add batch support inside optimize, so that chunks of optimized files can be committed to the transaction log incrementally.
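For context, this is roughly what scoping looks like with the current public API. A minimal spark-shell sketch; the table name `events` and partition column `date` are made up for illustration:

```scala
import io.delta.tables.DeltaTable

// `spark` is an existing SparkSession (available by default in spark-shell).
val table = DeltaTable.forName(spark, "events")

// Full-table optimize: a single all-or-nothing transaction.
table.optimize().executeCompaction()

// Partition-scoped optimize: allows manual batching, but only for
// partitioned tables; clustered tables have no partition columns.
table.optimize().where("date = '2024-01-01'").executeCompaction()
```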
Motivation
Today you could rewrite an entire petabyte of data, fail on the last file, and have all that work be for naught, wasting a great deal of compute time and storage. With automatic batching, nearly all of the results would be committed along the way, and only the one batch that failed would have to be retried.
Further details
I think this can be fairly straightforward: just group the existing bins into another layer of batches, committing each batch separately (see the sketch below).
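A minimal sketch of that grouping idea, assuming greedy size-based packing with one commit per batch. `AddFile`, `Bin`, `compactBin`, and `commitTransaction` here are simplified placeholders, not the real OptimizeExecutor types:

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-ins for the real OPTIMIZE internals.
case class AddFile(path: String, size: Long)
case class Bin(files: Seq[AddFile]) {
  def totalSize: Long = files.map(_.size).sum
}

/** Greedily pack the existing bins into batches of at most maxBatchBytes. */
def groupIntoBatches(bins: Seq[Bin], maxBatchBytes: Long): Seq[Seq[Bin]] = {
  val batches = ArrayBuffer(ArrayBuffer.empty[Bin])
  var currentBytes = 0L
  for (bin <- bins) {
    if (batches.last.nonEmpty && currentBytes + bin.totalSize > maxBatchBytes) {
      batches += ArrayBuffer.empty[Bin] // start a new batch
      currentBytes = 0L
    }
    batches.last += bin
    currentBytes += bin.totalSize
  }
  batches.map(_.toSeq).toSeq
}

// Placeholder rewrite and commit steps (hypothetical).
def compactBin(bin: Bin): AddFile =
  AddFile(s"compacted-${bin.hashCode}.parquet", bin.totalSize)
def commitTransaction(files: Seq[AddFile]): Unit =
  println(s"committed ${files.size} rewritten files")

/** Compact each batch in its own transaction, so a failure loses only
  * the in-flight batch rather than the whole OPTIMIZE run. */
def optimizeInBatches(bins: Seq[Bin], maxBatchBytes: Long): Unit =
  groupIntoBatches(bins, maxBatchBytes).foreach { batch =>
    commitTransaction(batch.map(compactBin))
  }
```

With that shape, the DRY RUN mode suggested above could simply return the planned batches without ever calling the commit step.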
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?