yatharth-zeotap changed the title from "[BUG] Unintended Rewrite of Other Partitions During Partition-Level Delta Table Update" to "[BUG] [SPARK] Unintended Rewrite of Other Partitions During Partition-Level Delta Table Update" on May 6, 2024.
Bug
Which Delta project/connector is this regarding?
Describe the problem
We have a Delta table partitioned by the country column. When we run a Delta table update job that updates and inserts data for a single partition value, e.g., country = 'c1', the data in the other partitions is rewritten as well. In effect the entire table is rewritten, including partitions that should remain untouched.
We previously used the same DeltaMergeBuilder construct on Delta 2.1.0, without the whenNotMatchedBySource clause, and it worked as expected.
(originally reported on: https://delta-users.slack.com/archives/CJ70UCSHM/p1714990170284469)
Steps to reproduce
This is the DeltaMergeBuilder I am using. For context, my updates DataFrame contains data only for the single country partition I am processing.
Observed results
The Delta log JSON when whenNotMatchedBySourceDelete is used:
The data for all partitions gets duplicated; every partition is touched.
The Delta log JSON when whenNotMatchedBySourceDelete is not used (commented out):
The data is not duplicated across partitions; only the single intended partition is touched.
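The comparison above can be checked directly from a commit file in the table's _delta_log directory. The following is a minimal sketch, assuming the commit file holds one JSON action per line and that add and remove actions carry a partitionValues map (it may be absent on remove actions). The function name partitions_touched and the "<unknown>" label are illustrative, not part of any Delta API.

```python
import json
from collections import Counter
from pathlib import Path


def partitions_touched(commit_file, partition_column="country"):
    """Count add/remove file actions per partition value in one Delta commit file."""
    counts = {"add": Counter(), "remove": Counter()}
    for line in Path(commit_file).read_text().splitlines():
        if not line.strip():
            continue
        action = json.loads(line)  # each line is a single JSON action
        for kind in ("add", "remove"):
            if kind in action:
                # partitionValues can be missing on remove actions
                values = action[kind].get("partitionValues") or {}
                counts[kind][values.get(partition_column, "<unknown>")] += 1
    return counts
```

Running it on the commit written by the merge (the path is whatever the table's _delta_log contains) and printing the two counters shows whether files outside country = 'c1' were added or removed by that commit.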
Expected results
Only the partition being updated should be rewritten, not the whole table.
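For illustration, here is a minimal sketch of a merge that states the partition predicate explicitly, both in the merge condition and as the condition of whenNotMatchedBySourceDelete. It is not the code from this report. The key column id, the helper names, and the table path argument are assumptions made for the example, and it is not confirmed here that this changes which files Delta 2.3.0 rewrites.

```python
def partition_predicate(column, value, alias="target"):
    """Build a SQL predicate such as: target.country = 'c1'."""
    text = str(value)
    # Keep the sketch simple: refuse values that would need escaping.
    if "'" in text or "\\" in text:
        raise ValueError(f"unsupported partition value: {text!r}")
    return f"{alias}.{column} = '{text}'"


def merge_single_partition(spark, table_path, updates_df, country, key="id"):
    """Upsert one country's rows; delete that country's rows missing from the source."""
    # Imported lazily so partition_predicate works without Delta installed.
    from delta.tables import DeltaTable

    in_partition = partition_predicate("country", country)
    target = DeltaTable.forPath(spark, table_path)
    (
        target.alias("target")
        .merge(
            updates_df.alias("source"),
            # "id" is a hypothetical key column; use the table's real key.
            f"target.{key} = source.{key} AND {in_partition}",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        # Restrict the delete clause to the partition being processed.
        .whenNotMatchedBySourceDelete(condition=in_partition)
        .execute()
    )
```

A call would look like merge_single_partition(spark, table_path, updates, "c1"), where updates holds only rows for country = 'c1'. Inspecting the resulting commit in _delta_log then shows whether other partitions were still touched.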
Environment information
Delta Lake version: io.delta:delta-core_2.12:2.3.0
Spark version: 3.3.1
Scala version: 2.12
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?