
HIVE-28256: Iceberg: Major QB Compaction on partition level with evolution #5248

Open · wants to merge 2 commits into master
Conversation

@difin (Contributor) commented May 13, 2024

What changes were proposed in this pull request?

Adding support for compacting a given partition of a Hive Iceberg table even if the table has undergone partition evolution. The partition spec can be current or one of the older partition specs of the table.

Why are the changes needed?

Until now, partition-level compaction wasn't supported for Hive Iceberg tables that had undergone partition evolution.

Does this PR introduce any user-facing change?

Yes. Users can now submit partition-level compaction requests for Hive Iceberg tables with a partition spec that matches either the current or one of the previous partition specs of the table.

Is the change a dependency upgrade?

No

How was this patch tested?

New q-tests added

throw new HiveException(ErrorMsg.INVALID_PARTITION_SPEC);
}
partitions = partitions.stream().filter(part -> part.getSpec().size() == partitionSpec.size()).collect(Collectors.toList());
Member:

What are we checking here — the number of partition columns in the table spec versus the compaction request?

Contributor Author:
This validates that the partition spec given in the compaction command matches exactly one partition in the table, not a partial partition spec.

Let's say a table has partitions with specs (a,b) and (a,b,c) because of evolution, and a compaction command is run with spec (a,b). On line 144 it will find both partition specs; after filtering it will have only one, (a,b), and will pass validation.

In another case, assume the table has the same partitions with specs (a,b) and (a,b,c), and a compaction command is run with spec (a). On line 144 it will find both partition specs; after filtering it will have zero partitions and will fail validation with a TOO_MANY_COMPACTION_PARTITIONS exception.
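The two cases the author describes can be sketched in a minimal, self-contained model (the real code operates on Hive Partition objects; the class and method names here are hypothetical stand-ins):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SpecSizeFilterSketch {
    // Hypothetical stand-in: each partition is represented by its spec's column names.
    static List<List<String>> filterBySpecSize(List<List<String>> partitions, int requestedSpecSize) {
        return partitions.stream()
                .filter(part -> part.size() == requestedSpecSize)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Table evolved from spec (a,b) to (a,b,c), so partitions of both specs exist.
        List<List<String>> partitions = List.of(List.of("a", "b"), List.of("a", "b", "c"));

        // Compaction requested with spec (a,b): exactly one spec survives -> passes validation.
        System.out.println(filterBySpecSize(partitions, 2).size());

        // Requested with the partial spec (a): nothing survives -> validation fails.
        System.out.println(filterBySpecSize(partitions, 1).size());
    }
}
```

With spec size 2 the filter keeps exactly one candidate, and with the partial spec size 1 it keeps none, mirroring the pass/fail cases above.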

commitOverwrite(table, branchName, startTime, filesForCommit, rewritePolicy);
Integer compactionSpecId = outputTable.jobContexts.stream()
.findAny()
.map(x -> x.getJobConf().get(CompactorContext.COMPACTION_SPEC_ID))
Member:

Is that the partition spec id? The name is confusing.

Contributor Author:

Fixed to make the name clearer.

.map(x -> x.getJobConf().get(CompactorContext.COMPACTION_PARTITION_PATH))
.orElse(null);

commitOverwrite(table, branchName, startTime, filesForCommit, rewritePolicy, compactionSpecId,
Member:

Should we handle compaction separately, in a different method?

Contributor Author:

I was also thinking about that; done.

Comment on lines 33 to 35
public static final String COMPACTION_PART_SPEC_ID = "compaction_part_spec_id";
public static final String COMPACTION_PARTITION_PATH = "compaction_partition_path";

Contributor:

I think it's better to move such constants to Iceberg-specific classes, since I see them being used only in Iceberg right now.

Contributor Author:

Done

throw new HiveException("Invalid partition spec, no corresponding spec_id found");
}

int specId = partitionList.get(0).second();
Contributor:

Should we add a check that this list contains only one partition, and throw an exception if not?
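The suggested guard might look roughly like this (a hedged sketch: `PartitionInfo` stands in for the Pair-like type the real code uses, and the exception types and messages are illustrative):

```java
import java.util.List;

public class SpecIdLookupSketch {
    // Hypothetical pair of (partition path, spec id).
    static final class PartitionInfo {
        final String path;
        final int specId;
        PartitionInfo(String path, int specId) { this.path = path; this.specId = specId; }
    }

    // Guard along the lines of the review suggestion: the resolved list must
    // contain exactly one entry, otherwise the request is invalid or ambiguous.
    static int resolveSpecId(List<PartitionInfo> partitionList) {
        if (partitionList.isEmpty()) {
            throw new IllegalArgumentException("Invalid partition spec, no corresponding spec_id found");
        }
        if (partitionList.size() > 1) {
            throw new IllegalArgumentException("Partition spec matches more than one partition");
        }
        return partitionList.get(0).specId;
    }

    public static void main(String[] args) {
        System.out.println(resolveSpecId(List.of(new PartitionInfo("a=1/b=2", 0))));
        try {
            resolveSpecId(List.of());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```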

@@ -81,11 +128,12 @@ public boolean run(CompactorContext context) throws IOException, HiveException,
.filter(col -> !partSpecMap.containsKey(col))
.collect(Collectors.joining(","));

compactionQuery = String.format("insert overwrite table %1$s partition(%2$s) select %4$s from %1$s where %3$s",
compactionQuery = String.format("insert overwrite table %1$s partition(%2$s) " +
"select %4$s from %1$s where %3$s and partition__spec__id = %5$d",
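Filling the format string in with concrete values shows the shape of the generated query (all names and values below are hypothetical; the real ones come from the compaction request):

```java
public class CompactionQuerySketch {
    public static void main(String[] args) {
        // Hypothetical inputs resolved from the compaction request.
        String tableName = "db.tbl";
        String partSpec = "a='1',b='2'";
        String whereClause = "a='1' and b='2'";
        String selectCols = "c,d";
        int specId = 0;

        // Same format string as in the patch: the added predicate on the
        // partition__spec__id virtual column restricts the rewrite to files
        // written under the requested partition spec.
        String compactionQuery = String.format(
            "insert overwrite table %1$s partition(%2$s) " +
            "select %4$s from %1$s where %3$s and partition__spec__id = %5$d",
            tableName, partSpec, whereClause, selectCols, specId);

        System.out.println(compactionQuery);
    }
}
```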
Contributor:

Can we try to use VirtualColumn.PARTITION_SPEC_ID.getName() instead of partition__spec__id? That would indicate that we are using a virtual column.

Contributor Author:

Done

.map(x -> x.getJobConf().get(CompactorContext.COMPACTION_PARTITION_PATH))
.orElse(null);

if (rewritePolicy != RewritePolicy.DEFAULT || compactionPartSpecId != null) {
Contributor:

What is the rewrite policy in this case? I see only two enum values, DEFAULT and ALL_PARTITIONS. Is there a chance this can be null?

Contributor Author:

Added a new value, PARTITION, as it was useful for handling your last review comment regarding the validatePartSpec method.

LOG.info("Compaction commit took {} ms for table: {} with {} file(s)", System.currentTimeMillis() - startTime,
table, results.dataFiles().size());
} else {
LOG.info("Empty compaction commit, took {} ms for table: {}", System.currentTimeMillis() - startTime, table);
Contributor:

Does compaction ever reach this statement? Also, the log statement seems misleading: no commit has happened on the table when 0 files are present.

Contributor Author:

I thought that compacting an empty table might reach this place. If there is no data, then there is nothing to commit, which is why there is no commit.

* @param results The object containing the new files
*/
private void commitOverwrite(Table table, String branchName, long startTime, FilesForCommit results) {
Preconditions.checkArgument(results.deleteFiles().isEmpty(), "Can not handle deletes with overwrite");
Contributor:

What is the idea behind this check?

Contributor Author:

I'm not sure; it wasn't added as part of compaction, it was there before.

@@ -1868,7 +1872,8 @@ public void validatePartSpec(org.apache.hadoop.hive.ql.metadata.Table hmsTable,
}

Map<String, Types.NestedField> mapOfPartColNamesWithTypes = Maps.newHashMap();
for (PartitionField partField : table.spec().fields()) {
List<PartitionField> allPartFields = IcebergTableUtil.getAllPartitionFields(table);
Contributor:

getAllPartitionFields essentially returns all partition columns across the different specs of the table, whereas the validatePartSpec API is used in many places where the current table spec is expected. Hence I think this is incorrect.

Doing this might allow performing

insert into table <tableName> partition (previous partition specs) ...

which should not be allowed.

Contributor Author:

Done. Added two new methods, validatePartAnySpec and getPartitionAnySpec, which are needed for Iceberg partition-level compaction, which operates on all specs of a table, not only the latest one.
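The distinction the reviewer raises can be sketched as two separate validation paths (a simplified model: specs are reduced to sets of column names, and the method names mirror but do not reproduce the actual Hive code):

```java
import java.util.List;
import java.util.Set;

public class SpecValidationSketch {
    // Regular DML (e.g. insert) must validate against the current spec only.
    static boolean validateAgainstCurrentSpec(Set<String> currentSpecCols, Set<String> requestedCols) {
        return currentSpecCols.containsAll(requestedCols);
    }

    // Partition-level compaction may validate against any spec, past or current.
    static boolean validateAgainstAnySpec(List<Set<String>> allSpecs, Set<String> requestedCols) {
        return allSpecs.stream().anyMatch(spec -> spec.containsAll(requestedCols));
    }

    public static void main(String[] args) {
        // Suppose the table evolved from spec (a,b) to (a,c).
        Set<String> current = Set.of("a", "c");
        List<Set<String>> all = List.of(Set.of("a", "b"), current);

        // insert into ... partition(a, b) targets a previous spec: must be rejected.
        System.out.println(validateAgainstCurrentSpec(current, Set.of("a", "b")));

        // ...but a compaction request against the old (a,b) spec is legitimate.
        System.out.println(validateAgainstAnySpec(all, Set.of("a", "b")));
    }
}
```

Keeping the any-spec path in a separate method, as done in the fix, avoids loosening the validation used by regular inserts.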

… by spec by any past table specs.

Moved Iceberg compaction constant to a class in Iceberg module.

Use VirtualColumn.PARTITION_SPEC_ID.getName() instead of partition__spec__id.
sonarcloud bot commented May 31, 2024

Quality Gate passed

Issues: 13 new, 0 accepted
Measures: 0 security hotspots; no data about coverage or duplication

See analysis details on SonarCloud

4 participants