
HIVE-28256: Iceberg: Major QB Compaction on partition level with evolution #5248

Open · wants to merge 2 commits into master
Conversation

@difin (Contributor) commented May 13, 2024

What changes were proposed in this pull request?

Adding support for compacting a given partition of a Hive Iceberg table even if the table has undergone partition evolution. The partition spec can be current or one of the older partition specs of the table.

Why are the changes needed?

Until now, partition-level compaction wasn't supported for Hive Iceberg tables that had undergone partition evolution.

Does this PR introduce any user-facing change?

Yes. Users can now submit partition-level compaction requests for Hive Iceberg tables with a partition spec that matches either the current or one of the previous partition specs of the table.

Is the change a dependency upgrade?

No

How was this patch tested?

New q-tests added

throw new HiveException(ErrorMsg.INVALID_PARTITION_SPEC);
}
partitions = partitions.stream().filter(part -> part.getSpec().size() == partitionSpec.size()).collect(Collectors.toList());
Member:

What are we checking here — the number of partition columns in the table spec versus the compaction request?

Contributor Author:
This validates that the partition spec given in the compaction command matches exactly one partition in the table, not a partial partition spec.

Let's say a table has partitions with specs (a,b) and (a,b,c) because of evolution, and a compaction command is run with spec (a,b). On line 144 it will find both partition specs; after filtering it will have only one, (a,b), and will pass validation.

In another case, assume the table has the same partitions with specs (a,b) and (a,b,c), and a compaction command is run with spec (a). On line 144 it will find both partition specs; after filtering it will have zero partitions and will fail validation with a TOO_MANY_COMPACTION_PARTITIONS exception.
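The two cases the author describes can be sketched in a minimal, self-contained model (the real code operates on Hive Partition objects; the class and method names here are hypothetical stand-ins):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SpecSizeFilterSketch {
    // Hypothetical stand-in: each partition is represented by its spec's column names.
    static List<List<String>> filterBySpecSize(List<List<String>> partitions, int requestedSpecSize) {
        return partitions.stream()
                .filter(part -> part.size() == requestedSpecSize)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Table evolved from spec (a,b) to (a,b,c), so partitions of both specs exist.
        List<List<String>> partitions = List.of(List.of("a", "b"), List.of("a", "b", "c"));

        // Compaction requested with spec (a,b): exactly one spec survives -> passes validation.
        System.out.println(filterBySpecSize(partitions, 2).size());

        // Requested with the partial spec (a): nothing survives -> validation fails.
        System.out.println(filterBySpecSize(partitions, 1).size());
    }
}
```

With spec size 2 the filter keeps exactly one candidate, and with the partial spec size 1 it keeps none, mirroring the pass/fail cases above.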

commitOverwrite(table, branchName, startTime, filesForCommit, rewritePolicy);
Integer compactionSpecId = outputTable.jobContexts.stream()
.findAny()
.map(x -> x.getJobConf().get(CompactorContext.COMPACTION_SPEC_ID))
Member:

Is that the partition spec id? The name is confusing.

Contributor Author:

Fixed to make the name clearer.

.map(x -> x.getJobConf().get(CompactorContext.COMPACTION_PARTITION_PATH))
.orElse(null);

commitOverwrite(table, branchName, startTime, filesForCommit, rewritePolicy, compactionSpecId,
Member:

Should we handle compaction separately, in a different method?

Contributor Author:

I was also thinking about that; done.

Comment on lines 33 to 35
public static final String COMPACTION_PART_SPEC_ID = "compaction_part_spec_id";
public static final String COMPACTION_PARTITION_PATH = "compaction_partition_path";

Contributor:

I think it's better to move such constants to Iceberg-specific classes, since I see them being used only in Iceberg right now.

Contributor Author:

Done

throw new HiveException("Invalid partition spec, no corresponding spec_id found");
}

int specId = partitionList.get(0).second();
Contributor:

Should we add a check that this list contains only one partition, and throw an exception if not?
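The suggested guard might look roughly like this (a hedged sketch: `PartitionInfo` stands in for the Pair-like type the real code uses, and the exception types and messages are illustrative):

```java
import java.util.List;

public class SpecIdLookupSketch {
    // Hypothetical pair of (partition path, spec id).
    static final class PartitionInfo {
        final String path;
        final int specId;
        PartitionInfo(String path, int specId) { this.path = path; this.specId = specId; }
    }

    // Guard along the lines of the review suggestion: the resolved list must
    // contain exactly one entry, otherwise the request is invalid or ambiguous.
    static int resolveSpecId(List<PartitionInfo> partitionList) {
        if (partitionList.isEmpty()) {
            throw new IllegalArgumentException("Invalid partition spec, no corresponding spec_id found");
        }
        if (partitionList.size() > 1) {
            throw new IllegalArgumentException("Partition spec matches more than one partition");
        }
        return partitionList.get(0).specId;
    }

    public static void main(String[] args) {
        System.out.println(resolveSpecId(List.of(new PartitionInfo("a=1/b=2", 0))));
        try {
            resolveSpecId(List.of());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```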

@@ -81,11 +128,12 @@ public boolean run(CompactorContext context) throws IOException, HiveException,
.filter(col -> !partSpecMap.containsKey(col))
.collect(Collectors.joining(","));

compactionQuery = String.format("insert overwrite table %1$s partition(%2$s) select %4$s from %1$s where %3$s",
compactionQuery = String.format("insert overwrite table %1$s partition(%2$s) " +
"select %4$s from %1$s where %3$s and partition__spec__id = %5$d",
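Filling the format string in with concrete values shows the shape of the generated query (all names and values below are hypothetical; the real ones come from the compaction request):

```java
public class CompactionQuerySketch {
    public static void main(String[] args) {
        // Hypothetical inputs resolved from the compaction request.
        String tableName = "db.tbl";
        String partSpec = "a='1',b='2'";
        String whereClause = "a='1' and b='2'";
        String selectCols = "c,d";
        int specId = 0;

        // Same format string as in the patch: the added predicate on the
        // partition__spec__id virtual column restricts the rewrite to files
        // written under the requested partition spec.
        String compactionQuery = String.format(
            "insert overwrite table %1$s partition(%2$s) " +
            "select %4$s from %1$s where %3$s and partition__spec__id = %5$d",
            tableName, partSpec, whereClause, selectCols, specId);

        System.out.println(compactionQuery);
    }
}
```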
Contributor:

Can we try to use VirtualColumn.PARTITION_SPEC_ID.getName() instead of partition__spec__id? That would indicate that we are using a virtual column.

Contributor Author:

Done

.map(x -> x.getJobConf().get(CompactorContext.COMPACTION_PARTITION_PATH))
.orElse(null);

if (rewritePolicy != RewritePolicy.DEFAULT || compactionPartSpecId != null) {
Contributor:

What is the rewrite policy in this case? I see only two enum values, DEFAULT and ALL_PARTITIONS. Is there a chance this can be null?

Contributor Author:

Added a new value, PARTITION, as it was useful for handling your last review comment regarding the validatePartSpec method.

LOG.info("Compaction commit took {} ms for table: {} with {} file(s)", System.currentTimeMillis() - startTime,
table, results.dataFiles().size());
} else {
LOG.info("Empty compaction commit, took {} ms for table: {}", System.currentTimeMillis() - startTime, table);
Contributor:

Does compaction ever reach this statement? Also, the log statement seems misleading: no commit has happened on the table when 0 files are present.

Contributor Author:

I thought that compacting an empty table might reach this place. If there is no data, then there is nothing to commit, which is why there is no commit.

* @param results The object containing the new files
*/
private void commitOverwrite(Table table, String branchName, long startTime, FilesForCommit results) {
Preconditions.checkArgument(results.deleteFiles().isEmpty(), "Can not handle deletes with overwrite");
Contributor:

What is the idea behind this check?

Contributor Author:

I'm not sure; it wasn't added as part of compaction, it was there before.

@@ -1868,7 +1872,8 @@ public void validatePartSpec(org.apache.hadoop.hive.ql.metadata.Table hmsTable,
}

Map<String, Types.NestedField> mapOfPartColNamesWithTypes = Maps.newHashMap();
for (PartitionField partField : table.spec().fields()) {
List<PartitionField> allPartFields = IcebergTableUtil.getAllPartitionFields(table);
Contributor:

getAllPartitionFields essentially returns all partition columns across the different specs of the table, whereas the validatePartSpec API is used in many places where the current table spec is expected. Hence I think this is incorrect.

Doing this might allow performing

insert into table <tableName> partition (previous partition specs) ...

which should not be allowed.

Contributor Author:

Done. Added two new methods, validatePartAnySpec and getPartitionAnySpec, which are needed for Iceberg partition-level compaction, which operates on all specs of a table, not only the latest one.
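The distinction the reviewer raises can be sketched as two separate validation paths (a simplified model: specs are reduced to sets of column names, and the method names mirror but do not reproduce the actual Hive code):

```java
import java.util.List;
import java.util.Set;

public class SpecValidationSketch {
    // Regular DML (e.g. insert) must validate against the current spec only.
    static boolean validateAgainstCurrentSpec(Set<String> currentSpecCols, Set<String> requestedCols) {
        return currentSpecCols.containsAll(requestedCols);
    }

    // Partition-level compaction may validate against any spec, past or current.
    static boolean validateAgainstAnySpec(List<Set<String>> allSpecs, Set<String> requestedCols) {
        return allSpecs.stream().anyMatch(spec -> spec.containsAll(requestedCols));
    }

    public static void main(String[] args) {
        // Suppose the table evolved from spec (a,b) to (a,c).
        Set<String> current = Set.of("a", "c");
        List<Set<String>> all = List.of(Set.of("a", "b"), current);

        // insert into ... partition(a, b) targets a previous spec: must be rejected.
        System.out.println(validateAgainstCurrentSpec(current, Set.of("a", "b")));

        // ...but a compaction request against the old (a,b) spec is legitimate.
        System.out.println(validateAgainstAnySpec(all, Set.of("a", "b")));
    }
}
```

Keeping the any-spec path in a separate method, as done in the fix, avoids loosening the validation used by regular inserts.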

… by spec by any past table specs.

Moved Iceberg compaction constant to a class in Iceberg module.

Use VirtualColumn.PARTITION_SPEC_ID.getName() instead of partition__spec__id.
sonarcloud bot commented May 31, 2024

Quality Gate passed

Issues: 13 new, 0 accepted
Measures: 0 security hotspots; no data about coverage or duplication

See analysis details on SonarCloud

4 participants