HIVE-28259: Common table expression detection and rewrites using CBO #5249

zabetak · 2024-05-14T14:21:32Z

What changes were proposed in this pull request?

Add applyCteRewriting phase in CalcitePlanner for detecting and using CTEs; ensure rewrite logic is consistent with existing hive.optimize.cte.materialize.* properties.
Model CTEs as materialized views (MVs) and add utility method in HiveMaterializedViewUtils for mapping a CTE to a RelOptMaterialization.
Refactor core MV rewrite logic in CalcitePlanner to use during CTE rewrite and exploit CTEs in a cost-based manner.
Add HiveTableSpool operator to represent CTEs and handle them in the plan using new rules: TableScanToSpoolRule and RemoveUnusedCteRule.
Add TableScanToSpoolRule, and RemoveUnusedCteRule to add/remove spools from the plan.
Enhance/Enrich metadata handlers for handling the Spool operator.
Add AggregatedColumns metadata (and respective handler and metadata query), for controlling if a CTE is a "full aggregate" at the CBO (RelNode) level to ensure consistent behavior with hive.optimize.cte.materialize.full.aggregate.only property.
Add HiveSqlTypeUtil.containsSqlType for detecting and skipping the creation of CTEs with untyped nulls since they are not supported (HIVE-11217).
Add hive.optimize.cte.suggester.class and CommonTableExpressionSuggester interface to provide pluggable CTE detection logic. Given that CTE detection logic can range from basic tree traversal algorithms to complex workload analysis frameworks this part needs to be configurable since there is no one-size-fits-all implementation. The configuration property also allows proprietary algorithms to be integrated in HiveServer2 by implementing the necessary APIs and adding the jars in the classpath.
Add prototype implementation for CTE detection logic in CommonTableExpressionIdentitySuggester using CommonRelSubExprRegisterRule and CommonTableExpressionRegistry. Although the implementation is rather simple it can discover various interesting CTEs as demonstrated by the tests and can be indeed useful in a prod environment.
Map spool(s) to WITH clauses during the RelNode to AST conversion (ASTConverter, ASTBuilder, PlanModifierForASTConv) to exploit existing CTE materialization feature (HIVE-11752).
Modify (slighly) SemanticAnalyzer/CalcitePlanner to enable AST-based CTE materialization (getMetadata) post CBO run.

Why are the changes needed?

Open the road for cost-based CTE optimizations in Hive
Pluggable/Extensible CTE detection logic

Does this PR introduce any user-facing change?

By default no. After tuning the new/existing cte properties query plans may change affecting performance.

Is the change a dependency upgrade?

No

How was this patch tested?

Tests for:

CTE detection logic using the CommonTableExpressionPrintSuggester (TestTezTPCDS30TBPerfCliDriver)
demonstrating (end-to-end) the CTE feature (cte_cbo_rewrite_0.q)
verify coherence of CTE rewrite with hive.optimize.cte.materialize.full.aggregate.only (cte_mat_12.q)
spool JSON serialization (cte_cbo_plan_json.q)

1. Add `applyCteRewriting` phase in `CalcitePlanner` for detecting and using CTEs; ensure rewrite logic is consistent with existing `hive.optimize.cte.materialize.*` properties. 2. Model CTEs as materialized views (MVs) and add utility method in `HiveMaterializedViewUtils` for mapping a CTE to a `RelOptMaterialization`. 3. Refactor core MV rewrite logic in `CalcitePlanner` to use during CTE rewrite and exploit CTEs in a cost-based manner. 4. Add `HiveTableSpool` operator to represent CTEs and handle them in the plan using new rules: `TableScanToSpoolRule` and `RemoveUnusedCteRule`. 5. Add `TableScanToSpoolRule`, and `RemoveUnusedCteRule` to add/remove spools from the plan. 6. Enhance/Enrich metadata handlers for handling the Spool operator. 7. Add `AggregatedColumns` metadata (and respective handler and metadata query), for controlling if a CTE is a "full aggregate" at the CBO (RelNode) level to ensure consistent behavior with `hive.optimize.cte.materialize.full.aggregate.only` property. 8. Add `HiveSqlTypeUtil.containsSqlType` for detecting and skipping the creation of CTEs with untyped nulls since they are not supported (HIVE-11217). 9. Add `hive.optimize.cte.suggester.class` and CommonTableExpressionSuggester interface to provide pluggable CTE detection logic. Given that CTE detection logic can range from basic tree traversal algorithms to complex workload analysis frameworks this part needs to be configurable since there is no one-size-fits-all implementation. The configuration property also allows proprietary algorithms to be integrated in HiveServer2 by implementing the necessary APIs and adding the jars in the classpath. 10. Add prototype implementation for CTE detection logic in CommonTableExpressionIdentitySuggester using CommonRelSubExprRegisterRule and CommonTableExpressionRegistry. Although the implementation is rather simple it can discover various interesting CTEs as demonstrated by the tests and can be indeed useful in a prod environment. 11. Map spool(s) to `WITH` clauses during the RelNode to AST conversion (ASTConverter, ASTBuilder, PlanModifierForASTConv) to exploit existing CTE materialization feature (HIVE-11752). 12. Modify (slighly) `SemanticAnalyzer`/`CalcitePlanner` to enable AST-based CTE materialization (getMetadata) post CBO run. 13. Tests for: * CTE detection logic using the `CommonTableExpressionPrintSuggester` (`TestTezTPCDS30TBPerfCliDriver`) * demonstrating (end-to-end) the CTE feature (cte_cbo_rewrite_0.q) * verify coherence of CTE rewrite with `hive.optimize.cte.materialize.full.aggregate.only` (cte_mat_12.q) * spool JSON serialization (cte_cbo_plan_json.q)

sonarcloud · 2024-05-14T15:12:19Z

Quality Gate passed

Issues
33 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

asolimando

LGTM, pretty elegant approach cleanly implemented. I have left few minor comments but nothing blocking, hence the approval.

I have only skimmed few out files for the TPCDS test suite, I haven't checked them all.

asolimando · 2024-05-14T14:57:32Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/CommonRelSubExprRegisterRule.java

+ @Override
+ public void onMatch(final RelOptRuleCall call) {
+ CommonTableExpressionRegistry r = call.getPlanner().getContext().unwrap(CommonTableExpressionRegistry.class);
+ if (r != null) {


In what cases it's expected to be null?

asolimando · 2024-05-14T14:59:09Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePartitionPruneRule.java

@@ -50,6 +50,9 @@ protected void perform(RelOptRuleCall call, Filter filter,
 HiveTableScan tScan) {
 // Original table
 RelOptHiveTable hiveTable = (RelOptHiveTable) tScan.getTable();
+ if (hiveTable.getHiveTableMD().isTemporary()) {


Maybe a comment could help people down the line, it's clear in the PR that we are introducing temporary tables for CTEs but it might be confusing by looking at the rule in isolation

asolimando · 2024-05-14T15:32:48Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/RemoveUnusedCteRule.java

+ * Rule that removes unused CTEs from the plan by expanding the respective TableScans.
+ * <p>The rule assumes that all materializations registered in the planner refer to CTEs.</p>
+ */
+public class RemoveUnusedCteRule extends RelOptRule {


Nit: "Unused" it's a bit misleading as it's usually means used 0 times, I don't have a better name right away but it might be worth thinking about it

asolimando · 2024-05-14T15:34:26Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/RemoveUnusedCteRule.java

+public class RemoveUnusedCteRule extends RelOptRule {
+
+ private final int referenceThreshold;
+ private final Map<List<String>, Long> tableOccurrences;


Nit: I think Integer is big enough to reference CTEs within a single query plan.

asolimando · 2024-05-14T15:36:42Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/RemoveUnusedCteRule.java

+ private final int referenceThreshold;
+ private final Map<List<String>, Long> tableOccurrences;
+
+ public RemoveUnusedCteRule(Map<List<String>, Long> tableOccurrences, int referenceThreshold) {


Similarly, if the occurrences need a Long, referenceThreshold needs long type too.

asolimando · 2024-05-14T15:43:24Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/AggregatedColumns.java

+/**
+ * Metadata about whether a set of columns originates from aggregation functions.
+ */
+public interface AggregatedColumns extends Metadata {


Question: don't we have anything similar in Calcite already? If not, it's too Hive specific or we could file a Jira to add it?

asolimando · 2024-05-14T15:49:29Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTBuilder.java

- .add(HiveParser.Identifier, hTbl.getHiveTableMD().getDbName())
- .add(HiveParser.Identifier, hTbl.getHiveTableMD().getTableName());
+ ASTBuilder tableNameBuilder = ASTBuilder.construct(HiveParser.TOK_TABNAME, "TOK_TABNAME");
+ if (!hTbl.getHiveTableMD().isTemporary()) {


Nit: as discussed elsewhere, maybe a comment could be useful

asolimando · 2024-05-14T15:51:51Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java

- return c.convert();
+ ASTConverter c = new ASTConverter(root, 0, planMapper, new ArrayList<>());
+ ASTNode r = c.convert();
+ for (ASTNode cte : c.ctes) {


Not super familiar with this part of the codebase, but why do we need to manually add the ctes in this case?

asolimando · 2024-05-14T15:55:18Z

ql/src/test/queries/clientpositive/cte_cbo_plan_json.q

+EXPLAIN FORMATTED CBO
+SELECT name FROM emps e WHERE salary > 50000
+UNION
+SELECT name FROM emps e WHERE salary > 50000;


Nit: missing newline at the end of the file

asf-ci-hive added the tests pending label May 14, 2024

asolimando approved these changes May 14, 2024

View reviewed changes

asf-ci-hive added tests unstable and removed tests pending labels May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HIVE-28259: Common table expression detection and rewrites using CBO #5249

HIVE-28259: Common table expression detection and rewrites using CBO #5249

zabetak commented May 14, 2024

sonarcloud bot commented May 14, 2024

asolimando left a comment

asolimando May 14, 2024

asolimando May 14, 2024

asolimando May 14, 2024

asolimando May 14, 2024

asolimando May 14, 2024

asolimando May 14, 2024

asolimando May 14, 2024

asolimando May 14, 2024

asolimando May 14, 2024

HIVE-28259: Common table expression detection and rewrites using CBO #5249

Are you sure you want to change the base?

HIVE-28259: Common table expression detection and rewrites using CBO #5249

Conversation

zabetak commented May 14, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

sonarcloud bot commented May 14, 2024

Quality Gate passed

asolimando left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment