-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28259: Common table expression detection and rewrites using CBO #5249
base: master
Are you sure you want to change the base?
Conversation
1. Add `applyCteRewriting` phase in `CalcitePlanner` for detecting and using CTEs; ensure rewrite logic is consistent with existing `hive.optimize.cte.materialize.*` properties. 2. Model CTEs as materialized views (MVs) and add utility method in `HiveMaterializedViewUtils` for mapping a CTE to a `RelOptMaterialization`. 3. Refactor core MV rewrite logic in `CalcitePlanner` to use during CTE rewrite and exploit CTEs in a cost-based manner. 4. Add `HiveTableSpool` operator to represent CTEs and handle them in the plan using new rules: `TableScanToSpoolRule` and `RemoveUnusedCteRule`. 5. Add `TableScanToSpoolRule`, and `RemoveUnusedCteRule` to add/remove spools from the plan. 6. Enhance/Enrich metadata handlers for handling the Spool operator. 7. Add `AggregatedColumns` metadata (and respective handler and metadata query), for controlling if a CTE is a "full aggregate" at the CBO (RelNode) level to ensure consistent behavior with `hive.optimize.cte.materialize.full.aggregate.only` property. 8. Add `HiveSqlTypeUtil.containsSqlType` for detecting and skipping the creation of CTEs with untyped nulls since they are not supported (HIVE-11217). 9. Add `hive.optimize.cte.suggester.class` and CommonTableExpressionSuggester interface to provide pluggable CTE detection logic. Given that CTE detection logic can range from basic tree traversal algorithms to complex workload analysis frameworks this part needs to be configurable since there is no one-size-fits-all implementation. The configuration property also allows proprietary algorithms to be integrated in HiveServer2 by implementing the necessary APIs and adding the jars in the classpath. 10. Add prototype implementation for CTE detection logic in CommonTableExpressionIdentitySuggester using CommonRelSubExprRegisterRule and CommonTableExpressionRegistry. Although the implementation is rather simple it can discover various interesting CTEs as demonstrated by the tests and can be indeed useful in a prod environment. 11. Map spool(s) to `WITH` clauses during the RelNode to AST conversion (ASTConverter, ASTBuilder, PlanModifierForASTConv) to exploit existing CTE materialization feature (HIVE-11752). 12. Modify (slighly) `SemanticAnalyzer`/`CalcitePlanner` to enable AST-based CTE materialization (getMetadata) post CBO run. 13. Tests for: * CTE detection logic using the `CommonTableExpressionPrintSuggester` (`TestTezTPCDS30TBPerfCliDriver`) * demonstrating (end-to-end) the CTE feature (cte_cbo_rewrite_0.q) * verify coherence of CTE rewrite with `hive.optimize.cte.materialize.full.aggregate.only` (cte_mat_12.q) * spool JSON serialization (cte_cbo_plan_json.q)
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pretty elegant approach cleanly implemented. I have left few minor comments but nothing blocking, hence the approval.
I have only skimmed few out files for the TPCDS test suite, I haven't checked them all.
@Override | ||
public void onMatch(final RelOptRuleCall call) { | ||
CommonTableExpressionRegistry r = call.getPlanner().getContext().unwrap(CommonTableExpressionRegistry.class); | ||
if (r != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what cases it's expected to be null
?
@@ -50,6 +50,9 @@ protected void perform(RelOptRuleCall call, Filter filter, | |||
HiveTableScan tScan) { | |||
// Original table | |||
RelOptHiveTable hiveTable = (RelOptHiveTable) tScan.getTable(); | |||
if (hiveTable.getHiveTableMD().isTemporary()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a comment could help people down the line, it's clear in the PR that we are introducing temporary tables for CTEs but it might be confusing by looking at the rule in isolation
* Rule that removes unused CTEs from the plan by expanding the respective TableScans. | ||
* <p>The rule assumes that all materializations registered in the planner refer to CTEs.</p> | ||
*/ | ||
public class RemoveUnusedCteRule extends RelOptRule { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "Unused" it's a bit misleading as it's usually means used 0 times, I don't have a better name right away but it might be worth thinking about it
public class RemoveUnusedCteRule extends RelOptRule { | ||
|
||
private final int referenceThreshold; | ||
private final Map<List<String>, Long> tableOccurrences; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think Integer
is big enough to reference CTEs within a single query plan.
private final int referenceThreshold; | ||
private final Map<List<String>, Long> tableOccurrences; | ||
|
||
public RemoveUnusedCteRule(Map<List<String>, Long> tableOccurrences, int referenceThreshold) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, if the occurrences need a Long
, referenceThreshold
needs long
type too.
/** | ||
* Metadata about whether a set of columns originates from aggregation functions. | ||
*/ | ||
public interface AggregatedColumns extends Metadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: don't we have anything similar in Calcite already? If not, it's too Hive specific or we could file a Jira to add it?
.add(HiveParser.Identifier, hTbl.getHiveTableMD().getDbName()) | ||
.add(HiveParser.Identifier, hTbl.getHiveTableMD().getTableName()); | ||
ASTBuilder tableNameBuilder = ASTBuilder.construct(HiveParser.TOK_TABNAME, "TOK_TABNAME"); | ||
if (!hTbl.getHiveTableMD().isTemporary()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: as discussed elsewhere, maybe a comment could be useful
return c.convert(); | ||
ASTConverter c = new ASTConverter(root, 0, planMapper, new ArrayList<>()); | ||
ASTNode r = c.convert(); | ||
for (ASTNode cte : c.ctes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super familiar with this part of the codebase, but why do we need to manually add the ctes in this case?
EXPLAIN FORMATTED CBO | ||
SELECT name FROM emps e WHERE salary > 50000 | ||
UNION | ||
SELECT name FROM emps e WHERE salary > 50000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: missing newline at the end of the file
What changes were proposed in this pull request?
applyCteRewriting
phase inCalcitePlanner
for detecting and using CTEs; ensure rewrite logic is consistent with existinghive.optimize.cte.materialize.*
properties.HiveMaterializedViewUtils
for mapping a CTE to aRelOptMaterialization
.CalcitePlanner
to use during CTE rewrite and exploit CTEs in a cost-based manner.HiveTableSpool
operator to represent CTEs and handle them in the plan using new rules:TableScanToSpoolRule
andRemoveUnusedCteRule
.TableScanToSpoolRule
, andRemoveUnusedCteRule
to add/remove spools from the plan.AggregatedColumns
metadata (and respective handler and metadata query), for controlling if a CTE is a "full aggregate" at the CBO (RelNode) level to ensure consistent behavior withhive.optimize.cte.materialize.full.aggregate.only
property.HiveSqlTypeUtil.containsSqlType
for detecting and skipping the creation of CTEs with untyped nulls since they are not supported (HIVE-11217).hive.optimize.cte.suggester.class
and CommonTableExpressionSuggester interface to provide pluggable CTE detection logic. Given that CTE detection logic can range from basic tree traversal algorithms to complex workload analysis frameworks this part needs to be configurable since there is no one-size-fits-all implementation. The configuration property also allows proprietary algorithms to be integrated in HiveServer2 by implementing the necessary APIs and adding the jars in the classpath.WITH
clauses during the RelNode to AST conversion (ASTConverter, ASTBuilder, PlanModifierForASTConv) to exploit existing CTE materialization feature (HIVE-11752).SemanticAnalyzer
/CalcitePlanner
to enable AST-based CTE materialization (getMetadata) post CBO run.Why are the changes needed?
Does this PR introduce any user-facing change?
By default no. After tuning the new/existing cte properties query plans may change affecting performance.
Is the change a dependency upgrade?
No
How was this patch tested?
Tests for:
CommonTableExpressionPrintSuggester
(TestTezTPCDS30TBPerfCliDriver
)hive.optimize.cte.materialize.full.aggregate.only
(cte_mat_12.q)