Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28259: Common table expression detection and rewrites using CBO #5249

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zabetak
Copy link
Contributor

@zabetak zabetak commented May 14, 2024

What changes were proposed in this pull request?

  1. Add applyCteRewriting phase in CalcitePlanner for detecting and using CTEs; ensure rewrite logic is consistent with existing hive.optimize.cte.materialize.* properties.
  2. Model CTEs as materialized views (MVs) and add utility method in HiveMaterializedViewUtils for mapping a CTE to a RelOptMaterialization.
  3. Refactor core MV rewrite logic in CalcitePlanner to use during CTE rewrite and exploit CTEs in a cost-based manner.
  4. Add HiveTableSpool operator to represent CTEs and handle them in the plan using new rules: TableScanToSpoolRule and RemoveUnusedCteRule.
  5. Add TableScanToSpoolRule, and RemoveUnusedCteRule to add/remove spools from the plan.
  6. Enhance/Enrich metadata handlers for handling the Spool operator.
  7. Add AggregatedColumns metadata (and respective handler and metadata query), for controlling if a CTE is a "full aggregate" at the CBO (RelNode) level to ensure consistent behavior with hive.optimize.cte.materialize.full.aggregate.only property.
  8. Add HiveSqlTypeUtil.containsSqlType for detecting and skipping the creation of CTEs with untyped nulls since they are not supported (HIVE-11217).
  9. Add hive.optimize.cte.suggester.class and CommonTableExpressionSuggester interface to provide pluggable CTE detection logic. Given that CTE detection logic can range from basic tree traversal algorithms to complex workload analysis frameworks this part needs to be configurable since there is no one-size-fits-all implementation. The configuration property also allows proprietary algorithms to be integrated in HiveServer2 by implementing the necessary APIs and adding the jars in the classpath.
  10. Add prototype implementation for CTE detection logic in CommonTableExpressionIdentitySuggester using CommonRelSubExprRegisterRule and CommonTableExpressionRegistry. Although the implementation is rather simple it can discover various interesting CTEs as demonstrated by the tests and can be indeed useful in a prod environment.
  11. Map spool(s) to WITH clauses during the RelNode to AST conversion (ASTConverter, ASTBuilder, PlanModifierForASTConv) to exploit existing CTE materialization feature (HIVE-11752).
  12. Modify (slighly) SemanticAnalyzer/CalcitePlanner to enable AST-based CTE materialization (getMetadata) post CBO run.

Why are the changes needed?

  • Open the road for cost-based CTE optimizations in Hive
  • Pluggable/Extensible CTE detection logic

Does this PR introduce any user-facing change?

By default no. After tuning the new/existing cte properties query plans may change affecting performance.

Is the change a dependency upgrade?

No

How was this patch tested?

Tests for:

  • CTE detection logic using the CommonTableExpressionPrintSuggester (TestTezTPCDS30TBPerfCliDriver)
  • demonstrating (end-to-end) the CTE feature (cte_cbo_rewrite_0.q)
  • verify coherence of CTE rewrite with hive.optimize.cte.materialize.full.aggregate.only (cte_mat_12.q)
  • spool JSON serialization (cte_cbo_plan_json.q)

1. Add `applyCteRewriting` phase in `CalcitePlanner` for detecting and using CTEs; ensure rewrite logic is consistent with existing `hive.optimize.cte.materialize.*` properties.
2. Model CTEs as materialized views (MVs) and add utility method in `HiveMaterializedViewUtils` for mapping a CTE to a `RelOptMaterialization`.
3. Refactor core MV rewrite logic in `CalcitePlanner` to use during CTE rewrite and exploit CTEs in a cost-based manner.
4. Add `HiveTableSpool` operator to represent CTEs and handle them in the plan using new rules: `TableScanToSpoolRule` and `RemoveUnusedCteRule`.
5. Add `TableScanToSpoolRule`, and `RemoveUnusedCteRule` to add/remove spools from the plan.
6. Enhance/Enrich metadata handlers for handling the Spool operator.
7. Add `AggregatedColumns` metadata (and respective handler and metadata query), for controlling if a CTE is a "full aggregate" at the CBO (RelNode) level to ensure consistent behavior with `hive.optimize.cte.materialize.full.aggregate.only` property.
8. Add `HiveSqlTypeUtil.containsSqlType` for detecting and skipping the creation of CTEs with untyped nulls since they are not supported (HIVE-11217).
9. Add `hive.optimize.cte.suggester.class` and CommonTableExpressionSuggester interface to provide pluggable CTE detection logic. Given that CTE detection logic can range from basic tree traversal algorithms to complex workload analysis frameworks this part needs to be configurable since there is no one-size-fits-all implementation. The configuration property also allows proprietary algorithms to be integrated in HiveServer2 by implementing the necessary APIs and adding the jars in the classpath.
10. Add prototype implementation for CTE detection logic in CommonTableExpressionIdentitySuggester using CommonRelSubExprRegisterRule and CommonTableExpressionRegistry. Although the implementation is rather simple it can discover various interesting CTEs as demonstrated by the tests and can be indeed useful in a prod environment.
11. Map spool(s) to `WITH` clauses during the RelNode to AST conversion (ASTConverter, ASTBuilder, PlanModifierForASTConv) to exploit existing CTE materialization feature (HIVE-11752).
12. Modify (slighly) `SemanticAnalyzer`/`CalcitePlanner` to enable AST-based CTE materialization (getMetadata) post CBO run.
13. Tests for:
* CTE detection logic using the `CommonTableExpressionPrintSuggester` (`TestTezTPCDS30TBPerfCliDriver`)
* demonstrating (end-to-end) the CTE feature (cte_cbo_rewrite_0.q)
* verify coherence of CTE rewrite with `hive.optimize.cte.materialize.full.aggregate.only` (cte_mat_12.q)
* spool JSON serialization (cte_cbo_plan_json.q)
Copy link

sonarcloud bot commented May 14, 2024

Quality Gate Passed Quality Gate passed

Issues
33 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Copy link
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pretty elegant approach cleanly implemented. I have left few minor comments but nothing blocking, hence the approval.

I have only skimmed few out files for the TPCDS test suite, I haven't checked them all.

@Override
public void onMatch(final RelOptRuleCall call) {
CommonTableExpressionRegistry r = call.getPlanner().getContext().unwrap(CommonTableExpressionRegistry.class);
if (r != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what cases it's expected to be null?

@@ -50,6 +50,9 @@ protected void perform(RelOptRuleCall call, Filter filter,
HiveTableScan tScan) {
// Original table
RelOptHiveTable hiveTable = (RelOptHiveTable) tScan.getTable();
if (hiveTable.getHiveTableMD().isTemporary()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a comment could help people down the line, it's clear in the PR that we are introducing temporary tables for CTEs but it might be confusing by looking at the rule in isolation

* Rule that removes unused CTEs from the plan by expanding the respective TableScans.
* <p>The rule assumes that all materializations registered in the planner refer to CTEs.</p>
*/
public class RemoveUnusedCteRule extends RelOptRule {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "Unused" it's a bit misleading as it's usually means used 0 times, I don't have a better name right away but it might be worth thinking about it

public class RemoveUnusedCteRule extends RelOptRule {

private final int referenceThreshold;
private final Map<List<String>, Long> tableOccurrences;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think Integer is big enough to reference CTEs within a single query plan.

private final int referenceThreshold;
private final Map<List<String>, Long> tableOccurrences;

public RemoveUnusedCteRule(Map<List<String>, Long> tableOccurrences, int referenceThreshold) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, if the occurrences need a Long, referenceThreshold needs long type too.

/**
* Metadata about whether a set of columns originates from aggregation functions.
*/
public interface AggregatedColumns extends Metadata {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: don't we have anything similar in Calcite already? If not, it's too Hive specific or we could file a Jira to add it?

.add(HiveParser.Identifier, hTbl.getHiveTableMD().getDbName())
.add(HiveParser.Identifier, hTbl.getHiveTableMD().getTableName());
ASTBuilder tableNameBuilder = ASTBuilder.construct(HiveParser.TOK_TABNAME, "TOK_TABNAME");
if (!hTbl.getHiveTableMD().isTemporary()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: as discussed elsewhere, maybe a comment could be useful

return c.convert();
ASTConverter c = new ASTConverter(root, 0, planMapper, new ArrayList<>());
ASTNode r = c.convert();
for (ASTNode cte : c.ctes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super familiar with this part of the codebase, but why do we need to manually add the ctes in this case?

EXPLAIN FORMATTED CBO
SELECT name FROM emps e WHERE salary > 50000
UNION
SELECT name FROM emps e WHERE salary > 50000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing newline at the end of the file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants