You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I had searched in the issues and found no similar issues.
Description
In multi-table join scenario, result of a join will be used as the input of the subsequent joins.
During the interval between updating statistics, when we update data, statistics is not collected in time, If we run a multi-table join query within this time interval, the optimizer may choose a poor strategy because there is no correct statistics.
for example:
select t1.* from example_tbl t1
join example_tbl02 t2 on t1.city=t2.city and t1.city="chengdu"
join example_tbl03 t3 on t1.city=t3.city
join example_tbl04 t4 on t1.city=t4.city
join example_tbl05 t5 on t1.city=t5.city;
### Doris-2.0.4:
this is plan(2.0.4):
in this case, example_tbl02, example_tbl04, example_tbl05 each has 30 million rows data of city="chengdu"; example_tbl, example_tbl03 don't have any data of city="chengdu".
After doris triggered first statistics collection, we insert 2 rows data of city="chengdu" into example_tbl and example_tbl03 respectively, and run the query before the second time statistics collection, thus we obtained uncorrect statistics and choosed a poor plan in this query.
we can simplify this plan:
actual execution situation: braodcast-01: distribute 2 rows of data broadcast-02: distribute 2 rows of data broadcast-03: distribute 120 million (2 * 2 * 30million) rows of data broadcast-04: ditribute 3600 trillion(2 * 2 * 30 million * 30 million) rows of data
as you can see, broadcast-03 and broadcast-04 distribute huge data because of uncorrct statistics, which easy to cause OOM.
### Doris-master:
same query'plan(with same statistics data) in master branch code:
example_tbl02, example_tbl04, example_tbl05 each has 30 million rows data of city="chengdu";
example_tbl, example_tbl03 don't have any data of city="chengdu"
we can simplify this plan:
After doris triggered first statistics collection, we insert 2 rows data of city="chengdu" into example_tbl and example_tbl03 respectively, and run the query before the second time statistics collection, thus we obtained uncorrect statistics and choosed a poor plan in this query.
this is actual execution situation: braodcast-01: distribute 2 rows of data broadcast-02: distribute 2 rows of data broadcast-03: distribute 120 million (2 * 2 * 30million) rows of data broadcast-04: ditribute 3600 trillion(2 * 2 * 30 million * 30 million) rows of data
as you can see, broadcast-03 and broadcast-04 distribute huge data because of uncorrct statistics, which easy to cause OOM.
Solution
I solved this problem, this is plan:
we can simplify this plan:
as you can see, the type of data's distribution has changed from broadcast to shuffle hash, which avoid OOM
there is a mis-understanding of physical plan.
join
|---->A
+--->B
here A is the left child, and B is the right child. The join children should be switched in the figures.
Search before asking
Description
In multi-table join scenario, result of a join will be used as the input of the subsequent joins.
During the interval between updating statistics, when we update data, statistics is not collected in time, If we run a multi-table join query within this time interval, the optimizer may choose a poor strategy because there is no correct statistics.
for example:
select t1.* from example_tbl t1
join example_tbl02 t2 on t1.city=t2.city and t1.city="chengdu"
join example_tbl03 t3 on t1.city=t3.city
join example_tbl04 t4 on t1.city=t4.city
join example_tbl05 t5 on t1.city=t5.city;
### Doris-2.0.4:
this is plan(2.0.4):
in this case, example_tbl02, example_tbl04, example_tbl05 each has 30 million rows data of city="chengdu"; example_tbl, example_tbl03 don't have any data of city="chengdu".
After doris triggered first statistics collection, we insert 2 rows data of city="chengdu" into example_tbl and example_tbl03 respectively, and run the query before the second time statistics collection, thus we obtained uncorrect statistics and choosed a poor plan in this query.
we can simplify this plan:
actual execution situation:
braodcast-01: distribute 2 rows of data
broadcast-02: distribute 2 rows of data
broadcast-03: distribute 120 million (2 * 2 * 30million) rows of data
broadcast-04: ditribute 3600 trillion(2 * 2 * 30 million * 30 million) rows of data
as you can see, broadcast-03 and broadcast-04 distribute huge data because of uncorrct statistics, which easy to cause OOM.
### Doris-master:
same query'plan(with same statistics data) in master branch code:
example_tbl02, example_tbl04, example_tbl05 each has 30 million rows data of city="chengdu";
example_tbl, example_tbl03 don't have any data of city="chengdu"
we can simplify this plan:
After doris triggered first statistics collection, we insert 2 rows data of city="chengdu" into example_tbl and example_tbl03 respectively, and run the query before the second time statistics collection, thus we obtained uncorrect statistics and choosed a poor plan in this query.
this is actual execution situation:
braodcast-01: distribute 2 rows of data
broadcast-02: distribute 2 rows of data
broadcast-03: distribute 120 million (2 * 2 * 30million) rows of data
broadcast-04: ditribute 3600 trillion(2 * 2 * 30 million * 30 million) rows of data
as you can see, broadcast-03 and broadcast-04 distribute huge data because of uncorrct statistics, which easy to cause OOM.
Solution
I solved this problem, this is plan:
we can simplify this plan:
as you can see, the type of data's distribution has changed from broadcast to shuffle hash, which avoid OOM
Are you willing to submit PR?
[opt](nereids) modify cbo statistics in filter and scan #35238
Code of Conduct
The text was updated successfully, but these errors were encountered: