Filter IN pushdown to different decoders #1525

Open: YiqinXiong wants to merge 9 commits into base: master

@YiqinXiong YiqinXiong commented Aug 22, 2023

Task Description

ref #1491

Solution Description

Add/Update white filter 'in' functions for RAW/DICT/RLE/CONST/INTEGER_BASE_DIFF decoders.

Basic process:

  • RAW:
    1. batch decode
    2. traverse all rows, apply filter and set result_bitmap
  • DICT:
    1. apply filter in dict, set ref_bitset
    2. traverse all refs, set result_bitmap by using ref_bitset
  • RLE:
    1. apply filter in dict, set ref_bitset
    2. traverse all refs, set result_bitmap by using ref_bitset
  • CONST:
    1. if the const value is in the filter params (i.e. the right-hand arguments of the 'in' expr), flip result_bitmap
    2. apply filter in dict and set ref_bitset, taking into account whether the const value is in the params
    3. traverse all refs, set result_bitmap by using ref_bitset
  • INTEGER_BASE_DIFF:
    1. batch decode
    2. if max(params) < base_value , set result_bitmap all false
    3. else traverse all rows, apply filter and set result_bitmap
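
The DICT and RLE steps above can be sketched as follows. This is a minimal illustration, not the PR's code: `DictColumn`, `filter_in_dict`, and the use of `std::unordered_set` and `std::vector<bool>` are assumptions made to keep the example self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded column: `dict` holds the
// distinct values and `refs[row]` is the dictionary index stored for that row.
struct DictColumn {
  std::vector<int64_t> dict;
  std::vector<uint32_t> refs;
};

// Evaluate `col IN (params)` in two phases.
std::vector<bool> filter_in_dict(const DictColumn &col,
                                 const std::vector<int64_t> &params) {
  const std::unordered_set<int64_t> param_set(params.begin(), params.end());

  // Phase 1: apply the filter once per dictionary entry and record the hits.
  std::vector<bool> ref_bitset(col.dict.size(), false);
  for (std::size_t i = 0; i < col.dict.size(); ++i) {
    ref_bitset[i] = param_set.count(col.dict[i]) > 0;
  }

  // Phase 2: each row only needs a bitset lookup through its ref.
  std::vector<bool> result_bitmap(col.refs.size(), false);
  for (std::size_t row = 0; row < col.refs.size(); ++row) {
    result_bitmap[row] = ref_bitset[col.refs[row]];
  }
  return result_bitmap;
}
```

In this sketch the IN comparison runs once per dictionary entry instead of once per row, so the per-row cost is a single bit lookup.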

Passed Regressions

Unittest

Passed related unittests:

  1. unittest/storage/blocksstable/encoding/test_general_column_decoder
  2. unittest/storage/blocksstable/encoding/test_raw_decoder
  3. unittest/storage/blocksstable/encoding/test_const_decoder

Mysql test

Part 1. Load data

Number of data:
1,000,000 rows.

Data types:

Five tables were created for testing based on five different encoding types.

The data types included in each table are shown in the table below:

               bigint  int  char  varchar  decimal  ubigint  uint
RAW            Y       Y    Y     Y        Y        Y        Y
DICT           Y       Y    Y     Y        Y        Y        Y
RLE            Y       Y    Y     Y        Y        Y        Y
CONST          Y       Y    Y     Y        Y        Y        Y
INT_BASE_DIFF  Y       Y    -     -        -        Y        Y

Data organization format:

The data for each encoding is generated by a Python script using pseudorandom numbers with a different seed per encoding. A manual major freeze is then triggered so that the columns get encoded.

In order to form the corresponding encoding, the data is arranged according to the following rules:

RAW: Completely random, values within each column are mostly different.

A D R X Y Q ... L T H

DICT: Values from the dictionary appear in random order, e.g. when the dictionary contains values {A, B, C}:

B A A C B A ... C A B

RLE: Values from the dictionary appear consecutively, in order, e.g. when the dictionary contains values {A, B, C}:

A A A A B B ... B C C

CONST: The default value C appears with 99.9% probability, and the dictionary values {A, B} appear with 0.1% probability.

C C C A C C ... C B C
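
With data shaped like this, the CONST steps from the solution description only need to visit the few rows that differ from the const value. The sketch below illustrates that idea under an assumed layout: `ConstColumn`, `ExceptionRow`, and `filter_in_const` are hypothetical names, and the real decoder's storage format may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical exception entry: a row that does not hold the const value,
// plus the dictionary index of the value it holds instead.
struct ExceptionRow {
  std::size_t row;
  uint32_t ref;
};

// Hypothetical stand-in for a const-encoded column.
struct ConstColumn {
  int64_t const_value;
  std::size_t row_count;
  std::vector<int64_t> dict;             // distinct exception values
  std::vector<ExceptionRow> exceptions;  // rows that differ from const_value
};

std::vector<bool> filter_in_const(const ConstColumn &col,
                                  const std::vector<int64_t> &params) {
  const std::unordered_set<int64_t> param_set(params.begin(), params.end());

  // Step 1: if the const value is in the params, every row starts as a hit.
  const bool const_hit = param_set.count(col.const_value) > 0;
  std::vector<bool> result_bitmap(col.row_count, const_hit);

  // Step 2: mark dictionary entries whose outcome differs from the const rows.
  std::vector<bool> ref_bitset(col.dict.size(), false);
  for (std::size_t i = 0; i < col.dict.size(); ++i) {
    ref_bitset[i] = (param_set.count(col.dict[i]) > 0) != const_hit;
  }

  // Step 3: visit only the exception rows and flip those that are marked.
  for (const ExceptionRow &e : col.exceptions) {
    if (ref_bitset[e.ref]) {
      result_bitmap[e.row] = !const_hit;
    }
  }
  return result_bitmap;
}
```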

INTEGER_BASE_DIFF: Random data appears within a certain range, and the minimum value within the data is taken as the base value B.

B+v1 B+v2 B B+v3 B+v4 B+v5 ... B+vn-2 B+vn-1 B+vn
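
The range check from the INTEGER_BASE_DIFF steps in the solution description can be sketched against this layout. `BaseDiffColumn` and `filter_in_base_diff` are hypothetical names, and sorting the params inside the function is a simplification for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a base-diff encoded column: every stored value is
// base + diffs[row], and base is the minimum value in the column.
struct BaseDiffColumn {
  int64_t base;
  std::vector<uint32_t> diffs;
};

std::vector<bool> filter_in_base_diff(const BaseDiffColumn &col,
                                      std::vector<int64_t> params) {
  std::vector<bool> result_bitmap(col.diffs.size(), false);
  if (params.empty()) {
    return result_bitmap;
  }
  std::sort(params.begin(), params.end());

  // Early exit: if even the largest param is below the base value, no row can
  // match, so the bitmap stays all false.
  if (params.back() < col.base) {
    return result_bitmap;
  }

  // Otherwise decode each row and test membership in the sorted params.
  for (std::size_t row = 0; row < col.diffs.size(); ++row) {
    const int64_t value = col.base + static_cast<int64_t>(col.diffs[row]);
    result_bitmap[row] = std::binary_search(params.begin(), params.end(), value);
  }
  return result_bitmap;
}
```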

Part 2. Run

Pattern of SQL statements used for testing:

select count(col_name) from table_name where col_name in (params);

In the code above, col_name is the column name, table_name is the table name, and params is the parameter list for IN, containing params.size() parameters that need to be matched.

Variables in the test:

  1. Scenarios:

T for values in column (RAW / INTEGER_BASE_DIFF)
F for values not in column
D for values in dict table (DICT / RLE / CONST)
C for const value (CONST)
n for number of IN params

  • RAW / INTEGER_BASE_DIFF:
    • NO: IN (F * n)
    • PART: IN (T * 0.5n, F * 0.5n)
  • DICT / RLE:
    • NO: IN (F * n)
    • PART: IN (D * k, F * (n-k)), where k=min(n/2, len(dict)/2)
  • CONST:
    • NO: IN (F * n)
    • PART: IN (D * k, F * (n-k)), where k=min(n/2, len(dict)/2)
    • PART_CONST_IN: IN (C, F * (n-1))
  2. Number of IN parameters:
    4 / 40 / 400
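
As an illustration of how a PART parameter list and the test statement fit together, here is a small sketch. `make_part_params` and `make_query` are hypothetical helpers (the PR's data was generated by a Python script that is not shown), and integer division in k = min(n/2, len(dict)/2) is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Build the IN list for the PART scenario: k values taken from the dictionary,
// followed by n - k values that do not occur in the column.
std::vector<std::string> make_part_params(const std::vector<std::string> &dict_values,
                                          const std::vector<std::string> &absent_values,
                                          std::size_t n) {
  const std::size_t k = std::min(n / 2, dict_values.size() / 2);
  std::vector<std::string> params(dict_values.begin(),
                                  dict_values.begin() + static_cast<std::ptrdiff_t>(k));
  for (std::size_t i = 0; params.size() < n && i < absent_values.size(); ++i) {
    params.push_back(absent_values[i]);
  }
  return params;
}

// Render the statement following the pattern used in the test.
std::string make_query(const std::string &table_name, const std::string &col_name,
                       const std::vector<std::string> &params) {
  std::string sql = "select count(" + col_name + ") from " + table_name +
                    " where " + col_name + " in (";
  for (std::size_t i = 0; i < params.size(); ++i) {
    if (i > 0) {
      sql += ", ";
    }
    sql += params[i];
  }
  sql += ");";
  return sql;
}
```

For example, with a 6-entry dictionary, n = 4 gives k = min(2, 3) = 2 and n = 40 gives k = min(20, 3) = 3.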

Part 3. Result

Meaning of values:
The values in the tables below are the execution time of each SQL query on the master branch divided by the execution time of the same query on the issue branch (i.e. the speedup ratio of the issue branch over the master branch).

Performance test results:

Note: The following results were obtained with hot queries that fully utilize the cache, and are averaged over 3 repetitions, with the first-round result discarded.

RAW:

Scenario  num of IN (*)  bigint  int     char(20)  varchar(40)  decimal(11,3)  ubigint  uint
NO        4              1.088   1.094   1.032     1.815        1.193          1.091    1.108
PART      4              1.080   1.076   1.016     1.424        1.192          1.117    1.117
NO        40             1.044   1.018   1.003     1.002        1.141          1.063    1.065
PART      40             1.032   1.009   1.004     1.002        1.132          1.085    1.085
NO        400            1.082   1.059   1.010     1.005        1.217          1.090    1.104
PART      400            1.057   1.069   1.001     1.019        1.198          1.102    1.104

DICT:

Scenario  num of IN (*)  bigint  int     char(20)  varchar(40)  decimal(11,3)  ubigint  uint
NO        4              47.600  53.400  119.545   167.273      50.545         45.100   49.200
PART      4              4.560   4.729   13.894    19.364       5.325          4.185    4.603
NO        40             51.600  57.500  129.600   165.917      56.091         49.400   54.000
PART      40             3.988   3.747   12.847    18.720       4.412          3.863    3.852
NO        400            48.100  54.500  111.909   117.313      53.091         49.700   53.100
PART      400            3.868   3.656   12.530    18.133       4.558          4.013    3.779

RLE:

Scenario  num of IN (*)  bigint  int     char(20)  varchar(40)  decimal(11,3)  ubigint  uint
NO        4              45.200  50.700  112.182   126.286      53.500         44.900   49.000
PART      4              10.894  9.814   32.909    45.240       12.000         10.646   9.839
NO        40             50.900  53.700  112.182   158.083      61.000         49.600   53.900
PART      40             7.695   6.116   24.667    32.989       8.239          7.512    6.299
NO        400            47.700  51.900  101.667   140.769      50.273         49.200   52.100
PART      400            7.226   6.035   24.053    32.523       8.261          7.506    6.490

CONST:

Scenario       num of IN (*)  bigint  int     char(20)  varchar(40)  decimal(11,3)  ubigint  uint
NO             4              44.400  49.300  149.100   245.100      51.500         44.800   48.600
PART           4              23.789  26.474  56.700    132.905      27.474         23.947   25.947
PART_CONST_IN  4              4.936   4.136   19.353    39.299       5.490          4.964    4.240
NO             40             53.900  49.000  102.273   234.833      53.800         44.500   48.600
PART           40             24.955  23.905  49.348    113.040      24.227         20.773   23.619
PART_CONST_IN  40             5.465   4.137   19.697    41.162       5.644          4.979    4.208
NO             400            44.400  49.400  139.000   203.750      52.500         52.700   57.600
PART           400            20.455  22.818  61.040    99.160       24.545         24.273   26.545
PART_CONST_IN  400            4.971   4.052   22.059    38.236       5.732          5.554    4.663

INTEGER_BASE_DIFF:

Scenario  num of IN (*)  bigint  int     ubigint  uint
NO        4              37.833  37.333  38.167   37.750
PART      4              1.160   1.195   1.139    1.177
NO        40             44.500  49.273  45.083   49.000
PART      40             1.088   1.074   1.091    1.104
NO        400            24.238  42.417  41.750   46.364
PART      400            1.130   1.112   1.130    1.138

Other Information

Result csv files:
result-issue.csv
result-master.csv
mysql_test.zip

@YiqinXiong YiqinXiong force-pushed the issue_1491 branch 3 times, most recently from 8c26e69 to 939b84a Compare August 28, 2023 09:06
@YiqinXiong YiqinXiong marked this pull request as ready for review August 28, 2023 09:07
@@ -1015,6 +1077,8 @@ void ObWhiteFilterExecutor::check_null_params()
int ObWhiteFilterExecutor::init_obj_set()
{
  int ret = OB_SUCCESS;
  obj_array_sorted_ = false;

Contributor:

This member is not needed; just make sure the array is sorted.

Author:

> This member is not needed; just make sure the array is sorted.

Got it, fixed in commit 63e156e.

@@ -843,14 +848,67 @@ int ObRawDecoder::in_operator(
      || NULL == row_index)) {
    ret = OB_INVALID_ARGUMENT;
    LOG_WARN("Pushdown in operator: Invalid arguments", K(ret), K(filter.get_objs()));
  } else if (OB_LIKELY(can_vectorized()) && OB_LIKELY(is_inited())) {

Contributor:

This check isn't needed, is it?

Author:

> This check isn't needed, is it?

Got it, fixed in commit a2f7326.

  int64_t *row_ids_;
  common::ObIAllocator *allocator_;
  // for white filter IN batch_decode
  common::ObDatum *white_batch_datums_;

Contributor:

This can be obtained from the expression; there is no need to allocate it yourself.

Author:

Got it, fixed in commit d3522a7.

int ObWhiteFilterExecutor::eval_right_val_to_objs()
{
  int ret = OB_SUCCESS;
  const ObExpr &expr = *(filter_.expr_);

Contributor:

Check for null first, as a defensive measure.

Author:

Got it, fixed in commit 2284e8c.


struct ObWhiteFilterParamsCmpFunc
{
  OB_INLINE bool operator()(const common::ObObj &obj1, const common::ObObj &obj2) const {

Reviewer:

The obj compare function may return ERROR.

Author:

Got it, fixed in commit ecd9691.

  // 2. make params sorted
  if (OB_FAIL(ret)) {
  } else {
    std::sort(params_.begin(), params_.end(), ObWhiteFilterParamsCmpFunc());

Reviewer:

A cmp function generally needs to take ret in.
If an error occurs during the sort, the error code must be propagated out afterwards.

Author:

Got it, fixed in commit ecd9691.
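
One way to make a comparison failure visible to the caller is sketched below. It is only an illustration of the concern raised in this thread, not the change made in commit ecd9691: `Value`, `compare_values`, `sort_params`, and the integer error codes are hypothetical, and a hand-written insertion sort is used so that sorting can stop at the first failed comparison.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical error codes and value type, for illustration only.
constexpr int kSuccess = 0;
constexpr int kErrIncomparable = -1;

struct Value {
  int type;     // values of different types cannot be compared in this sketch
  long long v;
};

// Fallible three-way comparison: writes -1, 0 or 1 to `cmp` on success and
// returns an error code otherwise.
int compare_values(const Value &a, const Value &b, int &cmp) {
  if (a.type != b.type) {
    return kErrIncomparable;
  }
  cmp = (a.v < b.v) ? -1 : (a.v > b.v ? 1 : 0);
  return kSuccess;
}

// Insertion sort that stops at the first comparison error and returns it,
// so the caller never treats a partially sorted array as sorted.
int sort_params(std::vector<Value> &params) {
  int ret = kSuccess;
  for (std::size_t i = 1; ret == kSuccess && i < params.size(); ++i) {
    for (std::size_t j = i; ret == kSuccess && j > 0; --j) {
      int cmp = 0;
      ret = compare_values(params[j - 1], params[j], cmp);
      if (ret != kSuccess || cmp <= 0) {
        break;  // error, or the element has reached its position
      }
      std::swap(params[j - 1], params[j]);
    }
  }
  return ret;
}
```

Insertion sort is quadratic, which is acceptable for a sketch with at most a few hundred IN parameters; a production version would likely use a faster algorithm with the same error-returning contract.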

    // obj > max_obj || obj < min_obj
    is_exist = false;
  } else {
    is_exist = std::binary_search(params_.begin(), params_.end(), obj, ObWhiteFilterParamsCmpFunc());

Reviewer:

Same as above. Avoid the standard library where possible.

Author:

Got it. The problem of the error code not being propagated is fixed in commit ecd9691.

Author:

If we don't use the standard library functions here, does OB have a better implementation of its own, or should I write one by hand? I used these mainly because I saw std::sort and std::binary_search being used in other places as well 😂
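
For reference, a hand-written lookup that keeps the range pre-check from the snippet in this thread and carries the error code out could look roughly like this. All names (`Value`, `compare_values`, `exist_in_sorted_params`) and the integer error codes are hypothetical; whether OceanBase already provides an equivalent helper is not stated in this thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical error codes and value type, for illustration only.
constexpr int kSuccess = 0;
constexpr int kErrIncomparable = -1;

struct Value {
  int type;     // values of different types cannot be compared in this sketch
  long long v;
};

// Fallible three-way comparison: writes -1, 0 or 1 to `cmp` on success.
int compare_values(const Value &a, const Value &b, int &cmp) {
  if (a.type != b.type) {
    return kErrIncomparable;
  }
  cmp = (a.v < b.v) ? -1 : (a.v > b.v ? 1 : 0);
  return kSuccess;
}

// Membership test on an ascending array. Returns an error code and reports
// the answer through `is_exist`.
int exist_in_sorted_params(const std::vector<Value> &sorted, const Value &target,
                           bool &is_exist) {
  int ret = kSuccess;
  int cmp = 0;
  is_exist = false;
  if (sorted.empty()) {
    return ret;
  }
  // Range pre-check: target < min or target > max means "not present".
  ret = compare_values(target, sorted.front(), cmp);
  if (ret != kSuccess || cmp < 0) {
    return ret;
  }
  ret = compare_values(target, sorted.back(), cmp);
  if (ret != kSuccess || cmp > 0) {
    return ret;
  }
  // Binary search over the half-open range [lo, hi).
  std::size_t lo = 0;
  std::size_t hi = sorted.size();
  while (lo < hi) {
    const std::size_t mid = lo + (hi - lo) / 2;
    ret = compare_values(sorted[mid], target, cmp);
    if (ret != kSuccess) {
      break;
    }
    if (cmp == 0) {
      is_exist = true;
      break;
    }
    if (cmp < 0) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return ret;
}
```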

    } else if (OB_FAIL(cur_arg->eval(ctx, right))) {
      LOG_WARN("failed to eval right datum", K(ret));
    } else if (!null_param_contained_ && right->is_null()) {
      null_param_contained_ = true;

Reviewer:

For in(xxx, null), can't the null simply be ignored? Is there really a need to flag the null param here?

Author:

Got it, fixed in commit 2284e8c.
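
The reasoning behind ignoring the null can be checked with a small model of SQL three-valued logic. This is a sketch with hypothetical names (`Tri`, `SqlInt`, `sql_in`, `where_keeps`) and requires C++17 for `std::optional`; it models standard SQL semantics, not OceanBase's implementation.

```cpp
#include <optional>
#include <vector>

// Three-valued logic: true, false, or unknown (std::nullopt).
using Tri = std::optional<bool>;
// std::nullopt models SQL NULL.
using SqlInt = std::optional<long long>;

// Evaluate `x IN (params)`: TRUE on any match, otherwise UNKNOWN if x is NULL
// or a NULL param was seen, otherwise FALSE.
Tri sql_in(const SqlInt &x, const std::vector<SqlInt> &params) {
  if (!x) {
    return std::nullopt;
  }
  bool saw_null = false;
  for (const SqlInt &p : params) {
    if (!p) {
      saw_null = true;
    } else if (*p == *x) {
      return true;
    }
  }
  return saw_null ? Tri{} : Tri{false};
}

// A WHERE clause keeps a row only when the predicate is TRUE, so UNKNOWN and
// FALSE are filtered out alike.
bool where_keeps(const Tri &t) {
  return t.has_value() && *t;
}
```

Because `where_keeps` treats UNKNOWN and FALSE the same way, dropping a null from the parameter list does not change which rows pass a plain IN filter.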

    check_null_params();
    if (WHITE_OP_IN == filter_.get_op_type() && OB_FAIL(init_obj_set())) {
      LOG_WARN("Failed to init Object hash set in filter node", K(ret));
int ObWhiteFilterExecutor::eval_in_right_val_to_objs()

Reviewer:

This duplicates a lot of code from eval_right_val_to_objs.

Author:

Got it, fixed in commit 2284e8c.

  ObPushdownWhiteFilterNode &filter_;
  ObDatum* batch_decode_datums_;

Contributor:

No need to add this member; just call get_datums_from_column directly.

Author:

Got it, fixed in commit 94ac9b4.

3 participants