
fix: Compute murmur3 hash with dictionary input correctly #433

Merged: 7 commits into apache:main, May 24, 2024

Conversation

Contributor

@advancedxy advancedxy commented May 15, 2024

Which issue does this PR close?

Closes #427

Rationale for this change

Bug fixes. When submitting #424, we found a bug in spark_hash: it doesn't handle dictionary arrays correctly. This PR fixes that first.

What changes are included in this PR?

  1. Refactors parts of spark_hash.rs in preparation for xxhash64 support
  2. Unpacks dictionary arrays when computing hashes
  3. Updates tests
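The idea behind change 2 can be sketched with a simplified model. This is not the actual Comet/arrow-rs code: `DictColumn` and `toy_hash` are illustrative stand-ins (the real code downcasts an arrow `DictionaryArray` and calls `spark_compatible_murmur3_hash`), but the bug shape is the same: hashing the dictionary's values buffer produces one hash per distinct value instead of one per row.

```rust
/// Simplified model of a dictionary-encoded column: `keys[i]` indexes into
/// `values`, so row i's logical value is `values[keys[i]]`.
struct DictColumn {
    keys: Vec<usize>,
    values: Vec<i32>,
}

/// Toy stand-in for spark_compatible_murmur3_hash (not the real algorithm).
fn toy_hash(bytes: &[u8], seed: u32) -> u32 {
    bytes
        .iter()
        .fold(seed, |h, &b| h.wrapping_mul(31).wrapping_add(b as u32))
}

/// Buggy shape: hashes the dictionary's values array directly, yielding one
/// hash per distinct value instead of one per row.
fn hash_values_only(col: &DictColumn, seed: u32) -> Vec<u32> {
    col.values
        .iter()
        .map(|v| toy_hash(&v.to_le_bytes(), seed))
        .collect()
}

/// Fixed shape: unpack the dictionary so each row hashes its logical value.
fn hash_per_row(col: &DictColumn, seed: u32) -> Vec<u32> {
    col.keys
        .iter()
        .map(|&k| toy_hash(&col.values[k].to_le_bytes(), seed))
        .collect()
}
```

With keys `[0, 1, 0, 1]` over values `[10, 20]`, the fixed version returns four hashes matching those of the decoded plain array `[10, 20, 10, 20]`, while the buggy version returns only two.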

This PR currently depends on #426 and will be rebased once that's merged.

How are these changes tested?

Updated test with randomized input.

@advancedxy advancedxy changed the title fix: Handle compute murmur3 hash with dictionary input correctly fix: Compute murmur3 hash with dictionary input correctly May 15, 2024
@andygrove andygrove requested a review from sunchao May 15, 2024 15:04
}
}
}
hash_array_boolean!(BooleanArray, col, i32, hashes_buffer);
Contributor

I am wondering why this is a macro. It looks like this is the only use case?

Contributor Author

Two reasons:

  1. It could be reused for xxhash64 too, which I am currently working on in feat: Add xxhash64 function support #424.
  2. Mainly a style issue: to be consistent with the other types in this function, which are all handled via macros.

Comment on lines +118 to +139
macro_rules! hash_array_boolean {
    ($array_type: ident, $column: ident, $hash_input_type: ident, $hashes: ident) => {
        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
        if array.null_count() == 0 {
            for (i, hash) in $hashes.iter_mut().enumerate() {
                *hash = spark_compatible_murmur3_hash(
                    $hash_input_type::from(array.value(i)).to_le_bytes(),
                    *hash,
                );
            }
        } else {
            for (i, hash) in $hashes.iter_mut().enumerate() {
                if !array.is_null(i) {
                    *hash = spark_compatible_murmur3_hash(
                        $hash_input_type::from(array.value(i)).to_le_bytes(),
                        *hash,
                    );
                }
            }
        }
    };
}
Member

This is pulled out as a macro because you will use a different hash function than spark_compatible_murmur3_hash later?

Contributor Author

Yeah, it could be used to support the xxhash64 function.
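One way this reuse could work is to pass the hash function into the macro as an extra argument, so murmur3 and xxhash64 share one loop body. The sketch below is hypothetical: `hash_column!`, `my_murmur3`, and `my_xxhash64` are toy stand-ins, not the real Comet implementations or real hash algorithms.

```rust
// Toy stand-in hash functions; NOT the real murmur3 or xxhash64 algorithms.
fn my_murmur3(bytes: &[u8], seed: u32) -> u32 {
    bytes
        .iter()
        .fold(seed ^ 0x9747_b28c, |h, &b| h.rotate_left(5) ^ b as u32)
}

fn my_xxhash64(bytes: &[u8], seed: u32) -> u32 {
    bytes.iter().fold(seed ^ 0x1656_67b1, |h, &b| {
        h.wrapping_mul(2654435761).wrapping_add(b as u32)
    })
}

// Hypothetical sketch of the reuse idea: the hash function is a macro
// parameter, so one body serves any hash with the same signature.
macro_rules! hash_column {
    ($values: expr, $hashes: expr, $hash_fn: path) => {
        for (i, hash) in $hashes.iter_mut().enumerate() {
            // Combine each value's little-endian bytes with the running hash.
            *hash = $hash_fn(&$values[i].to_le_bytes(), *hash);
        }
    };
}
```

Invoking `hash_column!(vals, hashes, my_murmur3)` or `hash_column!(vals, hashes, my_xxhash64)` then reuses the same null-handling and iteration logic with either function.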

core/src/execution/datafusion/spark_hash.rs (resolved review comments)
test("hash functions with random input") {
    val dataGen = DataGenerator.DEFAULT
    // sufficient number of rows to create dictionary encoded ArrowArray.
    val randomNumRows = 1000
Contributor Author

A note here:
I'm not 100 percent sure how we can trigger a dictionary array on the native side from Spark.

When the row count is small, such as 100 or 200, no dictionary array is involved on the native side, even though the Parquet file should be written with all columns dictionary encoded.

I tweaked it a bit and settled on 1000, which triggers a dictionary-encoded ArrowArray on the Rust side.

Contributor

Potentially we can add repeated values to force a dictionary, e.g. randomly generate 100 rows and repeat them 10 times to make 1000 rows.

Contributor Author

> E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows

So dictionary encoding is only triggered with enough repetition?

Contributor

Yes. makeParquetFileAllTypes or some existing dictionary-related tests may be helpful.

Contributor

The Parquet file writer will automatically generate a dictionary if the cardinality is low (i.e. there is a small number of unique values).

Comment on lines +1455 to +1458
|insert into $table values
|('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)
|, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)
|""".stripMargin)
Member

Did you insert extra space characters?

Contributor Author

Oops, this was by accident. Let me try again and revert it.

I did another check. The current version now has 4-space indentation, which should be correct. I think it was wrong in a previous commit and can be fixed in this PR.

@advancedxy
Contributor Author

@viirya @kazuyukitanimura @sunchao PTAL when you have time.

@advancedxy
Contributor Author

Gently ping @viirya @sunchao and @andygrove

Contributor
@kazuyukitanimura kazuyukitanimura left a comment


LGTM

Member
@andygrove andygrove left a comment


LGTM. Thank you @advancedxy

@andygrove andygrove merged commit 93af704 into apache:main May 24, 2024
40 checks passed

Successfully merging this pull request may close these issues.

bug: hash expression is not consistent with Spark
5 participants