
fix: Compute murmur3 hash with dictionary input correctly #433

Merged: 7 commits into apache:main, May 24, 2024

Conversation

Contributor

@advancedxy advancedxy commented May 15, 2024

Which issue does this PR close?

Closes #427

Rationale for this change

Bug fixes. When submitting #424, we found a bug in spark_hash: it doesn't handle dictionary arrays correctly. This PR fixes that first.

What changes are included in this PR?

  1. Refactors parts of spark_hash.rs in preparation for xxhash64 support
  2. Unpacks dictionary arrays when computing hashes
  3. Updates tests
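The idea behind change 2 can be sketched with a simplified model. This is not the actual Comet/arrow-rs code: `DictColumn` and `toy_hash` are illustrative stand-ins (the real code downcasts an arrow `DictionaryArray` and calls `spark_compatible_murmur3_hash`), but the bug shape is the same: hashing the dictionary's values buffer produces one hash per distinct value instead of one per row.

```rust
/// Simplified model of a dictionary-encoded column: `keys[i]` indexes into
/// `values`, so row i's logical value is `values[keys[i]]`.
struct DictColumn {
    keys: Vec<usize>,
    values: Vec<i32>,
}

/// Toy stand-in for spark_compatible_murmur3_hash (not the real algorithm).
fn toy_hash(bytes: &[u8], seed: u32) -> u32 {
    bytes
        .iter()
        .fold(seed, |h, &b| h.wrapping_mul(31).wrapping_add(b as u32))
}

/// Buggy shape: hashes the dictionary's values array directly, yielding one
/// hash per distinct value instead of one per row.
fn hash_values_only(col: &DictColumn, seed: u32) -> Vec<u32> {
    col.values
        .iter()
        .map(|v| toy_hash(&v.to_le_bytes(), seed))
        .collect()
}

/// Fixed shape: unpack the dictionary so each row hashes its logical value.
fn hash_per_row(col: &DictColumn, seed: u32) -> Vec<u32> {
    col.keys
        .iter()
        .map(|&k| toy_hash(&col.values[k].to_le_bytes(), seed))
        .collect()
}
```

With keys `[0, 1, 0, 1]` over values `[10, 20]`, the fixed version returns four hashes matching those of the decoded plain array `[10, 20, 10, 20]`, while the buggy version returns only two.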

This PR currently depends on #426 and will be rebased once that's merged.

How are these changes tested?

Updated test with randomized input.

@advancedxy advancedxy changed the title fix: Handle compute murmur3 hash with dictionary input correctly fix: Compute murmur3 hash with dictionary input correctly May 15, 2024
@andygrove andygrove requested a review from sunchao May 15, 2024 15:04
}
}
}
hash_array_boolean!(BooleanArray, col, i32, hashes_buffer);
Contributor

I am wondering why this is a macro. It looks like this is the only use case?

Contributor Author

Two reasons:

  1. It could be reused for xxhash64 too, which I am currently working on in feat: Add xxhash64 function support #424.
  2. Mainly a style issue: to be consistent with the other types in this function, which are all handled via macros.

Comment on lines +118 to +139
macro_rules! hash_array_boolean {
    ($array_type: ident, $column: ident, $hash_input_type: ident, $hashes: ident) => {
        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
        if array.null_count() == 0 {
            for (i, hash) in $hashes.iter_mut().enumerate() {
                *hash = spark_compatible_murmur3_hash(
                    $hash_input_type::from(array.value(i)).to_le_bytes(),
                    *hash,
                );
            }
        } else {
            for (i, hash) in $hashes.iter_mut().enumerate() {
                if !array.is_null(i) {
                    *hash = spark_compatible_murmur3_hash(
                        $hash_input_type::from(array.value(i)).to_le_bytes(),
                        *hash,
                    );
                }
            }
        }
    };
}
Member

This is pulled out as a macro because you will use a different hash function than spark_compatible_murmur3_hash later?

Contributor Author

Yeah, it could be used to support the xxhash64 function.
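One way this reuse could work is to pass the hash function into the macro as an extra argument, so murmur3 and xxhash64 share one loop body. The sketch below is hypothetical: `hash_column!`, `my_murmur3`, and `my_xxhash64` are toy stand-ins, not the real Comet implementations or real hash algorithms.

```rust
// Toy stand-in hash functions; NOT the real murmur3 or xxhash64 algorithms.
fn my_murmur3(bytes: &[u8], seed: u32) -> u32 {
    bytes
        .iter()
        .fold(seed ^ 0x9747_b28c, |h, &b| h.rotate_left(5) ^ b as u32)
}

fn my_xxhash64(bytes: &[u8], seed: u32) -> u32 {
    bytes.iter().fold(seed ^ 0x1656_67b1, |h, &b| {
        h.wrapping_mul(2654435761).wrapping_add(b as u32)
    })
}

// Hypothetical sketch of the reuse idea: the hash function is a macro
// parameter, so one body serves any hash with the same signature.
macro_rules! hash_column {
    ($values: expr, $hashes: expr, $hash_fn: path) => {
        for (i, hash) in $hashes.iter_mut().enumerate() {
            // Combine each value's little-endian bytes with the running hash.
            *hash = $hash_fn(&$values[i].to_le_bytes(), *hash);
        }
    };
}
```

Invoking `hash_column!(vals, hashes, my_murmur3)` or `hash_column!(vals, hashes, my_xxhash64)` then reuses the same null-handling and iteration logic with either function.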

core/src/execution/datafusion/spark_hash.rs (resolved review comments)
test("hash functions with random input") {
    val dataGen = DataGenerator.DEFAULT
    // sufficient number of rows to create dictionary encoded ArrowArray.
    val randomNumRows = 1000
Contributor Author

A note here:
I'm not 100 percent sure how we can trigger a dictionary array on the native side from Spark.

When the row count is small, such as 100 or 200, no dictionary array is involved on the native side, even though the Parquet file should be written with all columns dictionary encoded.

I tweaked it a bit and settled on 1000, which triggers a dictionary-encoded ArrowArray on the Rust side.

Contributor

Potentially we can add repeated values to force a dictionary, e.g. randomly generate 100 rows and repeat them 10 times to make 1000 rows.

Contributor Author

> E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows

So dictionary encoding is only triggered with enough repetition?

Contributor

Yes. makeParquetFileAllTypes or some existing dictionary-related tests may be helpful.

Contributor

The Parquet file writer will automatically generate a dictionary if the cardinality is low (i.e. there is a small number of unique values).

Comment on lines +1455 to +1458
|insert into $table values
|('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)
|, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)
|""".stripMargin)
Member

Did you insert extra space characters?

Contributor Author

Oops, this was by accident. Let me try again and revert it.

I did another check. The current version now has 4-space indentation, which should be correct. I think it was wrong in a previous commit and can be fixed in this PR.

@advancedxy
Contributor Author

@viirya @kazuyukitanimura @sunchao PTAL when you have time.

@advancedxy
Contributor Author

Gently ping @viirya @sunchao and @andygrove

Contributor
@kazuyukitanimura kazuyukitanimura left a comment


LGTM

Member
@andygrove andygrove left a comment


LGTM. Thank you @advancedxy

@andygrove andygrove merged commit 93af704 into apache:main May 24, 2024
40 checks passed

Successfully merging this pull request may close these issues.

bug: hash expression is not consistent with Spark
5 participants