-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add InclusiveMetricsEvaluator
#347
Add InclusiveMetricsEvaluator
#347
Conversation
3335586
to
058b8c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sdd this looks nice! I had just one comment/ question about the caching of the InclusiveMetricsEvaluator
and some nits abouts either names, or using "early returns" / match arms, instead of nested if/else constructs. But I'll guess those nits are more personal preferences - so I'm interested what others have to say here. Thanks again for your work here.
crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs
Outdated
Show resolved
Hide resolved
crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs
Outdated
Show resolved
Hide resolved
crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs
Outdated
Show resolved
Hide resolved
crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs
Outdated
Show resolved
Hide resolved
f39ebc5
to
dd419b3
Compare
@marvinlanhenke Thanks for taking the time for the great review. Have addressed your raised points. I'm adding tests based on https://github.com/apache/iceberg/blob/main/api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java and will submit them over the next day or so, once complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// see if this `DataFile` contains data that could match | ||
/// the scan's filter. | ||
pub(crate) fn eval( | ||
filter: &'a BoundPredicate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just an idea:
Can we accept &'a Option<BoundPredicate>
here?
This way we can move the check if we have a row_filter at all from scan.rs into the InclusiveMetricsEvaluator
itself and simply return ROWS_MIGHT_MATCH
if BoundPredicate is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like that. It results in less code but it doesn't feel right, semantically. I'll have a think to see if there's a more concise way to do this within TableScan
. Perhaps a shorter code path if there is no filter, with a longer path if there is one, but that might involve a bit of duplication. Alternatively we could set the filter predicate to AlwaysTrue
if none is supplied to the scan.
crates/iceberg/src/scan.rs
Outdated
@@ -218,6 +230,18 @@ impl TableScan { | |||
|
|||
let mut manifest_entries = iter(manifest.entries().iter().filter(|e| e.is_alive())); | |||
while let Some(manifest_entry) = manifest_entries.next().await { | |||
|
|||
if let Some(ref bound_predicate) = bound_predicate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move the check into InclusiveMetricsEvaluator
(see other comment), which returns ROWS_MIGHT_MATCH if bound_predicate is None. This way we could entangle the already involved plan_files method?
e3fa225
to
34ca88e
Compare
34ca88e
to
09ec3c4
Compare
InclusiveMetricsEvaluator
InclusiveMetricsEvaluator
1cf1d5e
to
4432826
Compare
FAO @Fokko @liurenjie1024 @marvinlanhenke: I've finished adding tests for this - it's ready for review, PTAL! 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sdd
Thank you so much for your work here. LGTM. I had just one question and some minor nits. LGTM!! And thanks for porting the test-suite - this is a lot of work.
if !include_empty_files && data_file.record_count == 0 { | ||
return ROWS_CANNOT_MATCH; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this extra check (for older versions) as well, or can we ignore this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko what's your opinion on this? I don't know enough about why this is here in the other implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
record_count
is a u64
, so it can't ever have a value of -1
, and so that check doesn't make sense to have in iceberg-rust as it is right now. Not unless we change record_count
to be an i64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case we may throw an error, since it's u64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I was touching the grass. I think it is fair to leave out the check and rely on u64
👍
datum: &Datum, | ||
_predicate: &BoundPredicate, | ||
) -> crate::Result<bool> { | ||
self.visit_inequality(reference, datum, PartialOrd::lt, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it as well, very elegant 👍 Out of curiosity, is there a rust-argument about why not passing in the bound directly:
self.visit_inequality(reference, datum, PartialOrd::lt, true) | |
self.visit_inequality(reference, datum, PartialOrd::lt, self.lower_bound(field_id)) |
crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs
Outdated
Show resolved
Hide resolved
crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs
Outdated
Show resolved
Hide resolved
crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs
Outdated
Show resolved
Hide resolved
5260ae6
to
1a2c925
Compare
1a2c925
to
072ee2a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @sdd for this great pr and all the tests!
if !include_empty_files && data_file.record_count == 0 { | ||
return ROWS_CANNOT_MATCH; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case we may throw an error, since it's u64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, thanks for working on this @sdd and porting all the tests. I did the same for PyIceberg and it is quite a bit of work 👍
let nan_count = self.nan_count(field_id); | ||
let value_count = self.value_count(field_id); | ||
|
||
nan_count.is_some() && nan_count == value_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also want to check if value_count
is not null?
datum: &Datum, | ||
_predicate: &BoundPredicate, | ||
) -> crate::Result<bool> { | ||
self.visit_inequality(reference, datum, PartialOrd::lt, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it as well, very elegant 👍 Out of curiosity, is there a rust-argument about why not passing in the bound directly:
self.visit_inequality(reference, datum, PartialOrd::lt, true) | |
self.visit_inequality(reference, datum, PartialOrd::lt, self.lower_bound(field_id)) |
InclusiveMetricsEvaluator
is used inside table scans to filterDataFile
entries within a Manifest, rejecting any of them if their metrics indicate that they cannot contain any rows that match the predicate filter.Comments / suggestions are welcome.
Closes #127