Add `InclusiveMetricsEvaluator` #347

sdd · 2024-04-23T19:10:59Z

InclusiveMetricsEvaluator is used inside table scans to filter DataFile entries within a Manifest, rejecting any of them if their metrics indicate that they cannot contain any rows that match the predicate filter.
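The pruning idea described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `FileMetrics` and `might_contain_less_than` are hypothetical names, values are reduced to `i64`, and the two constants are assumed to be plain `bool`s named after the ones used in the diff.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the metrics stored with a data file.
struct FileMetrics {
    record_count: u64,
    /// Lower bound per field id (assumption: all values are i64).
    lower_bounds: HashMap<i32, i64>,
}

// Assumed to be plain booleans for this sketch.
const ROWS_MIGHT_MATCH: bool = true;
const ROWS_CANNOT_MATCH: bool = false;

/// Inclusive evaluation of `column(field_id) < literal` against file metrics.
/// Returns ROWS_CANNOT_MATCH only when the metrics prove no row can match.
fn might_contain_less_than(metrics: &FileMetrics, field_id: i32, literal: i64) -> bool {
    if metrics.record_count == 0 {
        return ROWS_CANNOT_MATCH;
    }
    match metrics.lower_bounds.get(&field_id) {
        // Every value is >= lower, so nothing can be < literal.
        Some(lower) if *lower >= literal => ROWS_CANNOT_MATCH,
        // Missing or inconclusive metrics: keep the file.
        _ => ROWS_MIGHT_MATCH,
    }
}
```

The evaluation is "inclusive" in the sense that a file is kept whenever the metrics are missing or inconclusive; only a definite proof of no match rejects it.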

Comments / suggestions are welcome.

Closes #127

marvinlanhenke

Thanks @sdd, this looks nice! I had just one comment/question about the caching of the InclusiveMetricsEvaluator, and some nits about either names or using "early returns" / match arms instead of nested if/else constructs. But I guess those nits are more personal preferences, so I'm interested in what others have to say here. Thanks again for your work here.

crates/iceberg/src/scan.rs

crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs

sdd · 2024-04-29T06:52:50Z

@marvinlanhenke Thanks for taking the time for the great review. I have addressed the points you raised. I'm adding tests based on https://github.com/apache/iceberg/blob/main/api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java and will submit them over the next day or so, once complete.

marvinlanhenke

@sdd thanks LGTM.
Just had an idea regarding the bound_predicate check. Perhaps you can think about it.
Let's see what others have to say and take a look at #360 which probably introduces some merge conflicts.

marvinlanhenke · 2024-04-29T18:16:59Z

crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs

+ /// see if this `DataFile` contains data that could match
+ /// the scan's filter.
+ pub(crate) fn eval(
+ filter: &'a BoundPredicate,


just an idea:
Can we accept &'a Option<BoundPredicate> here?
This way we can move the check if we have a row_filter at all from scan.rs into the InclusiveMetricsEvaluator itself and simply return ROWS_MIGHT_MATCH if BoundPredicate is None?

I'm not sure I like that. It results in less code but it doesn't feel right, semantically. I'll have a think to see if there's a more concise way to do this within TableScan. Perhaps a shorter code path if there is no filter, with a longer path if there is one, but that might involve a bit of duplication. Alternatively we could set the filter predicate to AlwaysTrue if none is supplied to the scan.
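The two alternatives discussed here can be compared side by side in a small sketch. Everything in it is hypothetical: `Predicate` is a heavily reduced stand-in for `BoundPredicate`, and `eval`, `eval_optional`, and `normalize` are illustrative names rather than functions from this PR.

```rust
/// Hypothetical, heavily reduced stand-in for a bound predicate.
#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    AlwaysTrue,
    /// `column < literal` for a single implied column.
    LessThan(i64),
}

/// Stand-in evaluator: true means "rows might match".
fn eval(filter: &Predicate, lower_bound: Option<i64>) -> bool {
    match filter {
        Predicate::AlwaysTrue => true,
        Predicate::LessThan(literal) => match lower_bound {
            Some(lower) => lower < *literal,
            None => true, // no metrics: cannot rule the file out
        },
    }
}

/// Option A (reviewer's idea): the evaluator itself accepts an optional filter.
fn eval_optional(filter: &Option<Predicate>, lower_bound: Option<i64>) -> bool {
    match filter {
        None => true,
        Some(predicate) => eval(predicate, lower_bound),
    }
}

/// Option B (author's alternative): the scan replaces a missing filter with
/// AlwaysTrue, so the evaluator never sees an Option.
fn normalize(filter: Option<Predicate>) -> Predicate {
    filter.unwrap_or(Predicate::AlwaysTrue)
}
```

Both options give the same result for every input in this sketch; the difference is only where the "no filter" case is expressed, in the evaluator's signature or in the scan's setup code.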

marvinlanhenke · 2024-04-29T18:18:58Z

crates/iceberg/src/scan.rs

@@ -218,6 +230,18 @@ impl TableScan {

 let mut manifest_entries = iter(manifest.entries().iter().filter(|e| e.is_alive()));
 while let Some(manifest_entry) = manifest_entries.next().await {
+
+ if let Some(ref bound_predicate) = bound_predicate {


Move the check into InclusiveMetricsEvaluator (see other comment), which returns ROWS_MIGHT_MATCH if bound_predicate is None. This way we could untangle the already involved plan_files method?

sdd · 2024-05-04T19:40:20Z

FAO @Fokko @liurenjie1024 @marvinlanhenke:

I've finished adding tests for this - it's ready for review, PTAL! 😄

marvinlanhenke

@sdd
Thank you so much for your work here. LGTM. I had just one question and some minor nits. LGTM!! And thanks for porting the test-suite - this is a lot of work.

marvinlanhenke · 2024-05-05T09:06:47Z

crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs

+ if !include_empty_files && data_file.record_count == 0 {
+ return ROWS_CANNOT_MATCH;
+ }
+


Do we need this extra check (for older versions) as well, or can we ignore this?

@Fokko what's your opinion on this? I don't know enough about why this is here in the other implementations.

record_count is a u64, so it can't ever have a value of -1, and so that check doesn't make sense to have in iceberg-rust as it is right now. Not unless we change record_count to be an i64.

I think in this case we may throw an error, since it's u64.

Sorry for the delay, I was touching the grass. I think it is fair to leave out the check and rely on u64 👍
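The point about `u64` can be illustrated with a short sketch. The function names are hypothetical, and the conversion helper only assumes that a count might arrive as a signed integer from some other source; it is not taken from this PR.

```rust
use std::convert::TryFrom;
use std::num::TryFromIntError;

/// Hypothetical reduction of the empty-file check from the diff.
/// With an unsigned count there is no negative case left to handle.
fn eval_record_count(record_count: u64, include_empty_files: bool) -> bool {
    if !include_empty_files && record_count == 0 {
        return false; // rows cannot match
    }
    true // rows might match
}

/// If a count were read as a signed integer (an assumption for illustration),
/// a negative value such as -1 would fail at conversion time rather than
/// reaching the evaluator.
fn parse_record_count(raw: i64) -> Result<u64, TryFromIntError> {
    u64::try_from(raw)
}
```

Under that assumption, a negative count surfaces as an error where the value is converted, which matches the suggestion that this case would be an error rather than a branch in the evaluator.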

marvinlanhenke · 2024-05-05T09:15:55Z

crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs

+ datum: &Datum,
+ _predicate: &BoundPredicate,
+ ) -> crate::Result<bool> {
+ self.visit_inequality(reference, datum, PartialOrd::lt, true)


this is nice!

I like it as well, very elegant 👍 Out of curiosity, is there a Rust-specific reason for not passing in the bound directly?

Suggested change

- self.visit_inequality(reference, datum, PartialOrd::lt, true)
+ self.visit_inequality(reference, datum, PartialOrd::lt, self.lower_bound(field_id))
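The two call styles in this suggestion can be sketched next to each other. This is an illustration under assumptions, not the PR's code: `Bounds` is a hypothetical stand-in with `i64` values, and `visit_inequality_with_bound` is an invented name for the "pass the bound directly" variant. The thread itself does not settle which form is preferable.

```rust
/// Hypothetical stand-in for the per-column bounds of one data file.
struct Bounds {
    lower: Option<i64>,
    upper: Option<i64>,
}

impl Bounds {
    /// Flag style, as in the diff: a bool selects which bound to compare.
    fn visit_inequality(
        &self,
        literal: i64,
        cmp_fn: fn(&i64, &i64) -> bool,
        use_lower_bound: bool,
    ) -> bool {
        let bound = if use_lower_bound { self.lower } else { self.upper };
        Self::visit_inequality_with_bound(literal, cmp_fn, bound)
    }

    /// Bound style, as in the suggestion: the caller passes the bound itself.
    fn visit_inequality_with_bound(
        literal: i64,
        cmp_fn: fn(&i64, &i64) -> bool,
        bound: Option<i64>,
    ) -> bool {
        match bound {
            Some(b) => cmp_fn(&b, &literal), // true means rows might match
            None => true,                    // no bound recorded: keep the file
        }
    }

    /// `column < literal`: possible only if the lower bound is below the literal.
    fn less_than(&self, literal: i64) -> bool {
        self.visit_inequality(literal, PartialOrd::lt, true)
    }

    /// `column > literal`: possible only if the upper bound is above the literal.
    fn greater_than(&self, literal: i64) -> bool {
        self.visit_inequality(literal, PartialOrd::gt, false)
    }
}
```

In this sketch both styles are equivalent. Passing `PartialOrd::lt` or `PartialOrd::gt` as a function pointer is what lets a single helper serve all four inequality operators.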

liurenjie1024

LGTM, thanks @sdd for this great pr and all the tests!

liurenjie1024 · 2024-05-19T14:32:54Z

crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs

+ if !include_empty_files && data_file.record_count == 0 {
+ return ROWS_CANNOT_MATCH;
+ }
+


I think in this case we may throw an error, since it's u64.

Fokko

This is awesome, thanks for working on this @sdd and porting all the tests. I did the same for PyIceberg and it is quite a bit of work 👍

Fokko · 2024-05-23T07:27:41Z

crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs

+ let nan_count = self.nan_count(field_id);
+ let value_count = self.value_count(field_id);
+
+ nan_count.is_some() && nan_count == value_count


Do we also want to check if value_count is not null?
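On the question above, a small sketch shows how `Option` equality behaves. It assumes both counts are `Option<u64>` (the accessor return types are not visible in this excerpt), and both function names are hypothetical.

```rust
/// Expression from the diff, with both counts assumed to be Option<u64>.
fn contains_nans_only(nan_count: Option<u64>, value_count: Option<u64>) -> bool {
    nan_count.is_some() && nan_count == value_count
}

/// Equivalent form with the check on value_count written out explicitly.
fn contains_nans_only_explicit(nan_count: Option<u64>, value_count: Option<u64>) -> bool {
    match (nan_count, value_count) {
        (Some(nans), Some(values)) => nans == values,
        _ => false,
    }
}
```

Because `Some(n) == None` is always false in Rust, the original expression already returns false when `value_count` is missing, so under the stated assumption an extra null check would not change the result.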

Fokko · 2024-05-23T07:39:33Z

crates/iceberg/src/expr/visitors/inclusive_metrics_evaluator.rs

+ datum: &Datum,
+ _predicate: &BoundPredicate,
+ ) -> crate::Result<bool> {
+ self.visit_inequality(reference, datum, PartialOrd::lt, true)


I like it as well, very elegant 👍 Out of curiosity, is there a Rust-specific reason for not passing in the bound directly?

Suggested change

- self.visit_inequality(reference, datum, PartialOrd::lt, true)
+ self.visit_inequality(reference, datum, PartialOrd::lt, self.lower_bound(field_id))

Fokko mentioned this pull request Apr 24, 2024

Tracking issues of iceberg-rust v0.3.0 #348

Open

72 tasks

sdd force-pushed the add-inclusive-metrics-evaluator branch 10 times, most recently from 3335586 to 058b8c5 Compare April 26, 2024 08:38

marvinlanhenke reviewed Apr 28, 2024

sdd force-pushed the add-inclusive-metrics-evaluator branch 2 times, most recently from f39ebc5 to dd419b3 Compare April 29, 2024 06:50

marvinlanhenke reviewed Apr 29, 2024

sdd force-pushed the add-inclusive-metrics-evaluator branch 2 times, most recently from e3fa225 to 34ca88e Compare April 30, 2024 18:52

sdd mentioned this pull request May 3, 2024

feat: add ExpressionEvaluator #363

Open

sdd force-pushed the add-inclusive-metrics-evaluator branch from 34ca88e to 09ec3c4 Compare May 4, 2024 00:04

sdd changed the title ~~[WIP]: Add InclusiveMetricsEvaluator~~ Add InclusiveMetricsEvaluator May 4, 2024

sdd marked this pull request as ready for review May 4, 2024 08:22

feat: add InclusiveMetricsEvaluator

0105bbf

sdd force-pushed the add-inclusive-metrics-evaluator branch 3 times, most recently from 1cf1d5e to 4432826 Compare May 4, 2024 15:49

marvinlanhenke approved these changes May 5, 2024

sdd force-pushed the add-inclusive-metrics-evaluator branch 2 times, most recently from 5260ae6 to 1a2c925 Compare May 6, 2024 10:43

test: add more tests for InclusiveMetricsEvaluator

072ee2a

sdd force-pushed the add-inclusive-metrics-evaluator branch from 1a2c925 to 072ee2a Compare May 6, 2024 10:47

liurenjie1024 approved these changes May 19, 2024

liurenjie1024 merged commit e1c10b5 into apache:main May 19, 2024
6 checks passed

sdd mentioned this pull request May 21, 2024

feat: Implement data file metrics evaluator to prune data files using filter. #152

Closed

Fokko reviewed May 23, 2024
