Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add InclusiveMetricsEvaluator #347

Merged
merged 2 commits into from
May 19, 2024

Conversation

sdd
Copy link
Contributor

@sdd sdd commented Apr 23, 2024

InclusiveMetricsEvaluator is used inside table scans to filter DataFile entries within a Manifest, rejecting any of them if their metrics indicate that they cannot contain any rows that match the predicate filter.

Comments / suggestions are welcome.

Closes #127

@sdd sdd force-pushed the add-inclusive-metrics-evaluator branch 10 times, most recently from 3335586 to 058b8c5 Compare April 26, 2024 08:38
Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sdd this looks nice! I had just one comment/ question about the caching of the InclusiveMetricsEvaluator and some nits abouts either names, or using "early returns" / match arms, instead of nested if/else constructs. But I'll guess those nits are more personal preferences - so I'm interested what others have to say here. Thanks again for your work here.

crates/iceberg/src/scan.rs Outdated Show resolved Hide resolved
crates/iceberg/src/scan.rs Outdated Show resolved Hide resolved
@sdd sdd force-pushed the add-inclusive-metrics-evaluator branch 2 times, most recently from f39ebc5 to dd419b3 Compare April 29, 2024 06:50
@sdd
Copy link
Contributor Author

sdd commented Apr 29, 2024

@marvinlanhenke Thanks for taking the time for the great review. Have addressed your raised points. I'm adding tests based on https://github.com/apache/iceberg/blob/main/api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java and will submit them over the next day or so, once complete.

Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdd thanks LGTM.
Just had an idea regarding the bound_predicate check. Perhaps you can think about it.
Let's see what others have to say and take a look at #360 which probably introduces some merge conflicts.

/// see if this `DataFile` contains data that could match
/// the scan's filter.
pub(crate) fn eval(
filter: &'a BoundPredicate,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just an idea:
Can we accept &'a Option<BoundPredicate> here?
This way we can move the check if we have a row_filter at all from scan.rs into the InclusiveMetricsEvaluator itself and simply return ROWS_MIGHT_MATCH if BoundPredicate is None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like that. It results in less code but it doesn't feel right, semantically. I'll have a think to see if there's a more concise way to do this within TableScan. Perhaps a shorter code path if there is no filter, with a longer path if there is one, but that might involve a bit of duplication. Alternatively we could set the filter predicate to AlwaysTrue if none is supplied to the scan.

@@ -218,6 +230,18 @@ impl TableScan {

let mut manifest_entries = iter(manifest.entries().iter().filter(|e| e.is_alive()));
while let Some(manifest_entry) = manifest_entries.next().await {

if let Some(ref bound_predicate) = bound_predicate {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the check into InclusiveMetricsEvaluator (see other comment), which returns ROWS_MIGHT_MATCH if bound_predicate is None. This way we could entangle the already involved plan_files method?

@sdd sdd force-pushed the add-inclusive-metrics-evaluator branch 2 times, most recently from e3fa225 to 34ca88e Compare April 30, 2024 18:52
@sdd sdd force-pushed the add-inclusive-metrics-evaluator branch from 34ca88e to 09ec3c4 Compare May 4, 2024 00:04
@sdd sdd changed the title [WIP]: Add InclusiveMetricsEvaluator Add InclusiveMetricsEvaluator May 4, 2024
@sdd sdd marked this pull request as ready for review May 4, 2024 08:22
@sdd sdd force-pushed the add-inclusive-metrics-evaluator branch 3 times, most recently from 1cf1d5e to 4432826 Compare May 4, 2024 15:49
@sdd
Copy link
Contributor Author

sdd commented May 4, 2024

FAO @Fokko @liurenjie1024 @marvinlanhenke:

I've finished adding tests for this - it's ready for review, PTAL! 😄

Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdd
Thank you so much for your work here. LGTM. I had just one question and some minor nits. LGTM!! And thanks for porting the test-suite - this is a lot of work.

if !include_empty_files && data_file.record_count == 0 {
return ROWS_CANNOT_MATCH;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this extra check (for older versions) as well, or can we ignore this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko what's your opinion on this? I don't know enough about why this is here in the other implementations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

record_count is a u64, so it can't ever have a value of -1, and so that check doesn't make sense to have in iceberg-rust as it is right now. Not unless we change record_count to be an i64.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case we may throw an error, since it's u64.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I was touching the grass. I think it is fair to leave out the check and rely on u64 👍

datum: &Datum,
_predicate: &BoundPredicate,
) -> crate::Result<bool> {
self.visit_inequality(reference, datum, PartialOrd::lt, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nice!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it as well, very elegant 👍 Out of curiosity, is there a rust-argument about why not passing in the bound directly:

Suggested change
self.visit_inequality(reference, datum, PartialOrd::lt, true)
self.visit_inequality(reference, datum, PartialOrd::lt, self.lower_bound(field_id))

@sdd sdd force-pushed the add-inclusive-metrics-evaluator branch 2 times, most recently from 5260ae6 to 1a2c925 Compare May 6, 2024 10:43
@sdd sdd force-pushed the add-inclusive-metrics-evaluator branch from 1a2c925 to 072ee2a Compare May 6, 2024 10:47
Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @sdd for this great pr and all the tests!

if !include_empty_files && data_file.record_count == 0 {
return ROWS_CANNOT_MATCH;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case we may throw an error, since it's u64.

@liurenjie1024 liurenjie1024 merged commit e1c10b5 into apache:main May 19, 2024
6 checks passed
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thanks for working on this @sdd and porting all the tests. I did the same for PyIceberg and it is quite a bit of work 👍

let nan_count = self.nan_count(field_id);
let value_count = self.value_count(field_id);

nan_count.is_some() && nan_count == value_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also want to check if value_count is not null?

datum: &Datum,
_predicate: &BoundPredicate,
) -> crate::Result<bool> {
self.visit_inequality(reference, datum, PartialOrd::lt, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it as well, very elegant 👍 Out of curiosity, is there a rust-argument about why not passing in the bound directly:

Suggested change
self.visit_inequality(reference, datum, PartialOrd::lt, true)
self.visit_inequality(reference, datum, PartialOrd::lt, self.lower_bound(field_id))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pruning data files according to index, statistics when reading.
4 participants