Implement BoundPredicateVisitor trait for ManifestFilterVisitor #367
Looks great, thanks so much for the contribution! Just a few small issues that are straightforward to resolve. 🙌🏼
@sdd, thank you for reviewing the changes and providing references! I have modified my code based on your suggestions. Please take a look and let me know if I missed anything.
We're almost there! Just a couple of small stylistic changes required and then I'm happy.
Thanks again! 😁
    return ROWS_MIGHT_MATCH;
}

if let Some(Literal::Primitive(lower_bound)) = &field.lower_bound {
This is much cleaner! Thanks :-)
Thanks for this PR @s-akhtar-baig, which looks really good. I left some minor comments. However, please check the comment about 'comparison' and the 'and' implementation, and verify that it is correct.
    return ROWS_CANNOT_MATCH;
}

if self.are_all_null(field, &reference.field().field_type) {
Isn't this check redundant? If the partition contains no NaN values, we don't need to check whether all values are null. If all values are null, it cannot contain any NaN values, but we already know that.
Please refer to @sdd's comment #367 (comment).
@marvinlanhenke if field.contains_nan is None rather than Some(false), then this implies that the metrics don't indicate whether the field contains NaN or not. So the subsequent check for all nulls is still valid?
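The point about None versus Some(false) can be sketched as below. This is a minimal illustration rather than the PR's code: the function name `is_nan_might_match` and its two plain parameters are hypothetical stand-ins for the field summary's `contains_nan` flag and the result of the all-null check.

```rust
/// Hypothetical sketch of an `is_nan` evaluation over partition metrics.
/// `contains_nan` is tri-state: Some(false), Some(true), or None (unknown).
/// `all_null` stands in for the result of an all-null check on the field.
fn is_nan_might_match(contains_nan: Option<bool>, all_null: bool) -> bool {
    // The metrics explicitly rule out NaN values: no row can match.
    if contains_nan == Some(false) {
        return false;
    }
    // Reached when the metrics say Some(true) or say nothing at all (None).
    // In the None case the all-null check still adds information:
    // a field that holds only nulls cannot hold a NaN.
    if all_null {
        return false;
    }
    true
}
```

With `contains_nan == None`, the first check is skipped, so the all-null check is the only way left to exclude the partition.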
let prefix_len = prefix.chars().count();

if let Some(lower_bound) = &field.lower_bound {
Suggestion: extract this into a helper fn, since it's used in multiple places?
    return ROWS_MIGHT_MATCH;
}

let truncated_upper_bound =
I think we can avoid the extra String allocation by using char_indices() to get the prefix_len and then using a slice for the comparison: let truncated_upper_bound = &upper_bound[..prefix_len];
// haven't tested it though
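A minimal sketch of the allocation-free truncation suggested here, with a hypothetical helper name `truncate_chars`. One detail matters: `chars().count()` yields a number of characters, while slicing a `&str` takes a byte offset, so the sketch uses `char_indices().nth(n)` to find the byte offset of the n-th character before slicing.

```rust
/// Hypothetical helper: borrow the first `n` characters of `s`
/// without allocating a new String.
fn truncate_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `byte_idx` is the byte offset where character number `n` starts,
        // so everything before it is exactly the first `n` characters.
        Some((byte_idx, _)) => &s[..byte_idx],
        // `s` has `n` characters or fewer: return it unchanged.
        None => s,
    }
}
```

A bound could then be compared against the prefix with something like `truncate_chars(upper_bound, prefix_len) < prefix`, where `prefix_len` is the character count of the prefix.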
Please refer to @sdd's comment #367 (comment).
I think @marvinlanhenke's suggestion is indeed a better one than mine here
@marvinlanhenke @sdd, thank you for reviewing. I have modified the code accordingly and added comments for clarification. Let me know what you think.
    return ROWS_MIGHT_MATCH;
}

let truncated_upper_bound =
Please refer to @sdd's comment #367 (comment).
    return ROWS_CANNOT_MATCH;
}

if self.are_all_null(field, &reference.field().field_type) {
Please refer to @sdd's comment #367 (comment).
Thanks @s-akhtar-baig for this great PR, it looks great! I left some questions about the confusing parts. Also, I think one important thing is that we should not rely on the Ord of PrimitiveLiteral.
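One possible reason for this caution (an assumption; the reviewer's rationale is not spelled out in this thread) is that a derived Ord on an enum compares the variant first and the payload second, so literals of different types still compare, but the result is meaningless. The sketch below uses a hypothetical `Lit` enum, not the crate's actual PrimitiveLiteral, and a hypothetical `typed_cmp` helper that only compares values of the same variant.

```rust
use std::cmp::Ordering;

/// Hypothetical literal type for illustration only.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Lit {
    Int(i32),
    Long(i64),
    Str(String),
}

/// Compare two literals only when they have the same variant;
/// return None for mixed types instead of an arbitrary ordering.
fn typed_cmp(a: &Lit, b: &Lit) -> Option<Ordering> {
    match (a, b) {
        (Lit::Int(x), Lit::Int(y)) => Some(x.cmp(y)),
        (Lit::Long(x), Lit::Long(y)) => Some(x.cmp(y)),
        (Lit::Str(x), Lit::Str(y)) => Some(x.cmp(y)),
        _ => None,
    }
}
```

With the derived ordering, `Lit::Int(100) < Lit::Long(1)` evaluates to true simply because `Int` is declared before `Long`, whereas `typed_cmp` returns None for that pair and leaves the decision to the caller.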
let field = self.field_summary_for_reference(reference);

if field.lower_bound.is_none() || field.upper_bound.is_none() {
    return ROWS_CANNOT_MATCH;
Why is it ROWS_CANNOT_MATCH? I think if either bound is missing, we can't exclude it?
@liurenjie1024, I followed the Python implementation: https://github.com/apache/iceberg-python/blob/20f6afdf5f000ea5b167e804012f2000aa5b8573/pyiceberg/expressions/visitors.py#L639.
Please let me know if this is incorrect and if there is a different spec that I needed to follow.
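For context, a sketch of the check under discussion. It rests on an assumption that this thread does not confirm: that a partition summary's bound is absent only when the field holds no non-null, non-NaN values. Under that assumption, an equality predicate against a non-null literal cannot match when a bound is missing. If bounds could instead be missing because statistics were never collected, returning "might match" would be the safe choice, which is the reviewer's concern. `Summary` and `eq_might_match` are hypothetical names, and the bounds are simplified to integers.

```rust
const ROWS_MIGHT_MATCH: bool = true;
const ROWS_CANNOT_MATCH: bool = false;

/// Hypothetical, simplified field summary with integer bounds.
struct Summary {
    lower_bound: Option<i64>,
    upper_bound: Option<i64>,
}

/// Could any row in the partition satisfy `field == literal`?
fn eq_might_match(field: &Summary, literal: i64) -> bool {
    match (field.lower_bound, field.upper_bound) {
        (Some(lower), Some(upper)) => {
            if literal < lower || literal > upper {
                ROWS_CANNOT_MATCH
            } else {
                ROWS_MIGHT_MATCH
            }
        }
        // Assumption (see the lead-in): a missing bound means there are
        // no non-null values, so a non-null literal cannot match.
        _ => ROWS_CANNOT_MATCH,
    }
}
```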
    _predicate: &BoundPredicate,
) -> crate::Result<bool> {
    todo!()
    // because the bounds are not necessarily a min or max value, this cannot be answered using
I'm a little confused here: why are the lower/upper bounds not necessarily the min/max values?
let field = self.field_summary_for_reference(reference);

if field.lower_bound.is_none() || field.upper_bound.is_none() {
    return ROWS_CANNOT_MATCH;
As above: why can't it match if either bound is None?
Same comment here. Followed https://github.com/apache/iceberg-python/blob/20f6afdf5f000ea5b167e804012f2000aa5b8573/pyiceberg/expressions/visitors.py#L731. Collapsed if statements on L722 and L731.
GitHub issue: #350
Description: ManifestEvaluator was implemented in #322, but some functions were left unimplemented. This PR implements the remaining functions and adds most of the Python unit tests.
Testing: Added new unit tests.