Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement BoundPredicateVisitor trait for ManifestFilterVisitor #367

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

s-akhtar-baig
Copy link
Contributor

@s-akhtar-baig s-akhtar-baig commented May 9, 2024

GitHub issue: #350

Description: ManifestEvaluator was implemented in #322 whereas some functions were unimplemented. This PR implements the remaining functions and adds most of the Python unit tests.

Testing: Added new unit tests.

Copy link
Contributor

@sdd sdd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks so much for the contribution! Just a few small issues that are straightforward to resolve. 🙌🏼

crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
@s-akhtar-baig
Copy link
Contributor Author

@sdd, thank you for reviewing the changes and providing references! I have modified my code based on your suggestions. Please take a look and let me know if I miss anything.

@s-akhtar-baig s-akhtar-baig requested a review from sdd May 10, 2024 20:42
Copy link
Contributor

@sdd sdd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're almost there! Just a couple of small stylistic changes required and then I'm happy.

Thanks again! 😁

return ROWS_MIGHT_MATCH;
}

if let Some(Literal::Primitive(lower_bound)) = &field.lower_bound {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much cleaner! Thanks :-)

Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @s-akhtar-baig which looks really good - left some minor comments. However, please check the comment about 'comparison' and the and implementation and verify its correct.

crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
return ROWS_CANNOT_MATCH;
}

if self.are_all_null(field, &reference.field().field_type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this check redundant? If the partition contains no NaN values, we don't need to check if all values are null. If all values are null it cannot contain any NaN values - but we already know that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to @sdd's comment #367 (comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marvinlanhenke if field.contains_nan is None rather than Some(false) then this implies that the metrics don't indicate if the fields contain NaN or not. So the subsequent check for all nulls is still valid?

crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved

let prefix_len = prefix.chars().count();

if let Some(lower_bound) = &field.lower_bound {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: extract into helper fn, since its used in multiple places?

return ROWS_MIGHT_MATCH;
}

let truncated_upper_bound =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid the extra String allocation by using char_indices() to get the prefix_len and then use a slice for comparison let truncated_upper_bound = &upper_bound[..prefix_len]; // haven't tested it though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to @sdd's comment #367 (comment).

Copy link
Contributor

@sdd sdd May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @marvinlanhenke's suggestion is indeed a better one than mine here

Copy link
Contributor Author

@s-akhtar-baig s-akhtar-baig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marvinlanhenke @sdd, thank you for reviewing. I have modified the code accordingly and added comments for clarification. Let me know what you think.

return ROWS_MIGHT_MATCH;
}

let truncated_upper_bound =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to @sdd's comment #367 (comment).

return ROWS_CANNOT_MATCH;
}

if self.are_all_null(field, &reference.field().field_type) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to @sdd's comment #367 (comment).

crates/iceberg/src/expr/visitors/manifest_evaluator.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @s-akhtar-baig for this great pr, it looks great! I left some questions about the confusing part. Also I think one important thing is that we should not rely one the Ord of PrimitiveLiteral.

let field = self.field_summary_for_reference(reference);

if field.lower_bound.is_none() || field.upper_bound.is_none() {
return ROWS_CANNOT_MATCH;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it's ROWS_CANNOT_MATCH? I think if either is missing, we can't exclue it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liurenjie1024, I followed Python implementation https://github.com/apache/iceberg-python/blob/20f6afdf5f000ea5b167e804012f2000aa5b8573/pyiceberg/expressions/visitors.py#L639.

Please let me know if this is incorrect and if there is a different spec that I needed to follow.

_predicate: &BoundPredicate,
) -> crate::Result<bool> {
todo!()
// because the bounds are not necessarily a min or max value, this cannot be answered using
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confusing here, why lower/upper bound are not necessarily min/max value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let field = self.field_summary_for_reference(reference);

if field.lower_bound.is_none() || field.upper_bound.is_none() {
return ROWS_CANNOT_MATCH;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, why either is none, we can't match it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants