feat: add ExpressionEvaluator #363

Open
wants to merge 33 commits into
base: main

Conversation

marvinlanhenke
Contributor

@marvinlanhenke marvinlanhenke commented May 1, 2024

Which issue does this PR close?

Closes #358

Rationale for this change

  • adds the capability to prune DataFiles in TableScan

What changes are included in this PR?

  • feat: implementation of ExpressionEvaluator
  • feat: add expression_evaluator_cache & integrate in TableScan

Are these changes tested?

Yes, unit tests for expression evaluator are included.

@marvinlanhenke marvinlanhenke marked this pull request as draft May 1, 2024 04:48
@marvinlanhenke marvinlanhenke marked this pull request as ready for review May 2, 2024 18:59
@marvinlanhenke
Contributor Author

@Fokko @liurenjie1024 @sdd
PTAL. Implementation based on pyiceberg.

@marvinlanhenke marvinlanhenke changed the title [WIP] feat: Add ExpressionEvaluator feat: add ExpressionEvaluator May 2, 2024
Contributor

@sdd sdd left a comment

Looks good, with just a small structural suggestion. Thanks @marvinlanhenke!

Contributor

@sdd sdd left a comment

I think we might want to consolidate some of the test setup for the recently added visitors into a single place at some point, as there is a lot of commonality and it would keep the test section cleaner and easier to navigate. We can address this in a future clean-up rather than in this PR. Looks great! Thanks!

Collaborator

@liurenjie1024 liurenjie1024 left a comment

@marvinlanhenke Thanks for this PR, and all the tests! Generally it looks good to me, but I have a concern about reusing the Ord trait of PrimitiveLiteral. I think that is incorrect, since it should only implement PartialOrd. I also think we should implement only PartialOrd for Datum and use it in this filter.


/// Checks if the [`PrimitiveLiteral`] is null.
fn is_null(literal: &PrimitiveLiteral) -> bool {
if let PrimitiveLiteral::Boolean(false) = literal {
Collaborator

I'm confused here: shouldn't is_null just check whether the Option<Datum> is None?

Contributor Author

when the DataFile contains a partition struct like:

let partition = Struct::from_iter([None]);

the Struct fields will contain a Literal::Primitive(PrimitiveLiteral::Boolean(false))

impl FromIterator<Option<Literal>> for Struct {
    fn from_iter<I: IntoIterator<Item = Option<Literal>>>(iter: I) -> Self {
        let mut fields = Vec::new();
        let mut null_bitmap = BitVec::new();

        for value in iter.into_iter() {
            match value {
                Some(value) => {
                    fields.push(value);
                    null_bitmap.push(false)
                }
                None => {
                    fields.push(Literal::Primitive(PrimitiveLiteral::Boolean(false)));
                    null_bitmap.push(true)
                }
            }
        }
        Struct {
            fields,
            null_bitmap,
        }
    }
}

That's why I check the Result<Datum> returned by the Accessor.

Collaborator

I think this is incorrect. PrimitiveLiteral::Boolean here is just a placeholder, which could equally be something else. Otherwise, how do you distinguish it from an actual value of PrimitiveLiteral::Boolean(false)? And the Accessor should return Option<Literal>.

Contributor Author

> Otherwise how do you distinguish it from actual value of PrimitiveLiteral::Boolean(false). And the Accessor should return Option<Literal>.

Yeah, this makes sense to me.
Regarding the Accessor: will you open another issue, or @sdd, is this something you might want to take a look at?

Since I'm not too familiar with the Accessor, and to verify my understanding: in order to return Option here, we should check the null_bitmap of the Struct at the given position, and if the null_bitmap entry is true we return None from the Accessor?
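
To make the idea concrete, here is a minimal sketch of a null-aware lookup. It is not the crate's actual Accessor: the two-variant `Literal`, the `Vec<bool>` bitmap (in place of `BitVec`), and the `get` method are all simplified stand-ins assumed for illustration.

```rust
use std::iter::FromIterator;

// Simplified stand-in for the real Literal type (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum Literal {
    Boolean(bool),
    Int(i32),
}

// Simplified stand-in for Struct; a Vec<bool> replaces the BitVec (assumption).
#[derive(Debug, Default)]
pub struct Struct {
    fields: Vec<Literal>,
    // `true` marks a null slot.
    null_bitmap: Vec<bool>,
}

impl FromIterator<Option<Literal>> for Struct {
    fn from_iter<I: IntoIterator<Item = Option<Literal>>>(iter: I) -> Self {
        let mut s = Struct::default();
        for value in iter {
            match value {
                Some(v) => {
                    s.fields.push(v);
                    s.null_bitmap.push(false);
                }
                None => {
                    // Placeholder only: it keeps positions aligned and
                    // must never be read back as a real value.
                    s.fields.push(Literal::Boolean(false));
                    s.null_bitmap.push(true);
                }
            }
        }
        s
    }
}

impl Struct {
    /// Hypothetical null-aware lookup: returns `None` for a null slot
    /// or an out-of-range position, otherwise the stored literal.
    pub fn get(&self, pos: usize) -> Option<&Literal> {
        match self.null_bitmap.get(pos).copied() {
            Some(false) => self.fields.get(pos),
            _ => None,
        }
    }
}
```

With this shape, a null slot and a genuine `Boolean(false)` are no longer confused: the bitmap decides, and the placeholder is never inspected.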

Collaborator

Tracked here: #379

Contributor

Hey @marvinlanhenke - I'll fix that Accessor bug, no problem.

return Ok(false);
}

Ok(datum.literal() < literal.literal())
Collaborator

I think this is incorrect, since PrimitiveLiteral should have a partial order rather than a full order. My suggestion is to implement PartialOrd for Datum and remove Ord for PrimitiveLiteral.

Contributor Author

I'm not sure about removing Ord from PrimitiveLiteral, since it's also required by Literal. However, I think I get your point, so implementing PartialOrd for Datum and then comparing those instead of the primitive literals makes sense.

Perhaps like this:

impl PartialOrd for Datum {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        if self.r#type != other.r#type {
            return None;
        }

        self.literal.partial_cmp(&other.literal)
    }
}
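
As a rough, self-contained illustration of how a filter could consume such an impl, the sketch below pairs it with a hypothetical `less_than` helper. The two-variant `PrimitiveType` and `PrimitiveLiteral`, the public fields, and the rule that a null value or an incomparable pair never matches are assumptions for the example, not the crate's definitions.

```rust
use std::cmp::Ordering;

// Cut-down stand-ins for the real types (assumption).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PrimitiveType {
    Int,
    Long,
}

#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub enum PrimitiveLiteral {
    Int(i32),
    Long(i64),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Datum {
    pub r#type: PrimitiveType,
    pub literal: PrimitiveLiteral,
}

impl PartialOrd for Datum {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        // Values of different types are treated as incomparable.
        if self.r#type != other.r#type {
            return None;
        }
        self.literal.partial_cmp(&other.literal)
    }
}

/// Hypothetical predicate check: a null value (`None`) or an
/// incomparable pair never satisfies `value < bound`.
pub fn less_than(value: Option<&Datum>, bound: &Datum) -> bool {
    match value {
        Some(v) => v.partial_cmp(bound) == Some(Ordering::Less),
        None => false,
    }
}
```

Returning `None` from `partial_cmp` lets the caller decide what "incomparable" means for pruning, rather than silently ordering unrelated types.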

Collaborator

I'm not sure why Literal should be Ord: what's the order of a struct compared with a map? For Datum, I think we can start with this approach, with further refinement later to compare compatible types, for example float vs double, int vs long, etc.
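
One possible shape for that refinement is sketched below. The `Value` enum is a hypothetical stand-in for Datum, and the choice to widen int to long and float to double before comparing is an assumption about what "compatible" would mean, not a settled design.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for Datum, carrying the value directly.
#[derive(Debug, Clone, Copy)]
pub enum Value {
    Boolean(bool),
    Int(i32),
    Long(i64),
    Float(f32),
    Double(f64),
}

impl PartialOrd for Value {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        use Value::*;
        match (*self, *other) {
            (Boolean(a), Boolean(b)) => a.partial_cmp(&b),
            (Int(a), Int(b)) => a.partial_cmp(&b),
            (Long(a), Long(b)) => a.partial_cmp(&b),
            // int vs long: widen the int losslessly, then compare.
            (Int(a), Long(b)) => i64::from(a).partial_cmp(&b),
            (Long(a), Int(b)) => a.partial_cmp(&i64::from(b)),
            (Float(a), Float(b)) => a.partial_cmp(&b),
            (Double(a), Double(b)) => a.partial_cmp(&b),
            // float vs double: widen the float losslessly, then compare.
            (Float(a), Double(b)) => f64::from(a).partial_cmp(&b),
            (Double(a), Float(b)) => a.partial_cmp(&f64::from(b)),
            // Everything else (e.g. int vs boolean) is incomparable.
            _ => None,
        }
    }
}

// Equality is defined through partial_cmp so the two stay consistent.
impl PartialEq for Value {
    fn eq(&self, other: &Self) -> bool {
        self.partial_cmp(other) == Some(Ordering::Equal)
    }
}
```

Under these assumptions `Value::Int(1) < Value::Long(2)` holds, while comparing an int with a boolean, or anything with a NaN float, yields `None`.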

Contributor Author

I think it's best to make the changes as outlined in #378 to unblock this PR.

Collaborator

+1.

Successfully merging this pull request may close these issues.

Implement ExpressionEvaluator
3 participants