Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query functions #169

Merged
merged 35 commits into from
Jun 18, 2024
Merged

Query functions #169

merged 35 commits into from
Jun 18, 2024

Conversation

mkornacker
Copy link
Collaborator

No description provided.

@aaron-siegel
Copy link
Contributor

This change is Reviewable

@orm011
Copy link
Collaborator

orm011 commented May 15, 2024

tests/test_function.py line 246 at r1 (raw file):

    @pxt.query
    def lt_x(t, x: int) -> int:

I find this pattern of defining this test query function as a method within a test class confusing, what's the type of its first argument? (and what is the effect of the decorator in this context).

Can we just move it outside the class to make it clear what this is (I'm kind of using it as documentation)

Copy link
Collaborator Author

@mkornacker mkornacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 23 files reviewed, 1 unresolved discussion (waiting on @orm011 and @udf)


tests/test_function.py line 246 at r1 (raw file):

Previously, orm011 (Oscar Moll) wrote…

I find this pattern of defining this test query function as a method within a test class confusing, what's the type of its first argument? (and what is the effect of the decorator in this context).

Can we just move it outside the class to make it clear what this is (I'm kind of using it as documentation)

We can do that, but I'm not quite following: how will that make anything clearer? It's going to be the same symbol, but you'll use it without the 'self.'

The type of the first argument is Table. Happy to add that.

The effect of the decorator is to turn it into an object that can then be used with Table.query(), analogously to @udf.

@orm011
Copy link
Collaborator

orm011 commented May 15, 2024

tests/test_function.py line 246 at r1 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

We can do that, but I'm not quite following: how will that make anything clearer? It's going to be the same symbol, but you'll use it without the 'self.'

The type of the first argument is Table. Happy to add that.

The effect of the decorator is to turn it into an object that can then be used with Table.query(), analogously to @udf.

What I mean is: if you didn't have that decorator, self.lt_x(1) would bind an instance of TestFunction to t and thats what t would be. The decorator somehow makes this work with t being a table, but its really not obvious to me what's going on here.

Copy link
Contributor

@aaron-siegel aaron-siegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 23 files reviewed, 7 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)


pixeltable/dataframe.py line 346 at r2 (raw file):

    def _vars(self) -> dict[str, exprs.Variable]:
        """Return a dict mapping parameter name to Variable"""

Something like this: would be clearer (if I correctly understand what it's doing): "Returns a dict mapping parameter names to Variables, ranging over all Variables appearing in expressions that define this DataFrame."


pixeltable/dataframe.py line 366 at r2 (raw file):

    def parameters(self) -> dict[str, ColumnType]:
        """Return a dict mapping parameter name to parameter type"""

Same comment as above.


pixeltable/dataframe.py line 459 at r2 (raw file):

                exprs.Expr.list_substitute(order_by_exprs, var_expr, arg_expr)
        select_list = list(zip(select_list_exprs, self._column_names))
        order_by_clause: Optional[List[Tuple[exprs.Expr, bool]]] = None

use lowercase list/tuple


pixeltable/func/signature.py line 22 at r2 (raw file):

    col_type: Optional[ts.ColumnType]  # None for variable parameters
    kind: enum.Enum  # inspect.Parameter.kind; inspect._ParameterKind is private
    # for some reason, this needs to precede is_batched in the dataclass definition

What happens if it doesn't?


pixeltable/func/signature.py line 104 at r2 (raw file):

        result = {
            'return_type': self.get_return_type().as_dict(),
            'parameters': [p.as_dict() for p in self.parameters.values()],

These are some pretty substantial schema changes; we'll need to bump the schema version, right?


pixeltable/func/signature.py line 160 at r2 (raw file):

            param_types: Optional[List[ts.ColumnType]] = None
    ) -> List[Parameter]:
        assert (py_fn is None) != (py_params is None)

I see this pattern being used a lot, and I don't love it. Isn't it more typical in Python to have a single (non-Optional) parameter of type

Union[Callable, list[inspect.Parameter]]

? To me that makes the intended semantics much clearer.


tests/test_function.py line 246 at r1 (raw file):

Previously, orm011 (Oscar Moll) wrote…

What I mean is: if you didn't have that decorator, self.lt_x(1) would bind an instance of TestFunction to t and thats what t would be. The decorator somehow makes this work with t being a table, but its really not obvious to me what's going on here.

We should definitely add the type hint. I kinda need to wrap my head around this too though.

I think of @udf as an operator that "lifts" a Python function, which acts on values, to a Pixeltable Function, which acts on columns of values.

Is it correct to think of @query as doing the same thing but with an additional Table parameter: the Python function acts on (Table, values) and the lifted Function acts on (Table, columns)?

@udf
Copy link

udf commented May 15, 2024

Hi, I was accidentally mentioned here. Not sure how to opt out of the review on reviewable, or if i even have to.

Copy link
Contributor

@aaron-siegel aaron-siegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a Python decorator @udf that matches your username; it seems that github tags you every time we mention it. I'm not sure how to disable this behavior.

Reviewable status: 0 of 27 files reviewed, 7 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)

@udf
Copy link

udf commented May 16, 2024

Don't worry, I'm used to it. Putting it in monospace makes it not mention me, but if I'm tagged once then reviewable will see me as a mentionee and keep tagging me. And I've yet to find a way to "unmention" myself from reviewable.

Copy link

@udf udf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 15 of 23 files at r1, 5 of 5 files at r2, 5 of 7 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @mkornacker and @orm011)

@mkornacker mkornacker closed this Jun 8, 2024
@mkornacker mkornacker deleted the query branch June 8, 2024 16:51
@mkornacker mkornacker restored the query branch June 8, 2024 16:56
@mkornacker mkornacker reopened this Jun 8, 2024
Copy link
Contributor

@aaron-siegel aaron-siegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 7 of 31 files reviewed, 15 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)


pixeltable/dataframe.py line 378 at r6 (raw file):

        return {name: var.col_type for name, var in vars.items()}

    def _exec(self, conn: Optional[sql.engine.Connection] = None) -> Generator[exprs.DataRow, None, None]:

I think Iterator[DataRow] the preferred synonym for Generator[DataRow, None, None]?


pixeltable/dataframe.py line 407 at r6 (raw file):

        def exec_plan(conn: sql.engine.Connection) -> Generator[exprs.DataRow, None, None]:
            plan.ctx.set_conn(conn)
            plan.open()

Better to use a context manager pattern here e.g. something like:

with plan.open(conn):
    for row_batch in plan:
        # etc

It's generally preferred to a try/finally pattern with explicit close.

Not necessary to make the change for this review, but I think we should note this as a future refactoring of the planner code


pixeltable/dataframe.py line 735 at r6 (raw file):

        d = {
            '_classname': 'DataFrame',
            'tbl': self.tbl.as_dict(),

Does this entail a schema version change?


pixeltable/catalog/table.py line 56 at r6 (raw file):

    @property
    def _tbl_version(self) -> TableVersion:

I wasn't aware this existed until now!

But why is it self.tbl_version_path but self._tbl_version? Shouldn't they both (or neither) have underscores for consistency? That's probably why I never discovered it.


pixeltable/catalog/table.py line 61 at r6 (raw file):

    def __hash__(self) -> int:
        return hash(self._tbl_version().id)

Yikes! _tbl_version is now a property so this needs to be changed. This suggests this code path isn't being tested.


pixeltable/catalog/table.py line 76 at r6 (raw file):

        return getattr(self.tbl_version_path, name)

    def __getitem__(self, index: object) -> Union['pixeltable.exprs.ColumnRef', 'pixeltable.dataframe.DataFrame']:

__getitem__ needs to have pixeltable.func.QueryTemplateFunction as a type in its union too, correct? Or is this access pattern not supported for query functions?


pixeltable/catalog/table.py line 734 at r6 (raw file):

    @overload
    def query(self, *, param_types: Optional[List[ts.ColumnType]] = None) -> Callable: ...

Use a qualified return type

Callable[[Callable], QueryTemplateFunction]

so that the typechecker will know what the decorator spits out.


pixeltable/catalog/table.py line 764 at r6 (raw file):

        else:
            assert len(args) == 0 and len(kwargs) == 1 and 'param_types' in kwargs
            return lambda py_fn: decorator(py_fn, kwargs['param_types'])

This is kind of a weird pattern and might confuse the typechecker (and me).

It'd be clearer to use the name make_query_template (or some such) for the thing we're now calling decorator, and change this to:

        if len(args) == 1:
            assert len(kwargs) == 0 and callable(args[0])
            return make_query_template(args[0], None)
        else:
            assert len(args) == 0 and len(kwargs) == 1 and 'param_types' in kwargs
            def decorator(py_fn: Callable) -> 'pixeltable.func.QueryTemplateFunction':
                return make_query_template(py_fn, kwargs['param_types'])
            return decorator

It's slightly more verbose, but (1) the decorator now has proper type hints and (2) the thing we're calling decorator actually is a decorator.

Same comment for the code in expr_udf (which I guess I overlooked when I did that code review)


pixeltable/exprs/column_ref.py line 116 at r6 (raw file):

    @classmethod
    def _from_dict(cls, d: Dict, _: List[Expr]) -> Expr:

use lowercase dict/list

Copy link
Collaborator Author

@mkornacker mkornacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 33 files reviewed, 15 unresolved discussions (waiting on @aaron-siegel, @orm011, and @udf)


pixeltable/dataframe.py line 346 at r2 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Something like this: would be clearer (if I correctly understand what it's doing): "Returns a dict mapping parameter names to Variables, ranging over all Variables appearing in expressions that define this DataFrame."

Done.


pixeltable/dataframe.py line 366 at r2 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Same comment as above.

Done.


pixeltable/dataframe.py line 459 at r2 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

use lowercase list/tuple

Done.


pixeltable/dataframe.py line 378 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

I think Iterator[DataRow] the preferred synonym for Generator[DataRow, None, None]?

Done.


pixeltable/dataframe.py line 735 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Does this entail a schema version change?

Yes


pixeltable/catalog/table.py line 56 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

I wasn't aware this existed until now!

But why is it self.tbl_version_path but self._tbl_version? Shouldn't they both (or neither) have underscores for consistency? That's probably why I never discovered it.

Done.


pixeltable/catalog/table.py line 61 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Yikes! _tbl_version is now a property so this needs to be changed. This suggests this code path isn't being tested.

Done.


pixeltable/catalog/table.py line 76 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

__getitem__ needs to have pixeltable.func.QueryTemplateFunction as a type in its union too, correct? Or is this access pattern not supported for query functions?

That's in getattr() (I hate that it collapses lines w/o changes; if it didn't, you'd have seen getattr() directly above it).


pixeltable/catalog/table.py line 764 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

This is kind of a weird pattern and might confuse the typechecker (and me).

It'd be clearer to use the name make_query_template (or some such) for the thing we're now calling decorator, and change this to:

        if len(args) == 1:
            assert len(kwargs) == 0 and callable(args[0])
            return make_query_template(args[0], None)
        else:
            assert len(args) == 0 and len(kwargs) == 1 and 'param_types' in kwargs
            def decorator(py_fn: Callable) -> 'pixeltable.func.QueryTemplateFunction':
                return make_query_template(py_fn, kwargs['param_types'])
            return decorator

It's slightly more verbose, but (1) the decorator now has proper type hints and (2) the thing we're calling decorator actually is a decorator.

Same comment for the code in expr_udf (which I guess I overlooked when I did that code review)

Done.


pixeltable/exprs/column_ref.py line 116 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

use lowercase dict/list

Done.


pixeltable/func/signature.py line 22 at r2 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

What happens if it doesn't?

Done.


pixeltable/func/signature.py line 104 at r2 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

These are some pretty substantial schema changes; we'll need to bump the schema version, right?

Done.


pixeltable/func/signature.py line 160 at r2 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

I see this pattern being used a lot, and I don't love it. Isn't it more typical in Python to have a single (non-Optional) parameter of type

Union[Callable, list[inspect.Parameter]]

? To me that makes the intended semantics much clearer.

I absolutely hate unions, especially in this case, where they really are two completely different things (which to me indicates that they should not be called the same thing). Also, doing that will/would make typechecking useless.


tests/test_function.py line 246 at r1 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

We should definitely add the type hint. I kinda need to wrap my head around this too though.

I think of @udf as an operator that "lifts" a Python function, which acts on values, to a Pixeltable Function, which acts on columns of values.

Is it correct to think of @query as doing the same thing but with an additional Table parameter: the Python function acts on (Table, values) and the lifted Function acts on (Table, columns)?

Let's discuss verbally.

Copy link
Contributor

@aaron-siegel aaron-siegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 35 files reviewed, 9 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)


pixeltable/dataframe.py line 453 at r8 (raw file):

        where_clause = copy.deepcopy(self.where_clause)
        group_by_clause = copy.deepcopy(self.group_by_clause)
        order_by_exprs = copy.deepcopy(

Two random stylistic comments:

[copy.deepcopy(order_by_expr) for order_by_expr, _ in self.order_by_clause]

is a little clearer to me, with the advantage that we're not creating a list just to copy it. And,

<expression> if self.order_by_clause is not None else None

can be shortened to

<expression> if self.order_by_clause else None

(I'm assuming Expr instances are always truthy.) I've started preferring the latter, though I'm not aware of an established convention for this.


pixeltable/catalog/table.py line 76 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

That's in getattr() (I hate that it collapses lines w/o changes; if it didn't, you'd have seen getattr() directly above it).

I understand, but doesn't __getitem__ need it too? I'm assuming that t['func_name'] can resolve a query function just like t.func_name; if not, I think it should (otherwise the lack of parity between them is a little confusing)


pixeltable/catalog/table.py line 764 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Done.

lgtm. And ah, I guess the lambda is ok because the type hint is on the overloaded method, so we don't need another type hint here.


pixeltable/catalog/table.py line 70 at r8 (raw file):

            self, name: str
    ) -> Union['pixeltable.exprs.ColumnRef', 'pixeltable.func.QueryTemplateFunction']:
        """Return a ColumnRef for the given column name.

docstring should be updated since this can return more than a ColumnRef now


pixeltable/metadata/converters/convert_15.py line 16 at r8 (raw file):

def convert_15(engine: sql.engine.Engine) -> None:

Also add some @query instances to create_test_db_dump. (test_migration also tests loading the v15 dump from scratch, in addition to migrating older dumps, so we get validation of the @query metadata write/read loop)


pixeltable/metadata/converters/convert_15.py line 34 at r8 (raw file):

        col_type = ts.ColumnType.from_dict(col_type_dict) if col_type_dict is not None else None
        default = py_params[name].default
        kind = inspect._ParameterKind(kind_int)  # is there a way to avoid referecing a private type?

there doesn't appear to be, other than maybe?

KINDS = [POSITIONAL_ONLY, POSITIONAL_OR_KEYWORD, VAR_POSITIONAL, KEYWORD_ONLY, VAR_KEYWORD]

# later:
kind = KINDS[kind_int]

I don't think construction of inspect.Parameters is meant to be a public facing operation; the intent is they only show up as the output of inspect.signature and so on. It might be risky to regard the _ParameterKind enum as anything other than an implementation detail of the inspect.Parameter.POSITIONAL_ONLY (etc) constants.

(I do see that this is only being used in the converter & scrupulously avoided elsewhere)


tests/test_function.py line 246 at r1 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Let's discuss verbally.

My previous comment is obsolete; the @query pattern has changed a lot since then, and I'm happy with the outcome.

Copy link
Collaborator Author

@mkornacker mkornacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 35 files reviewed, 9 unresolved discussions (waiting on @aaron-siegel, @orm011, and @udf)


pixeltable/dataframe.py line 453 at r8 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Two random stylistic comments:

[copy.deepcopy(order_by_expr) for order_by_expr, _ in self.order_by_clause]

is a little clearer to me, with the advantage that we're not creating a list just to copy it. And,

<expression> if self.order_by_clause is not None else None

can be shortened to

<expression> if self.order_by_clause else None

(I'm assuming Expr instances are always truthy.) I've started preferring the latter, though I'm not aware of an established convention for this.

Done.

I left the explicit comparison to None in place, though - the Google style guide wants that.


pixeltable/catalog/table.py line 76 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

I understand, but doesn't __getitem__ need it too? I'm assuming that t['func_name'] can resolve a query function just like t.func_name; if not, I think it should (otherwise the lack of parity between them is a little confusing)

Can we leave this out of this? It's unclear whether we want to keep getitem for anything other than column references, because you can have non-id column names. But that doesn't apply to queries.


pixeltable/catalog/table.py line 70 at r8 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

docstring should be updated since this can return more than a ColumnRef now

Done.


pixeltable/metadata/converters/convert_15.py line 16 at r8 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Also add some @query instances to create_test_db_dump. (test_migration also tests loading the v15 dump from scratch, in addition to migrating older dumps, so we get validation of the @query metadata write/read loop)

Done.

@aaron-siegel aaron-siegel requested a review from orm011 June 18, 2024 05:47
Copy link
Contributor

@aaron-siegel aaron-siegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 35 files reviewed, 9 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)


pixeltable/catalog/table.py line 76 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Can we leave this out of this? It's unclear whether we want to keep getitem for anything other than column references, because you can have non-id column names. But that doesn't apply to queries.

Sure, we don't have to resolve it in this PR. But I tend to think it'll confuse ppl if t.my_symbol and t['my_symbol'] behave similarly, but not the same. The vast majority of users' exposure to these patterns will involve dereferencing columns, and as a result they'll expect them to be synonyms, and likely be surprised to find that's only the case for columns and not queries. (I'd be surprised.)


pixeltable/tool/create_test_db_dump.py line 252 at r9 (raw file):

            #return t.where(t.c2 < i)
            return t.where(t.c2 < i).select(t.c1, t.c2)
        add_column('query_output', t.q1(t.c2))

it'd be good to have an instance that uses similarity too, since SimilarityExpr can't (currently) be assigned directly to a column, and therefore isn't yet covered by the db dumps (that is, I believe the only way to get a SimilarityExpr into the dict is through a @query, but pls correct me if that's wrong).

Copy link
Collaborator Author

@mkornacker mkornacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 35 files reviewed, 9 unresolved discussions (waiting on @aaron-siegel, @orm011, and @udf)


pixeltable/catalog/table.py line 76 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Sure, we don't have to resolve it in this PR. But I tend to think it'll confuse ppl if t.my_symbol and t['my_symbol'] behave similarly, but not the same. The vast majority of users' exposure to these patterns will involve dereferencing columns, and as a result they'll expect them to be synonyms, and likely be surprised to find that's only the case for columns and not queries. (I'd be surprised.)

Done.


pixeltable/tool/create_test_db_dump.py line 252 at r9 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

it'd be good to have an instance that uses similarity too, since SimilarityExpr can't (currently) be assigned directly to a column, and therefore isn't yet covered by the db dumps (that is, I believe the only way to get a SimilarityExpr into the dict is through a @query, but pls correct me if that's wrong).

Done.

@mkornacker mkornacker merged commit a1ea537 into pixeltable:master Jun 18, 2024
14 checks passed
@mkornacker mkornacker deleted the query branch June 18, 2024 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants