-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query functions #169
Query functions #169
Conversation
I find this pattern of defining this test query function as a method within a test class confusing, what's the type of its first argument? (and what is the effect of the decorator in this context). Can we just move it outside the class to make it clear what this is (I'm kind of using it as documentation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 23 files reviewed, 1 unresolved discussion (waiting on @orm011 and @udf)
tests/test_function.py
line 246 at r1 (raw file):
Previously, orm011 (Oscar Moll) wrote…
I find this pattern of defining this test query function as a method within a test class confusing, what's the type of its first argument? (and what is the effect of the decorator in this context).
Can we just move it outside the class to make it clear what this is (I'm kind of using it as documentation)
We can do that, but I'm not quite following: how will that make anything clearer? It's going to be the same symbol, but you'll use it without the 'self.'
The type of the first argument is Table. Happy to add that.
The effect of the decorator is to turn it into an object that can then be used with Table.query(), analogously to @udf.
Previously, mkornacker (Marcel Kornacker) wrote…
What I mean is: if you didn't have that decorator, self.lt_x(1) would bind an instance of TestFunction to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 23 files reviewed, 7 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)
pixeltable/dataframe.py
line 346 at r2 (raw file):
def _vars(self) -> dict[str, exprs.Variable]: """Return a dict mapping parameter name to Variable"""
Something like this: would be clearer (if I correctly understand what it's doing): "Returns a dict mapping parameter names to Variables, ranging over all Variables appearing in expressions that define this DataFrame."
pixeltable/dataframe.py
line 366 at r2 (raw file):
def parameters(self) -> dict[str, ColumnType]: """Return a dict mapping parameter name to parameter type"""
Same comment as above.
pixeltable/dataframe.py
line 459 at r2 (raw file):
exprs.Expr.list_substitute(order_by_exprs, var_expr, arg_expr) select_list = list(zip(select_list_exprs, self._column_names)) order_by_clause: Optional[List[Tuple[exprs.Expr, bool]]] = None
use lowercase list/tuple
pixeltable/func/signature.py
line 22 at r2 (raw file):
col_type: Optional[ts.ColumnType] # None for variable parameters kind: enum.Enum # inspect.Parameter.kind; inspect._ParameterKind is private # for some reason, this needs to precede is_batched in the dataclass definition
What happens if it doesn't?
pixeltable/func/signature.py
line 104 at r2 (raw file):
result = { 'return_type': self.get_return_type().as_dict(), 'parameters': [p.as_dict() for p in self.parameters.values()],
These are some pretty substantial schema changes; we'll need to bump the schema version, right?
pixeltable/func/signature.py
line 160 at r2 (raw file):
param_types: Optional[List[ts.ColumnType]] = None ) -> List[Parameter]: assert (py_fn is None) != (py_params is None)
I see this pattern being used a lot, and I don't love it. Isn't it more typical in Python to have a single (non-Optional) parameter of type
Union[Callable, list[inspect.Parameter]]
? To me that makes the intended semantics much clearer.
tests/test_function.py
line 246 at r1 (raw file):
Previously, orm011 (Oscar Moll) wrote…
What I mean is: if you didn't have that decorator, self.lt_x(1) would bind an instance of TestFunction to
t
and thats what t would be. The decorator somehow makes this work with t being a table, but its really not obvious to me what's going on here.
We should definitely add the type hint. I kinda need to wrap my head around this too though.
I think of @udf
as an operator that "lifts" a Python function, which acts on values, to a Pixeltable Function, which acts on columns of values.
Is it correct to think of @query
as doing the same thing but with an additional Table
parameter: the Python function acts on (Table, values) and the lifted Function acts on (Table, columns)?
Hi, I was accidentally mentioned here. Not sure how to opt out of the review on reviewable, or if i even have to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a Python decorator @udf
that matches your username; it seems that github tags you every time we mention it. I'm not sure how to disable this behavior.
Reviewable status: 0 of 27 files reviewed, 7 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)
Don't worry, I'm used to it. Putting it in monospace makes it not mention me, but if I'm tagged once then reviewable will see me as a mentionee and keep tagging me. And I've yet to find a way to "unmention" myself from reviewable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 23 files at r1, 5 of 5 files at r2, 5 of 7 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @mkornacker and @orm011)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 7 of 31 files reviewed, 15 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)
pixeltable/dataframe.py
line 378 at r6 (raw file):
return {name: var.col_type for name, var in vars.items()} def _exec(self, conn: Optional[sql.engine.Connection] = None) -> Generator[exprs.DataRow, None, None]:
I think Iterator[DataRow]
the preferred synonym for Generator[DataRow, None, None]
?
pixeltable/dataframe.py
line 407 at r6 (raw file):
def exec_plan(conn: sql.engine.Connection) -> Generator[exprs.DataRow, None, None]: plan.ctx.set_conn(conn) plan.open()
Better to use a context manager pattern here e.g. something like:
with plan.open(conn):
for row_batch in plan:
# etc
It's generally preferred to a try/finally pattern with explicit close.
Not necessary to make the change for this review, but I think we should note this as a future refactoring of the planner code
pixeltable/dataframe.py
line 735 at r6 (raw file):
d = { '_classname': 'DataFrame', 'tbl': self.tbl.as_dict(),
Does this entail a schema version change?
pixeltable/catalog/table.py
line 56 at r6 (raw file):
@property def _tbl_version(self) -> TableVersion:
I wasn't aware this existed until now!
But why is it self.tbl_version_path
but self._tbl_version
? Shouldn't they both (or neither) have underscores for consistency? That's probably why I never discovered it.
pixeltable/catalog/table.py
line 61 at r6 (raw file):
def __hash__(self) -> int: return hash(self._tbl_version().id)
Yikes! _tbl_version
is now a property so this needs to be changed. This suggests this code path isn't being tested.
pixeltable/catalog/table.py
line 76 at r6 (raw file):
return getattr(self.tbl_version_path, name) def __getitem__(self, index: object) -> Union['pixeltable.exprs.ColumnRef', 'pixeltable.dataframe.DataFrame']:
__getitem__
needs to have pixeltable.func.QueryTemplateFunction
as a type in its union too, correct? Or is this access pattern not supported for query functions?
pixeltable/catalog/table.py
line 734 at r6 (raw file):
@overload def query(self, *, param_types: Optional[List[ts.ColumnType]] = None) -> Callable: ...
Use a qualified return type
Callable[[Callable], QueryTemplateFunction]
so that the typechecker will know what the decorator spits out.
pixeltable/catalog/table.py
line 764 at r6 (raw file):
else: assert len(args) == 0 and len(kwargs) == 1 and 'param_types' in kwargs return lambda py_fn: decorator(py_fn, kwargs['param_types'])
This is kind of a weird pattern and might confuse the typechecker (and me).
It'd be clearer to use the name make_query_template
(or some such) for the thing we're now calling decorator
, and change this to:
if len(args) == 1:
assert len(kwargs) == 0 and callable(args[0])
return make_query_template(args[0], None)
else:
assert len(args) == 0 and len(kwargs) == 1 and 'param_types' in kwargs
def decorator(py_fn: Callable) -> 'pixeltable.func.QueryTemplateFunction':
return make_query_template(py_fn, kwargs['param_types'])
return decorator
It's slightly more verbose, but (1) the decorator now has proper type hints and (2) the thing we're calling decorator
actually is a decorator.
Same comment for the code in expr_udf
(which I guess I overlooked when I did that code review)
pixeltable/exprs/column_ref.py
line 116 at r6 (raw file):
@classmethod def _from_dict(cls, d: Dict, _: List[Expr]) -> Expr:
use lowercase dict/list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 of 33 files reviewed, 15 unresolved discussions (waiting on @aaron-siegel, @orm011, and @udf)
pixeltable/dataframe.py
line 346 at r2 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
Something like this: would be clearer (if I correctly understand what it's doing): "Returns a dict mapping parameter names to Variables, ranging over all Variables appearing in expressions that define this DataFrame."
Done.
pixeltable/dataframe.py
line 366 at r2 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
Same comment as above.
Done.
pixeltable/dataframe.py
line 459 at r2 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
use lowercase list/tuple
Done.
pixeltable/dataframe.py
line 378 at r6 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
I think
Iterator[DataRow]
the preferred synonym forGenerator[DataRow, None, None]
?
Done.
pixeltable/dataframe.py
line 735 at r6 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
Does this entail a schema version change?
Yes
pixeltable/catalog/table.py
line 56 at r6 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
I wasn't aware this existed until now!
But why is it
self.tbl_version_path
butself._tbl_version
? Shouldn't they both (or neither) have underscores for consistency? That's probably why I never discovered it.
Done.
pixeltable/catalog/table.py
line 61 at r6 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
Yikes!
_tbl_version
is now a property so this needs to be changed. This suggests this code path isn't being tested.
Done.
pixeltable/catalog/table.py
line 76 at r6 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
__getitem__
needs to havepixeltable.func.QueryTemplateFunction
as a type in its union too, correct? Or is this access pattern not supported for query functions?
That's in getattr() (I hate that it collapses lines w/o changes; if it didn't, you'd have seen getattr() directly above it).
pixeltable/catalog/table.py
line 764 at r6 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
This is kind of a weird pattern and might confuse the typechecker (and me).
It'd be clearer to use the name
make_query_template
(or some such) for the thing we're now callingdecorator
, and change this to:if len(args) == 1: assert len(kwargs) == 0 and callable(args[0]) return make_query_template(args[0], None) else: assert len(args) == 0 and len(kwargs) == 1 and 'param_types' in kwargs def decorator(py_fn: Callable) -> 'pixeltable.func.QueryTemplateFunction': return make_query_template(py_fn, kwargs['param_types']) return decorator
It's slightly more verbose, but (1) the decorator now has proper type hints and (2) the thing we're calling
decorator
actually is a decorator.Same comment for the code in
expr_udf
(which I guess I overlooked when I did that code review)
Done.
pixeltable/exprs/column_ref.py
line 116 at r6 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
use lowercase dict/list
Done.
pixeltable/func/signature.py
line 22 at r2 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
What happens if it doesn't?
Done.
pixeltable/func/signature.py
line 104 at r2 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
These are some pretty substantial schema changes; we'll need to bump the schema version, right?
Done.
pixeltable/func/signature.py
line 160 at r2 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
I see this pattern being used a lot, and I don't love it. Isn't it more typical in Python to have a single (non-Optional) parameter of type
Union[Callable, list[inspect.Parameter]]
? To me that makes the intended semantics much clearer.
I absolutely hate unions, especially in this case, where they really are two completely different things (which to me indicates that they should not be called the same thing). Also, doing that will/would make typechecking useless.
tests/test_function.py
line 246 at r1 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
We should definitely add the type hint. I kinda need to wrap my head around this too though.
I think of
@udf
as an operator that "lifts" a Python function, which acts on values, to a Pixeltable Function, which acts on columns of values.Is it correct to think of
@query
as doing the same thing but with an additionalTable
parameter: the Python function acts on (Table, values) and the lifted Function acts on (Table, columns)?
Let's discuss verbally.
making Table's data members private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 35 files reviewed, 9 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)
pixeltable/dataframe.py
line 453 at r8 (raw file):
where_clause = copy.deepcopy(self.where_clause) group_by_clause = copy.deepcopy(self.group_by_clause) order_by_exprs = copy.deepcopy(
Two random stylistic comments:
[copy.deepcopy(order_by_expr) for order_by_expr, _ in self.order_by_clause]
is a little clearer to me, with the advantage that we're not creating a list just to copy it. And,
<expression> if self.order_by_clause is not None else None
can be shortened to
<expression> if self.order_by_clause else None
(I'm assuming Expr instances are always truthy.) I've started preferring the latter, though I'm not aware of an established convention for this.
pixeltable/catalog/table.py
line 76 at r6 (raw file):
Previously, mkornacker (Marcel Kornacker) wrote…
That's in getattr() (I hate that it collapses lines w/o changes; if it didn't, you'd have seen getattr() directly above it).
I understand, but doesn't __getitem__
need it too? I'm assuming that t['func_name']
can resolve a query function just like t.func_name
; if not, I think it should (otherwise the lack of parity between them is a little confusing)
pixeltable/catalog/table.py
line 764 at r6 (raw file):
Previously, mkornacker (Marcel Kornacker) wrote…
Done.
lgtm. And ah, I guess the lambda is ok because the type hint is on the overloaded method, so we don't need another type hint here.
pixeltable/catalog/table.py
line 70 at r8 (raw file):
self, name: str ) -> Union['pixeltable.exprs.ColumnRef', 'pixeltable.func.QueryTemplateFunction']: """Return a ColumnRef for the given column name.
docstring should be updated since this can return more than a ColumnRef
now
pixeltable/metadata/converters/convert_15.py
line 16 at r8 (raw file):
def convert_15(engine: sql.engine.Engine) -> None:
Also add some @query
instances to create_test_db_dump
. (test_migration
also tests loading the v15 dump from scratch, in addition to migrating older dumps, so we get validation of the @query
metadata write/read loop)
pixeltable/metadata/converters/convert_15.py
line 34 at r8 (raw file):
col_type = ts.ColumnType.from_dict(col_type_dict) if col_type_dict is not None else None default = py_params[name].default kind = inspect._ParameterKind(kind_int) # is there a way to avoid referecing a private type?
there doesn't appear to be, other than maybe?
KINDS = [POSITIONAL_ONLY, POSITIONAL_OR_KEYWORD, VAR_POSITIONAL, KEYWORD_ONLY, VAR_KEYWORD]
# later:
kind = KINDS[kind_int]
I don't think construction of inspect.Parameter
s is meant to be a public facing operation; the intent is they only show up as the output of inspect.signature
and so on. It might be risky to regard the _ParameterKind
enum as anything other than an implementation detail of the inspect.Parameter.POSITIONAL_ONLY
(etc) constants.
(I do see that this is only being used in the converter & scrupulously avoided elsewhere)
tests/test_function.py
line 246 at r1 (raw file):
Previously, mkornacker (Marcel Kornacker) wrote…
Let's discuss verbally.
My previous comment is obsolete; the @query
pattern has changed a lot since then, and I'm happy with the outcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 35 files reviewed, 9 unresolved discussions (waiting on @aaron-siegel, @orm011, and @udf)
pixeltable/dataframe.py
line 453 at r8 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
Two random stylistic comments:
[copy.deepcopy(order_by_expr) for order_by_expr, _ in self.order_by_clause]
is a little clearer to me, with the advantage that we're not creating a list just to copy it. And,
<expression> if self.order_by_clause is not None else None
can be shortened to
<expression> if self.order_by_clause else None
(I'm assuming Expr instances are always truthy.) I've started preferring the latter, though I'm not aware of an established convention for this.
Done.
I left the explicit comparison to None in place, though - the Google style guide wants that.
pixeltable/catalog/table.py
line 76 at r6 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
I understand, but doesn't
__getitem__
need it too? I'm assuming thatt['func_name']
can resolve a query function just liket.func_name
; if not, I think it should (otherwise the lack of parity between them is a little confusing)
Can we leave this out of this? It's unclear whether we want to keep getitem for anything other than column references, because you can have non-id column names. But that doesn't apply to queries.
pixeltable/catalog/table.py
line 70 at r8 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
docstring should be updated since this can return more than a
ColumnRef
now
Done.
pixeltable/metadata/converters/convert_15.py
line 16 at r8 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
Also add some
@query
instances tocreate_test_db_dump
. (test_migration
also tests loading the v15 dump from scratch, in addition to migrating older dumps, so we get validation of the@query
metadata write/read loop)
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 35 files reviewed, 9 unresolved discussions (waiting on @mkornacker, @orm011, and @udf)
pixeltable/catalog/table.py
line 76 at r6 (raw file):
Previously, mkornacker (Marcel Kornacker) wrote…
Can we leave this out of this? It's unclear whether we want to keep getitem for anything other than column references, because you can have non-id column names. But that doesn't apply to queries.
Sure, we don't have to resolve it in this PR. But I tend to think it'll confuse ppl if t.my_symbol
and t['my_symbol']
behave similarly, but not the same. The vast majority of users' exposure to these patterns will involve dereferencing columns, and as a result they'll expect them to be synonyms, and likely be surprised to find that's only the case for columns and not queries. (I'd be surprised.)
pixeltable/tool/create_test_db_dump.py
line 252 at r9 (raw file):
#return t.where(t.c2 < i) return t.where(t.c2 < i).select(t.c1, t.c2) add_column('query_output', t.q1(t.c2))
it'd be good to have an instance that uses similarity
too, since SimilarityExpr
can't (currently) be assigned directly to a column, and therefore isn't yet covered by the db dumps (that is, I believe the only way to get a SimilarityExpr
into the dict is through a @query
, but pls correct me if that's wrong).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 35 files reviewed, 9 unresolved discussions (waiting on @aaron-siegel, @orm011, and @udf)
pixeltable/catalog/table.py
line 76 at r6 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
Sure, we don't have to resolve it in this PR. But I tend to think it'll confuse ppl if
t.my_symbol
andt['my_symbol']
behave similarly, but not the same. The vast majority of users' exposure to these patterns will involve dereferencing columns, and as a result they'll expect them to be synonyms, and likely be surprised to find that's only the case for columns and not queries. (I'd be surprised.)
Done.
pixeltable/tool/create_test_db_dump.py
line 252 at r9 (raw file):
Previously, aaron-siegel (Aaron Siegel) wrote…
it'd be good to have an instance that uses
similarity
too, sinceSimilarityExpr
can't (currently) be assigned directly to a column, and therefore isn't yet covered by the db dumps (that is, I believe the only way to get aSimilarityExpr
into the dict is through a@query
, but pls correct me if that's wrong).
Done.
No description provided.