Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle dates and timedeltas in aggregators and DescribeSheet #2380

Open
wants to merge 9 commits into
base: develop
Choose a base branch
from

Conversation

midichef
Copy link
Contributor

This PR is makes the current aggregators (mean, median, sum) work with dates and timedeltas. It's a continuation of the date-handling in #1996.

One notable change is that the stdev of a date is a timedelta, not a date. The benefits of this are, it formats more readably in the describe sheet (such as '200 days'). And it simplifies doing arithmetic with the result of stdev, like expressing another timedelta as a multiple of the stdev. The downside is that it breaks the assumption that an aggregator's result is the same type as the column being aggregated. That leads to some special case handling:

if agg.name == 'stdev' and (col.type is date):

And note that the stdev of a timedelta is still a timedelta.

In order to make the aggregators and the DescribeSheet handle date and timedeltas, I've had to change the type of several aggregators to anytype. Is this acceptable? Are there drawbacks?

I've submitted this as a series of small commits for easier review. If this all looks okay, let me know, I'll squash it to a smaller number of commits before deploying.

Copy link
Owner

@saulpw saulpw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together, @midichef. I've been looking at it the past few days and I have a few comments and questions. I think you have a keen understanding of what we would want the behavior to be, and I'd like to take this opportunity to think deeply and clean up some things around aggregators, so that we don't make them a lot messier to work with and reason about.

@@ -107,13 +108,48 @@ def _funcRows(col, rows): # wrap builtins so they can have a .type
def mean(vals):
vals = list(vals)
if vals:
return float(sum(vals))/len(vals)
if type(vals[0]) is date:
vals = [d.timestamp() for d in vals]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these special cases belong in the aggregators. The idea is that people can define their own aggregators with 'obvious' implementations and they should work reasonably. (and regardless, where these checks are, they should use isinstance with the Python stdlib datetime as you're doing below).

To that end, the visidata date and datedelta classes both already have __float__ methods which convert them to float as you're doing here. So maybe the aggregator callsite can call aggregator.intype on each value, and for these "more numeric" aggregators would set their intype to float. Then they would need to convert the float result back into some result type, which as you've noted below can be tricky for date formats.

return sum(vals, start=type(vals[0] if len(vals) else 0)()) #1996
if vals:
if type(vals[0]) is date:
vd.error('dates cannot be summed')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a vd.fail as it's more of a user error. But it should also be outside the aggregation function.

if vals and len(vals) >= 2:
if type(vals[0]) is date:
vals = [d.timestamp() for d in vals]
return datetime.timedelta(seconds=statistics.stdev(vals))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very interesting, I wonder how we can codify that the output type of stdev for both date and datedelta are datedelta. It's not quite as simple as making the result type datedelta and passing it the float from the generic aggregations, since Python's timedelta (which datedelta is a subclass of) has a constructor with a first parameter of 'days' instead of 'seconds'.

vd.aggregator('avg', mean, 'arithmetic mean of values', type=float)
vd.aggregator('mean', mean, 'arithmetic mean of values', type=float)
vd.aggregator('median', statistics.median, 'median of values')
vd.aggregator('min', min, 'minimum value', type=anytype)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are min/max forcibly set to anytype instead of letting them take the type of their source column?

I understand why avg/mean/median want to be more generic; maybe this is where we have both an 'input type' which remains float for these, and a 'result type' which is more variable.

for agg_choice in agg_choices:
agg = vd.aggregators.get(agg_choice)
if not agg: continue
aggs = agg if isinstance(agg, list) else [agg]
for agg in aggs:
aggval = agg(col, rows)
typedval = wrapply(agg.type or col.type, aggval)
dispval = col.format(typedval)
if agg.name == 'stdev' and (col.type is date):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to figure out how to not have a very special type check here.

@@ -44,10 +45,6 @@ class DescribeSheet(ColumnsSheet):
DescribeColumn('nulls', type=vlen),
DescribeColumn('distinct',type=vlen),
DescribeColumn('mode', type=str),
DescribeColumn('min', type=str),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any functional reason to moving these into describe_aggrs? I think this must be why min/max types were changed to anytype (and maybe they should not be str here in the original code), so maybe this is not necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants