Some aggregates do not handle common edge cases correctly #6999

Open
james-whiteside opened this issue Mar 12, 2024 · 0 comments

@james-whiteside
Member

Description

Some aggregates currently handle common edge cases incorrectly:

  • sum of zero elements returns NaN. Should return zero. This is particularly inconsistent as count of zero elements returns zero.
  • std of one element returns NaN. Should return zero. std of zero elements should return NaN.
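
The expected behaviour can be sketched in Python as follows. This is only an illustration of the semantics requested above, not TypeDB's implementation; the function names agg_count, agg_sum and agg_std are hypothetical, and agg_std assumes the population standard deviation.

```python
import math


def agg_count(values):
    """Number of elements; zero for an empty result set."""
    return len(values)


def agg_sum(values):
    """Sum of elements; the empty sum is the additive identity, zero."""
    return sum(values)


def agg_std(values):
    """Population standard deviation (assumed here); NaN for no elements."""
    n = len(values)
    if n == 0:
        # No elements means no mean, so the spread is undefined.
        return math.nan
    mean = sum(values) / n
    # Divide by n (population), not n - 1 (sample).
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)
```

Under these definitions, sum and count agree on an empty input (both give zero), std of a single element is zero, and only std of zero elements is NaN.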

Based on a comment from @flyingsilverfin, std may be using the sample rather than the population standard deviation. If so, it is the wrong statistic to apply here. Database queries necessarily operate under a closed-world assumption: when we ask for the standard deviation of a variable in query results, those results are the population, not a sample of some greater population (whatever that may be, if such a thing even exists in the context of the variable). Consider the following query.

match
$user isa user,
    has gender "male",
    has age $age;
get; std $age;

Are we asking for (a) the standard deviation of the ages of male users, or (b) the standard deviation of the ages of all human males? The answer is almost certainly (a), in which case the population is defined by the users that the query's constraints select (in this case, male users). The population standard deviation is therefore the applicable statistic. In fact, the sample standard deviation should only be applied when the statistic is used as a predictor, that is, as an estimate for some larger population, and the population standard deviation is far more useful in the majority of cases. Ideally, we should offer both, as most mature languages do.
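
To make the difference concrete, here is a small sketch using Python's standard statistics module, which offers both statistics as pstdev (population, divides by n) and stdev (sample, divides by n - 1). The ages are made-up values, not real query results. For a single element the sample formula divides by zero, which would be consistent with the NaN reported above if std does use the sample formula; that cause is not confirmed.

```python
import statistics

ages = [20, 30, 40]  # hypothetical ages matched by the query

population_std = statistics.pstdev(ages)  # divides by n
sample_std = statistics.stdev(ages)       # divides by n - 1

single = [35]  # a result set with exactly one matching user
single_population_std = statistics.pstdev(single)  # well defined: no spread

# The sample statistic needs at least two data points; Python raises
# StatisticsError here rather than returning NaN.
try:
    statistics.stdev(single)
    single_sample_defined = True
except statistics.StatisticsError:
    single_sample_defined = False
```

For three or more elements the two statistics differ only by a scaling factor, but for one element only the population statistic has a value.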

Environment

  1. TypeDB distribution: Core
  2. TypeDB version: 2.25.7
  3. Environment: macOS
  4. Client and version: Studio 2.25.0