Naming rules for join keys #78

Open
smmaurer opened this issue Dec 14, 2018 · 0 comments

smmaurer commented Dec 14, 2018

It occurred to me that if we follow some simple naming rules for join keys, we can substantially improve usability and data validation. cc @janowicz @mxndrwgrdnr

This idea is related to issue #67 in that both are about column names, but the two are otherwise pretty separate.

Rules

  1. Each table has a primary key/index of one or more columns (already true)
  2. Foreign keys have the same name as the primary key they're associated with (already true 95% of the time)
  3. Columns cannot have the same name as another table's primary key unless they're meant to be associated with it (hopefully already true)
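As a rough illustration of how join relationships could be read off the column names under these rules, here is a minimal pandas sketch. The `infer_joins` helper and the sample tables are hypothetical and not part of the templates or Orca; the sketch assumes each table's primary key is stored as a named DataFrame index.

```python
import pandas as pd

def infer_joins(tables):
    """Return (child, parent, key_columns) for each foreign key implied by naming.

    Assumes each table's primary key is its (named) index, and that a column
    sharing a name with another table's primary key is a foreign key (rules 2-3).
    """
    joins = []
    for parent_name, parent in tables.items():
        key = [n for n in parent.index.names if n is not None]
        if not key:
            continue  # table has no named primary key, nothing to infer
        for child_name, child in tables.items():
            if child_name == parent_name:
                continue
            if all(k in child.columns for k in key):
                joins.append((child_name, parent_name, key))
    return joins

# Hypothetical sample tables
parcels = pd.DataFrame({'zone_id': [10, 10, 20]},
                       index=pd.Index([1, 2, 3], name='parcel_id'))
buildings = pd.DataFrame({'parcel_id': [1, 1, 3], 'sqft': [900, 1200, 2000]},
                         index=pd.Index([100, 101, 102], name='building_id'))
zones = pd.DataFrame({'area': [5.0, 7.5]},
                     index=pd.Index([10, 20], name='zone_id'))

tables = {'parcels': parcels, 'buildings': buildings, 'zones': zones}
joins = infer_joins(tables)
```

Here `joins` lists buildings -> parcels via `parcel_id` and parcels -> zones via `zone_id`, with no explicit relationship definitions.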

Advantages

If we follow these rules, we don't need "broadcasts". Join relationships are known in advance from the column names. This is easier for users and avoids bugs associated with bad broadcast definitions.

It also allows us to validate table relationships at any time. I've been reluctant to validate broadcasts this way, because sometimes they're provided in advance but not meant to be used until later in a simulation when source tables are present.
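A minimal sketch of what that validation might look like, assuming primary keys are stored as DataFrame indexes. The `orphaned_keys` and `validate` helpers are hypothetical names for illustration only. The check simply returns `None` when one of the tables isn't loaded yet, so it can be called at any point in a simulation.

```python
import pandas as pd

def orphaned_keys(child, parent, key):
    """Return sorted foreign key values in child[key] that are missing from parent's index."""
    values = child[key].dropna()
    missing = values[~values.isin(parent.index)]
    return sorted(missing.unique().tolist())

def validate(tables, child_name, parent_name, key):
    """Return orphaned key values, or None if either table is not present yet."""
    if child_name not in tables or parent_name not in tables:
        return None
    return orphaned_keys(tables[child_name], tables[parent_name], key)

# Hypothetical sample data: building 101 points at a parcel that doesn't exist
parcels = pd.DataFrame({'zone_id': [10, 10, 20]},
                       index=pd.Index([1, 2, 3], name='parcel_id'))
buildings = pd.DataFrame({'parcel_id': [1, 4, 3]},
                         index=pd.Index([100, 101, 102], name='building_id'))

tables = {'parcels': parcels, 'buildings': buildings}
bad = validate(tables, 'buildings', 'parcels', 'parcel_id')
not_loaded = validate(tables, 'parcels', 'zones', 'zone_id')
```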

Tricky cases

These rules should work fine for multi-column keys, which is a nice bonus because Orca broadcasts don't support them. (ChoiceModels implements interaction term merges this way.)
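For illustration, a multi-column merge driven purely by index names could look like the sketch below. It uses pandas `DataFrame.join` with a list passed to `on`, which matches those columns against the levels of the other table's MultiIndex. The table and column names are made up, and this is not the actual ChoiceModels implementation.

```python
import pandas as pd

def merge_on_primary_key(child, parent):
    """Left-join parent onto child using the parent's index names as the join key."""
    key = list(parent.index.names)
    return child.join(parent, on=key, how='left')

# Hypothetical interaction-term table keyed on two columns
travel = pd.DataFrame(
    {'travel_time': [5.0, 12.0, 12.5, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1), (1, 2), (2, 1), (2, 2)], names=['origin_id', 'dest_id']))

trips = pd.DataFrame({'origin_id': [1, 2, 2], 'dest_id': [2, 2, 1]},
                     index=pd.Index([7, 8, 9], name='trip_id'))

merged = merge_on_primary_key(trips, travel)
```

The child table's own index (`trip_id`) is preserved, and the same helper works for single-column keys.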

Sometimes two tables share the same primary key, with one containing a subset of the IDs (e.g. a master list of nodes and a smaller list representing a transit network). I don't see any problems supporting this as long as we're expecting it.
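One way to handle the shared-key case, sketched under the assumption that primary keys are DataFrame indexes: detect that two tables have identical index names, confirm the subset relationship, and join index-to-index. The helper names and sample tables are hypothetical.

```python
import pandas as pd

def shares_primary_key(a, b):
    """True if both tables use the same index name(s) as their primary key."""
    return list(a.index.names) == list(b.index.names)

def is_key_subset(subset, master):
    """True if every primary key value in `subset` also appears in `master`."""
    return bool(subset.index.isin(master.index).all())

# Hypothetical master node list and a smaller transit-network node list
nodes = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0]},
                     index=pd.Index([1, 2, 3, 4], name='node_id'))
transit_nodes = pd.DataFrame({'stop_name': ['A', 'B']},
                             index=pd.Index([2, 4], name='node_id'))

if shares_primary_key(transit_nodes, nodes) and is_key_subset(transit_nodes, nodes):
    # Index-to-index join: pull master attributes onto the subset table
    merged = transit_nodes.join(nodes, how='left')
```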

I only see one place in the current cloud platform data spec that violates these rules: building parcel_id maps to parcel primary_id.

Implementation

It would be helpful to implement support for auto-specified merges at the same time as the data loading (issue #66). Two possible approaches:

a. Templates automatically generate Orca broadcasts? I suspect this would be tricky, because Orca doesn't allow over-determined broadcasts. (If a is linked to b and c, and b is also linked to c, you can't orca-merge the three of them. Not sure if this is a bug or intentional.)

b. Templates first try an Orca merge, and if the broadcasts aren't there they fall back to their own merge logic. Once it's working smoothly we can add it to Orca.
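To make option (b) more concrete, here is a rough sketch of what the fallback merge logic might look like using only pandas. The Orca attempt itself is left out, since how a missing broadcast would be detected isn't settled here. `merge_by_key_names` is a hypothetical helper; it assumes primary keys are named indexes and repeatedly joins any table whose key columns are already present in the result.

```python
import pandas as pd

def merge_by_key_names(target, others):
    """Merge tables onto `target`, following join keys implied by column names.

    `others` maps table names to DataFrames whose index is their primary key.
    Columns already present in the result are skipped, so a table reachable
    by more than one path (a->b, a->c, b->c) is not a problem in this sketch.
    """
    merged = target.copy()
    pending = dict(others)
    progress = True
    while pending and progress:
        progress = False
        for name in list(pending):
            table = pending[name]
            key = list(table.index.names)
            if None not in key and all(k in merged.columns for k in key):
                new_cols = [c for c in table.columns if c not in merged.columns]
                merged = merged.join(table[new_cols], on=key)
                del pending[name]
                progress = True
    return merged

# Hypothetical sample tables: buildings -> parcels -> zones
parcels = pd.DataFrame({'zone_id': [10, 10, 20]},
                       index=pd.Index([1, 2, 3], name='parcel_id'))
buildings = pd.DataFrame({'parcel_id': [1, 1, 3], 'sqft': [900, 1200, 2000]},
                         index=pd.Index([100, 101, 102], name='building_id'))
zones = pd.DataFrame({'area': [5.0, 7.5]},
                     index=pd.Index([10, 20], name='zone_id'))

merged = merge_by_key_names(buildings, {'zones': zones, 'parcels': parcels})
```

Tables that can't be linked to the target are silently left out in this sketch; a real implementation would probably want to report them.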

Diagram

(Image: column-names)
