Strategy for dealing with column names that overlap between tables #67

smmaurer · 2018-12-06T00:08:38Z

We should shave some strategies and policies for dealing with column names that exist in multiple tables -- both so we can advise users about what's expected, and so our code handles things the same way in different places. This is prompted by conversations with @mxndrwgrdnr and @janowicz.

Problem

The same column name can exist in multiple tables, with values that are either aligned (e.g. building_id) or totally different (e.g. area).

When tables with name conflicts are merged together, Pandas either raises an error or appends suffixes to the names, depending on how the merge/join is specified. This can cause UrbanSim to crash or potentially to generate ambiguous or incorrect results.

What are the use cases where duplicate column names are justified?

a. Primary/foreign keys: the buildings table has a building_id, and the households table also has a building_id because households are matched to buildings.

b. Metrics that exist at multiple scales: there could be a tract-level population and a zone-level population, and it might not be convenient to give the columns globally unique names.

c. Temporary data re-indexing: if zone-level attributes are needed for many different building-level calculations, it could be computationally efficient to copy them onto the buildings table temporarily, to avoid repeated merges.

Existing strategy

It seems like our existing strategy is to mildly discourage overlapping column names without doing much to explicitly check for them. Users need to work out the details if they cause problems.

Orca's table merging will either (1) silently append suffixes, or (2) silently drop all but the left-most occurrence of the duplicate column.

Proposed strategy

The ambiguity and flux of auto-dropping columns or changing column names in merged tables seems bad to me.

My proposal: In situations where columns are automatically assembled from multiple tables, output column names other than the join keys must be unique. I think this will help make simulations more reliable and help identify potential data problems more proactively.

Example 1: The model expression for a move-out model is move_choice ~ age + tenure, drawing data from the households and units tables. If tenure accidentally appears in both tables, the template will raise an error. If some other variable like building_id appears in both tables, it's no problem because this is not part of the model expression.

Example 2: Choosers (households) and alternatives (workplaces) both have a column named tract_id. If tract_id appears in the model expression, or as a join index for interaction terms, choicemodels will raise an error. Since we probably want to retain both id's in the merged data, a better practice is to store them with different names: household_tract_id and workplace_tract_id. This is annoying, but I think it's better than silently resolving name overlaps.

Other approaches to consider

Universally disallow duplicate column names except for indexes?

This seems too strict. It would prevent use cases (b) and (c) above.
Be more strict about how we specify column names, using some kind of table_name.column_name syntax in model expressions so that every column is unique?

I find this appealing: it's clear, unambiguous, and would align well with database syntax. But referring to column names on their own is pretty fundamental to the Orca API, so it might not be something we want to change. (I toyed with this syntax in orca_test, but ended up only using it for specifying foreign keys.)

Questions

Are there use cases where my proposed strategy will cause problems?

The text was updated successfully, but these errors were encountered:

mxndrwgrdnr · 2018-12-06T04:19:20Z

Although I thought I had solved my issue with the approach you've documented here as Example 2, it cropped up again, and I'm not sure that your proposed strategy handles this case.

As it pertains to generating MergedChoiceTable objects, the issue I am encountering is that often times the "observations" table and the "alternatives" table are both specified using orca.merge_tables() calls that include the same intermediate tables. For example, in the case of my WLCM, my choosers table is constructed from this list of tables: ['persons', 'households', 'units', 'buildings', 'parcels'], while my alternatives table is constructed from this one: ['jobs', 'buildings', 'parcels', 'nodeswalk', 'nodessmall']. So obviously there are a lot of columns in common between the two resulting tables, and going the route of Example 2, I don't think its really feasible to rename all of them. The approach in Example 1, though it works in theory, is insufficient here as well because obs and alts tables that get passed in the merged choice table constructor have not been filtered to include only columns specified in the model expression. Rather they contain the full tables generated by the original merges I describe above. You can see what I mean here.

I think there are two solutions to this use case: 1) update the large mnl code so that the obs and alts tables that get passed in to the merged choice table constructor are pared down to only the columns that appear in the model expression; or 2) replace the joins in these lines in the MergedChoiceTable code with merges. I totally understand the rationale for sticking with merges and not wanted to involuntarily create new and arbitrary column names, thats why I originally raised the issue with you. However, after thinking about it some more, I'm fairly certain that as long as we're only using merges within the MergedChoiceTable class itself, which is more or less a temporary table only ever stored in memory, there really shouldn't be any downstream effects. Moreover, none of the columns that are actually needed by the model would be effected by the auto-generation of suffixes by pd.merge() because those columns must necessarily only appear once in the merged choice table in order to be ingested properly by patsy.

I think we could go either way. What do you think?

smmaurer · 2018-12-06T18:36:25Z

@mxndrwgrdnr I like approach 1, where the data passed to the MCT constructor is pared down to only include necessary columns. We should be doing this anyway because it's more efficient. (I think i had put it off just because it was more complicated to code than the equivalent for OLS.) I can work on this today if it's something that's blocking your progress.

It also seems reasonable to add a parameter to the MCT constructor so that people who are using choicemodels directly can specify what they want to happen with overlapping columns: raise an error, drop some, or append suffixes. And we just wouldn't use that functionality from the UrbanSim side.

mxndrwgrdnr · 2018-12-06T19:20:31Z

That makes sense. And yeah, it should hopefully results in faster run times. Anyways, there's nothing currently blocking my progress because I have the joins replaced with merges in my local branch (I will just refrain from committing my changes and do a rebase once you've updated master) but it will block progress once we've assembled all these new simulation steps in the activitysynth repo and have multiple folks trying to run it. I make the changes to if you have more pressing stuff on your plate at the moment.

smmaurer · 2018-12-10T19:17:56Z

I've been working on some code to verify ex-ante that a list of columns can be drawn unambiguously from a list of auto-merged tables. Snapshot in a branch of utils.py.

It turns out to be hard to do this well within the current system of how Orca manages tables, columns, and join keys ("broadcasts"). Basically, we want to check whether column names are unique across a set of tables, making an exception for join keys that will collapse into a single column -- but NOT making an exception for keys that won't be used in merging this set of tables, since that's one of the problems we're trying to catch.

Challenge: index names

The first problem, which is annoying but surmountable, is that the Orca API treats index and non-index columns completely separately. You can't fetch index columns by name. To get ALL the column names, you need to specify DataFrameWrapper.columns + DataFrameWrapper.index.names. A broadcast involving an index is stored as cast_index=True instead of cast_on='colname'. So it takes a lot of extra code to write algorithms that treat all columns by name.

I think the reason Orca was designed this way is that the Pandas API used to do the same thing. But Pandas is moving beyond this -- v0.23 and later let you refer to index columns by name for tasks like join() and merge(). This enables much clearer code, especially when you have multi-indexes, and I'd advocate that we update Orca to refer to index columns by name too.

What matters in our broader data model is things like whether column names are unique, whether key values are unique, whether a primary-key-foreign-key relationship is valid. Pandas indexes don't affect the substantive output, they're just an implementation detail that speeds up certain operations -- so better to handle the distinction internally wherever possible.

Challenge: broadcast validation

A bigger problem -- at least for validating merges ex-ante -- is that any algorithm for checking column names relies on the broadcasts themselves being valid, and I don't think there's an easy way to confirm this before performing the merge. Orca won't stop you from specifying bad broadcasts, you'll just get a runtime error if they can't be resolved properly.

So it would be nice to implement broadcast validation too, but this probably requires changes to Orca -- it would be a bigger project and require a lot of testing.

Next steps

Here's my inclination:

For Orca auto-merges, we won't implement any additional validation for now. (This is cases like providing a list of estimation tables that will automatically be merged together.) This can be part of a bigger project updating Orca and improving the orca_test data validation utilities.
For MergedChoiceTable merges, we'll implement the fixes that @mxndrwgrdnr needs, most importantly paring down the tables to only include necessary columns before merging them.

smmaurer · 2018-12-12T18:18:52Z

Ok, we've taken care of the top-priority updates.

Templates PR #74 pares down chooser and alternative tables to only include necessary columns before merging them. ChoiceModels PR #54 adds a check to MergedChoiceTable so that users get a clearer error if there are duplicate column names.

I'm leaving this issue open for general discussion, and because there are other updates we might want to make in the future: MergedChoiceTable support for dropping or renaming columns in the case of name conflicts, broadcast validation before runtime, etc.

smmaurer · 2018-12-13T22:07:52Z

Two notes:

This comment from a separate issue proposes a data management strategy for models where we need a chooser copy and an alternative copy of the same variable: Integrate changes from ej-test branch #75 (comment)
Realized that the way choicemodels.MergedChoiceTable handles interaction terms is a nice illustration of the advantages of stricter naming rules. Interaction terms are required to have two indexes, one whose name matches a column of the choosers table and the other whose name matches a column of the alternatives table.

This means we don't need any kind of "broadcast" defining the merge relationship -- it's right in the column names. It would be pretty easy to do this for regular table merges as well. The vast majority of broadcasts already follow this pattern anyway -- index matched to a column that shares its name. mergedchoicetable.py#L84-L89

This was referenced Dec 11, 2018

[0.1.dev21] Beter data loading utilities #74

Merged

[0.2.dev7] MergedChoiceTable check for duplicate column names UDST/choicemodels#54

Merged

smmaurer mentioned this issue Dec 14, 2018

Naming rules for join keys #78

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strategy for dealing with column names that overlap between tables #67

Strategy for dealing with column names that overlap between tables #67

smmaurer commented Dec 6, 2018 •

edited

Loading

mxndrwgrdnr commented Dec 6, 2018

smmaurer commented Dec 6, 2018

mxndrwgrdnr commented Dec 6, 2018

smmaurer commented Dec 10, 2018 •

edited

Loading

smmaurer commented Dec 12, 2018

smmaurer commented Dec 13, 2018

Strategy for dealing with column names that overlap between tables #67

Strategy for dealing with column names that overlap between tables #67

Comments

smmaurer commented Dec 6, 2018 • edited Loading

Problem

What are the use cases where duplicate column names are justified?

Existing strategy

Proposed strategy

Other approaches to consider

Questions

mxndrwgrdnr commented Dec 6, 2018

smmaurer commented Dec 6, 2018

mxndrwgrdnr commented Dec 6, 2018

smmaurer commented Dec 10, 2018 • edited Loading

Challenge: index names

Challenge: broadcast validation

Next steps

smmaurer commented Dec 12, 2018

smmaurer commented Dec 13, 2018

smmaurer commented Dec 6, 2018 •

edited

Loading

smmaurer commented Dec 10, 2018 •

edited

Loading