-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST/ENH: more robust handling of column names in GeoParquet bbox covering support #3318
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look at this!
geopandas/io/arrow.py
Outdated
"xmin": [bbox_encoding_fieldname, "xmin"], | ||
"ymin": [bbox_encoding_fieldname, "ymin"], | ||
"xmax": [bbox_encoding_fieldname, "xmax"], | ||
"ymax": [bbox_encoding_fieldname, "ymax"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you use this for testing, but I think my preference would be to make the test code a bit longer, while keeping it here simpler (only what is needed for the actual code)
Or at least give it a try to see if it would not be too difficult to do this in the tests. I assume you could do something similar as now in the tests, but after creating the table, rename the bbox column, get the metadata and edit this, and replace the metadata.
Something roughly like:
table = _geopandas_to_arrow(
df,
schema_version="1.1.0",
write_covering_bbox=True,
)
table = table.rename_columns([...]) # this needs to be hardcoded list of names, so maybe easiest to use dummy dataframe instead of naturalearth
metadata = table.schema.metadata
geo_metadata = json.loads(metadata[b"geo"])
# edit metadata ...
metadata[b"geo"] = ...
table = table.replace_schema_metadata(metadata)
pq.write_table(table, filename)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for that steer, that direction is very helpful to get around the hurdle. I have updated the test accordingly and reverted changes to adding the kwarg
geopandas/io/tests/test_arrow.py
Outdated
df = df.assign(bbox=[0] * len(df)) | ||
filename = os.path.join(str(tmpdir), "test.pq") | ||
|
||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a match=..
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
It's true that you might already have this data available, but personally I would leave that until later in case there is actually user request for this. In general calculating bounds is not that expensive. |
I did a quick test using the |
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good now! Will just do a merge of main to get CI green (and parametrized the existing basic read test with the geometry column name)
Thanks @nicholas-ys-tan! |
Resolves #3308
This is a continuation of #3282 to address 3 items:
- ensure this is robust for the geometry column name (no hardcoded "geometry")
The was an implicit hardcoding that occurred when checking of the covering encoding was in the metadata, and when seeking the bbox encoding field name. My approach assumes that there will only be one entry for the geo_metadata columns, which I assume to be a valid assumption as there can only be one active geometry.
- test reading a file that uses a different column name as "bbox"
The code as it is addresses this by looking at the metadata to discover the field name. To facilitate testing of this, some refactoring was done to add the field name as a kwarg in the private functions. This still maintains "bbox" as the only allowable fieldname when writing, but this refactor facilitated writing the tests. Hope that is acceptable.
- test writing in case your geodataframe already has a bbox column
When testing, no error showed when writing the parquet file, but an error was raised when reading - it was not a very descriptive error but seems to be attributed to two fields having identical names. A ValueError has been added if the dataframe already has a column with the name "bbox" and the user has
write_bbox_covering=True
.One thing that I wanted to confirm is that we don't want the user to be able to specify their own
bbox
column? So they can put in their own custom bounds that they calculate or modify themselves for whatever reason - in which case I should be parsing theirbbox
column to ensure formatting is appropriate and pushing that in, instead of calculating it for the user?