TST/ENH: more robust handling of column names in GeoParquet bbox covering support #3318

nicholas-ys-tan · 2024-06-01T08:22:36Z

Resolves #3308

This is a continuation of #3282 to address 3 items:

- ensure this is robust for the geometry column name (no hardcoded "geometry")
There was an implicit hardcoding in two places: when checking whether the covering encoding was in the metadata, and when looking up the bbox encoding field name. My approach assumes that there will only be one entry in the geo_metadata columns, which I take to be a valid assumption as there can only be one active geometry.

- test reading a file that uses a different column name as "bbox"
The code as it stands handles this by looking at the metadata to discover the field name. To make this testable, I refactored the private functions to accept the field name as a kwarg. "bbox" remains the only allowable field name when writing; the refactor only exists to make the tests easier to write. Hope that is acceptable.

- test writing in case your geodataframe already has a bbox column
When testing, no error was raised when writing the parquet file, but an error was raised when reading. The error was not very descriptive, but it seems to be caused by two fields having identical names. A ValueError is now raised if the dataframe already has a column named "bbox" and the user passes write_covering_bbox=True.
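To illustrate the first two items, here is a minimal sketch of discovering the bbox covering field name from the "geo" metadata without hardcoding either "geometry" or "bbox". It is not the code in this PR: the helper name `find_bbox_covering_field` is hypothetical, and it looks up the geometry column through the `primary_column` key rather than assuming a single entry in `columns`.

```python
from typing import Optional


def find_bbox_covering_field(geo_metadata: dict) -> Optional[str]:
    """Return the name of the bbox covering column, or None if there is none.

    Hypothetical helper for illustration; not part of geopandas.
    """
    # "primary_column" names the active geometry column in the geo metadata.
    geometry_name = geo_metadata["primary_column"]
    column_meta = geo_metadata["columns"][geometry_name]
    bbox = column_meta.get("covering", {}).get("bbox")
    if bbox is None:
        return None
    # Each entry is a path such as ["bbox", "xmin"]; the first element
    # is the name of the struct column that holds the bounds.
    return bbox["xmin"][0]


# Example metadata with a geometry column called "geom" and a covering
# column called "bounds" (both names are made up for this sketch).
example_metadata = {
    "version": "1.1.0",
    "primary_column": "geom",
    "columns": {
        "geom": {
            "encoding": "WKB",
            "geometry_types": [],
            "covering": {
                "bbox": {
                    "xmin": ["bounds", "xmin"],
                    "ymin": ["bounds", "ymin"],
                    "xmax": ["bounds", "xmax"],
                    "ymax": ["bounds", "ymax"],
                }
            },
        }
    },
}

field_name = find_bbox_covering_field(example_metadata)
```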

One thing that I wanted to confirm is that we don't want the user to be able to specify their own bbox column? So they can put in their own custom bounds that they calculate or modify themselves for whatever reason - in which case I should be parsing their bbox column to ensure formatting is appropriate and pushing that in, instead of calculating it for the user?

jorisvandenbossche

Thanks for taking a look at this!

jorisvandenbossche · 2024-06-01T09:15:48Z

geopandas/io/arrow.py

+ "xmin": [bbox_encoding_fieldname, "xmin"],
+ "ymin": [bbox_encoding_fieldname, "ymin"],
+ "xmax": [bbox_encoding_fieldname, "xmax"],
+ "ymax": [bbox_encoding_fieldname, "ymax"],


I know you use this for testing, but I think my preference would be to make the test code a bit longer, while keeping it here simpler (only what is needed for the actual code)

Or at least give it a try to see if it would not be too difficult to do this in the tests. I assume you could do something similar as now in the tests, but after creating the table, rename the bbox column, get the metadata and edit this, and replace the metadata.

Something roughly like:

table = _geopandas_to_arrow(
    df,
    schema_version="1.1.0",
    write_covering_bbox=True,
)
table = table.rename_columns([...])  # this needs to be hardcoded list of names, so maybe easiest to use dummy dataframe instead of naturalearth
metadata = table.schema.metadata
geo_metadata = json.loads(metadata[b"geo"])
# edit metadata
...
metadata[b"geo"] = ...
table = table.replace_schema_metadata(metadata)
pq.write_table(table, filename)

Thanks for that steer, that direction is very helpful to get around the hurdle. I have updated the test accordingly and reverted the changes that added the kwarg.
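For reference, the metadata-editing step suggested in this thread could look roughly like the sketch below. It is not the test that was committed: the helper name `rename_bbox_covering` and the column name "custom_bbox" are made up, and the schema metadata is modelled as a plain dict of bytes keys and values, which is the shape `table.schema.metadata` has in pyarrow.

```python
import json


def rename_bbox_covering(schema_metadata: dict, new_name: str) -> dict:
    """Return a copy of the metadata whose bbox covering paths use new_name.

    Hypothetical helper for illustration; not part of geopandas.
    """
    geo = json.loads(schema_metadata[b"geo"])
    for column_meta in geo["columns"].values():
        bbox = column_meta.get("covering", {}).get("bbox")
        if bbox is None:
            continue
        # Rewrite ["bbox", "xmin"] to [new_name, "xmin"], and so on.
        for key, path in bbox.items():
            bbox[key] = [new_name, path[-1]]
    updated = dict(schema_metadata)
    updated[b"geo"] = json.dumps(geo).encode("utf-8")
    return updated


# Stand-in for table.schema.metadata on a table written with a bbox covering.
original = {
    b"geo": json.dumps(
        {
            "version": "1.1.0",
            "primary_column": "geometry",
            "columns": {
                "geometry": {
                    "encoding": "WKB",
                    "geometry_types": [],
                    "covering": {
                        "bbox": {
                            "xmin": ["bbox", "xmin"],
                            "ymin": ["bbox", "ymin"],
                            "xmax": ["bbox", "xmax"],
                            "ymax": ["bbox", "ymax"],
                        }
                    },
                }
            },
        }
    ).encode("utf-8")
}

edited = rename_bbox_covering(original, "custom_bbox")
```

In a test, the edited dict would then be passed to `table.replace_schema_metadata(...)` after the struct column itself has been renamed to match.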

geopandas/io/arrow.py

jorisvandenbossche · 2024-06-01T09:18:29Z

geopandas/io/tests/test_arrow.py

+ df = df.assign(bbox=[0] * len(df))
+ filename = os.path.join(str(tmpdir), "test.pq")
+
+ with pytest.raises(ValueError):


Can you add a match=.. here?
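As a sketch of what `match=` checks: `pytest.raises(ValueError, match=pattern)` applies `re.search(pattern, str(exc))` to the raised exception. The guard function and error message below are hypothetical stand-ins, not the wording used in geopandas, and the check is written with the standard library only so that it runs without pytest.

```python
import re


def check_bbox_column_free(column_names, write_covering_bbox):
    """Raise if a 'bbox' column would clash with the covering column.

    Hypothetical guard for illustration; the message text is an assumption.
    """
    if write_covering_bbox and "bbox" in column_names:
        raise ValueError(
            "The dataframe already has a column named 'bbox'; "
            "rename it before writing with write_covering_bbox=True."
        )


# Equivalent of:
#     with pytest.raises(ValueError, match="already has a column named 'bbox'"):
#         check_bbox_column_free(["name", "bbox", "geometry"], True)
pattern = "already has a column named 'bbox'"
try:
    check_bbox_column_free(["name", "bbox", "geometry"], True)
except ValueError as err:
    matched = re.search(pattern, str(err)) is not None
else:
    matched = False
```

Matching on a fragment of the message keeps the test from passing on an unrelated ValueError raised elsewhere in the write path.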

jorisvandenbossche · 2024-06-01T09:24:17Z

One thing that I wanted to confirm is that we don't want the user to be able to specify their own bbox column? So they can put in their own custom bounds that they calculate or modify themselves for whatever reason

It's true that you might already have this data available, but personally I would leave that until later in case there is actually user request for this. In general calculating bounds is not that expensive.

jorisvandenbossche · 2024-06-01T09:27:29Z

In general calculating bounds is not that expensive.

I did a quick test using the nz-building-outlines.gpkg file (30 million polygons, 1.2 GB gpkg file), and writing that file with write_covering_bbox=True enabled. For this case, calculating the bounds for the bbox column takes around 2% of the to_parquet time.

Co-authored-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche

Thanks, looks good now! Will just do a merge of main to get CI green (and parametrized the existing basic read test with the geometry column name)

jorisvandenbossche · 2024-06-03T10:19:57Z

Thanks @nicholas-ys-tan!

nicholas-ys-tan added 4 commits June 1, 2024 17:35

TST: add testing coverage for custom bbox encoding field names

be40557

ENH: fix requirement of 'geometry' field name to filter parquet by bbox

948229c

ENH: disallow write_covering_bbox if 'bbox' column name is already used

2e302a6

remove redundant import

ee1989f

jorisvandenbossche reviewed Jun 1, 2024

nicholas-ys-tan and others added 4 commits June 1, 2024 21:17

fix to allow alternate geometry column names

97e30ee

Co-authored-by: Joris Van den Bossche <[email protected]>

formatting of error message

2c776e4

Co-authored-by: Joris Van den Bossche <[email protected]>

TST: rewrite test for custom bbox encoding field name

ddfa228

Merge branch 'main' into issue3308

b3cf1a3

nicholas-ys-tan marked this pull request as ready for review June 1, 2024 12:36

update get bbox encoding field name method

7995558

jorisvandenbossche changed the title ~~TST/ENH: Continuation of #3282, add testing coverage and fixes~~ TST/ENH: more robust handling of column names in GeoParquet bbox covering support Jun 3, 2024

jorisvandenbossche added 2 commits June 3, 2024 11:32

Merge remote-tracking branch 'upstream/main' into issue3308

6a42b84

parametrize test

1b952a3

jorisvandenbossche approved these changes Jun 3, 2024

fixup

fe6f2aa

jorisvandenbossche merged commit d712529 into geopandas:main Jun 3, 2024
19 of 20 checks passed
