Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST/ENH: more robust handling of column names in GeoParquet bbox covering support #3318

Merged
merged 12 commits into from
Jun 3, 2024

Conversation

nicholas-ys-tan
Copy link
Contributor

Resolves #3308

This is a continuation of #3282 to address 3 items:

- ensure this is robust for the geometry column name (no hardcoded "geometry")
The was an implicit hardcoding that occurred when checking of the covering encoding was in the metadata, and when seeking the bbox encoding field name. My approach assumes that there will only be one entry for the geo_metadata columns, which I assume to be a valid assumption as there can only be one active geometry.

- test reading a file that uses a different column name as "bbox"
The code as it is addresses this by looking at the metadata to discover the field name. To facilitate testing of this, some refactoring was done to add the field name as a kwarg in the private functions. This still maintains "bbox" as the only allowable fieldname when writing, but this refactor facilitated writing the tests. Hope that is acceptable.

- test writing in case your geodataframe already has a bbox column
When testing, no error showed when writing the parquet file, but an error was raised when reading - it was not a very descriptive error but seems to be attributed to two fields having identical names. A ValueError has been added if the dataframe already has a column with the name "bbox" and the user has write_bbox_covering=True.

One thing that I wanted to confirm is that we don't want the user to be able to specify their own bbox column? So they can put in their own custom bounds that they calculate or modify themselves for whatever reason - in which case I should be parsing their bbox column to ensure formatting is appropriate and pushing that in, instead of calculating it for the user?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look at this!

geopandas/io/arrow.py Outdated Show resolved Hide resolved
"xmin": [bbox_encoding_fieldname, "xmin"],
"ymin": [bbox_encoding_fieldname, "ymin"],
"xmax": [bbox_encoding_fieldname, "xmax"],
"ymax": [bbox_encoding_fieldname, "ymax"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you use this for testing, but I think my preference would be to make the test code a bit longer, while keeping it here simpler (only what is needed for the actual code)

Or at least give it a try to see if it would not be too difficult to do this in the tests. I assume you could do something similar as now in the tests, but after creating the table, rename the bbox column, get the metadata and edit this, and replace the metadata.

Something roughly like:

    table = _geopandas_to_arrow(
        df,
        schema_version="1.1.0",
        write_covering_bbox=True,
    )
    table = table.rename_columns([...])  # this needs to be hardcoded list of names, so maybe easiest to use dummy dataframe instead of naturalearth
    metadata = table.schema.metadata
    geo_metadata =  json.loads(metadata[b"geo"])
    # edit metadata ...
    metadata[b"geo"] = ...
    table = table.replace_schema_metadata(metadata)
    pq.write_table(table, filename)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that steer, that direction is very helpful to get around the hurdle. I have updated the test accordingly and reverted changes to adding the kwarg

geopandas/io/arrow.py Outdated Show resolved Hide resolved
df = df.assign(bbox=[0] * len(df))
filename = os.path.join(str(tmpdir), "test.pq")

with pytest.raises(ValueError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a match=.. here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@jorisvandenbossche
Copy link
Member

One thing that I wanted to confirm is that we don't want the user to be able to specify their own bbox column? So they can put in their own custom bounds that they calculate or modify themselves for whatever reason

It's true that you might already have this data available, but personally I would leave that until later in case there is actually user request for this. In general calculating bounds is not that expensive.

@jorisvandenbossche
Copy link
Member

In general calculating bounds is not that expensive.

I did a quick test using the nz-building-outlines.gpkg file (30 million polygons, 1.2 GB gpkg file), and writing that file with write_covering_bbox=True enabled. For this case, calculating the bounds for the bbox column takes around 2% of the to_parquet time.

@nicholas-ys-tan nicholas-ys-tan marked this pull request as ready for review June 1, 2024 12:36
@jorisvandenbossche jorisvandenbossche changed the title TST/ENH: Continuation of #3282, add testing coverage and fixes TST/ENH: more robust handling of column names in GeoParquet bbox covering support Jun 3, 2024
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good now! Will just do a merge of main to get CI green (and parametrized the existing basic read test with the geometry column name)

@jorisvandenbossche jorisvandenbossche merged commit d712529 into geopandas:main Jun 3, 2024
19 of 20 checks passed
@jorisvandenbossche
Copy link
Member

Thanks @nicholas-ys-tan!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TST: increase testing coverage of read_parquet for alternate covering bounding box field names
2 participants