Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: register GeoSeries as the pandas.Series.geo accessor #3272

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

tswast
Copy link

@tswast tswast commented Apr 27, 2024

This allows geopandas to be used like other extension dtypes.

import pandas
import shapely
import geopandas

series = pandas.Series([shapely.Point(1, 2), shapely.Point(3, 4), shapely.Point(5, 6)], dtype=geopandas.array.GeometryDtype())

# Edit: Removed DataFrame accessor changes from this PR.

series.geo.x
# Out[5]: 
# 0    1.0
# 1    3.0
# 2    5.0
# dtype: float64

TODO

  • Tests
  • Docs

Fixes #166
Related #680

@tswast tswast marked this pull request as ready for review April 27, 2024 17:58
@@ -75,6 +76,7 @@ def _geoseries_expanddim(data=None, *args, **kwargs):
return _expanddim_logic(df)


@pandas.api.extensions.register_series_accessor("geo")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should be registering GeoSeries as the accessor, but an accessor class (perhaps derived from GeoSeries so that all the methods exist). This seems a bit counter to the extension registration documentation above. Implemented like this, we don't conform to the documented expectation:

For consistency with pandas methods, you should raise an AttributeError if the data passed to your accessor has an incorrect dtype.

I think we should be doing this.

Also, with GeoSeries as the accessor, this would automatically coerce object dtypes, e.g.

s = pd.Series([Point(x, x) for x in range(3)])
x = s.geo.x # this works, even though s is not GeometryDtype

I would expect the accessor should only work for GeometryDtype Series.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I can look into providing a separate accessor class that delegates to GeoSeries and raises AttributeError for incorrect dtype.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 42ea6c5

@m-richards
Copy link
Member

Hi @tswast, thanks for putting this PR forward. Would you be able to share a little bit about the use case / goal you have in mind for this? I'd just like to understand this a little bit better before reviewing. The issues you have linked to are quite old, and have relatively little commentary from current maintainers of geopandas, so it'd be good to discuss what the aim is a bit before finessing the PR.

You may or may not have seen there is also #1947 and the corresponding #1952 PR which are related to this, but got a bit stalled, in part because they didn't have a super clear statement of intent.

Now I can try infer this from the documentation notebook you've added, but it would be good to hear directly from you, rather than me guessing.

@tswast
Copy link
Author

tswast commented Apr 28, 2024

#1947 is kinda the opposite of what I'm doing here. Those are about copying the implementation of the register accessor API from pandas to geopandas so folks can register geopandas-specific accessors.

This PR registers geopandas as a true extension dtype in pandas, making it possible to use geopandas consistently with other extension dtypes. In this PR, I am adding a "geo" accessor to pandas.Series and pandas.DataFrame, as proposed in #166. This builds on the work that has already been done here to make the Geometry extension dtype and extension array.

To be fully transparent, my primary project at work is BigQuery DataFrames. We currently return pandas objects with the existing geopandas extension dtype when folks download a table that contains a BigQuery GEOGRAPHY column.

My goal here is to make things consistent so that those downloaded objects can be much more useful and that if/when BigQuery DataFrames starts adding methods for working with those GEOGRAPHY columns we can do so without forcing users to convert things from DataFrame/Series to GeoDataFrame/GeoSeries.

@m-richards
Copy link
Member

m-richards commented Apr 29, 2024

To be fully transparent, my primary project at work is BigQuery DataFrames. We currently return pandas objects with the existing geopandas extension dtype when folks download a table that contains a BigQuery GEOGRAPHY column.

My goal here is to make things consistent so that those downloaded objects can be much more useful and that if/when BigQuery DataFrames starts adding methods for working with those GEOGRAPHY columns we can do so without forcing users to convert things from DataFrame/Series to GeoDataFrame/GeoSeries.

Great! - I hoped you'd share something like this. I think there certainly is a use case here of supporting geometrydtype data in regular pandas series/ dataframes - even if it's not the core usage mode of geopandas (I don't anticipate adding the geo accessor to change the api of a GeoSeries/GeoDataFrame going forward, it'd be a massive breaking change and in my view the time window to do that has been and gone). However, I think we'd have to be a bit careful about whether it actually makes sense to do this for dataframes.

As I understand it, the accessor registration means that df.geo.covers(s) will apply the geodataframe constuctor to df and then call .covers on it. That works fine in the case there is only one geometry column, and it is named "geometry", but if say I updated your test dataframe to use another column name, df.geo.covers(s) will crash.

@pytest.fixture
def df():
    return pd.DataFrame(
        {
            "geometry2": pd.Series(
                [Point(x, x) for x in range(3)], dtype=GeometryDtype()
            ),
            "value1": np.arange(3, dtype="int64"),
            "value2": np.array([1, 2, 1], dtype="int64"),
        }
    )

(now in this case you could work around it by specifying df["geometry2"].geo.covers(s) but the point holds for e.g. sjoin which has no GeoSeries analogue)

The current way of going from a pandas dataframe to a geodataframe without using the constructor is via set_geometry (which is monkey patched onto pd.DataFrame) - I understand though that may not / does not work if you're trying to preserve functionality of the bigquery dataframe library.

Do your users expect to be able to use geodataframe exclusive methods like sjoin, clip, etc?

@tswast
Copy link
Author

tswast commented Apr 29, 2024

Do your users expect to be able to use geodataframe exclusive methods like sjoin, clip, etc?

At the moment, in BigQuery DataFrames we're actually just exposing the Series accessors. I would like to support spatial joins at some point, but I'm not sure if we'd do that via an analogue to the GeoDataFrame class or not.

That works fine in the case there is only one geometry column, and it is named "geometry", but if say I updated your test dataframe to use another column name, df.geo.covers(s) will crash.

Maybe what I can do in the meantime is update that notebook to have an sjoin example, where we have to rename a column to "geometry" first?

@tswast
Copy link
Author

tswast commented Apr 29, 2024

Regarding the failing test, it's passing for me:

pytest geopandas/tests/test_geodataframe.py::TestDataFrame::test_set_geometry_col
=============================== test session starts ================================
platform darwin -- Python 3.11.2, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/swast/src/github.com/geopandas/geopandas
configfile: pyproject.toml
collected 1 item                                                                   

geopandas/tests/test_geodataframe.py .                                       [100%]

================================ 1 passed in 0.29s =================================

@martinfleis
Copy link
Member

I am personally not a big fan of this as I am mostly worried about the maintenance cost it will bring. Yes, simple cases work but in order to properly support this, we would need extensive test suite to catch all the corner cases and then ensure we squash all potential bugs that may lurk there in future. All that for very little gain.

The example with the geometry column name is one of the sort but there will undoubtedly be other. We struggle to keep up with maintenance and development and this feels like a lot of maintenance that brings near nothing.

@tswast is Google Cloud willing to support the maintenance in a long term, if you want to use this feature?

@tswast
Copy link
Author

tswast commented Apr 29, 2024

Is Google Cloud willing to support the maintenance in a long term, if you want to use this feature?

Unlikely. I made this change on my personal time over the weekend. It seemed to me to be aligned with the discussions I could find about making geopandas a proper extension dtype, but I may have misread the situation.

FWIW: https://pandas.pydata.org/docs/development/extending.html is a stable API now and has been around for a good few years (announcement blog post).

@tswast
Copy link
Author

tswast commented Apr 29, 2024

It seems to me that the GeoSeries change is less controversial. Would you like me to remove the GeoDataFrame accessor since it's more likely to cause trouble?

@tswast tswast changed the title ENH: register GeoSeries and GeoDataFrame as pandas Series and DataFrame accessors, respectively ENH: register GeoSeries as the pandas.Series.geo accessor Apr 29, 2024
@m-richards
Copy link
Member

It seems to me that the GeoSeries change is less controversial.

I think that's fair to say

Maybe what I can do in the meantime is update that notebook to have an sjoin example, where we have to rename a column to "geometry" first?

I don't think I'd support this - it's not intutive, the error is confusing and it's counter to geopandas generally trying to track this for you so this kind of thing doesn't happen - it's why we have a geodataframe subclass (historically this used to track other things like crs and sindex, but now _metadata = ["_geometry_column_name"] is the only extra metadata key we define).

Would you like me to remove the GeoDataFrame accessor since it's more likely to cause trouble?

I think in terms of trying to progress this in the short term that's a good idea. I'm a bit more optimistic than Martin in terms of potential edge cases (perhaps naively), especially for the GeoSeries case, as there should be no action at a distance - we're just attaching a way to indirectly convert GeometryDtype Series to a GeoSeries.

@tswast
Copy link
Author

tswast commented Apr 30, 2024

Would you like me to remove the GeoDataFrame accessor since it's more likely to cause trouble?

I think in terms of trying to progress this in the short term that's a good idea.

Thanks, done in the most recent 2 commits: https://github.com/geopandas/geopandas/pull/3272/files/17504e58fe3f00f680963949b80cdb8ede3e0a52..97dc3fb008a371c1d3a9f1c363173fa92b90f78b

I appreciate the feedback so far. I'm personally finding some satisfaction in making geopandas more consistent with other packages that extend pandas. Thankfully all the hard work you've all done so far made this change quite easy to make, so not much lost if not accepted in the end.

@tswast tswast requested a review from m-richards May 8, 2024 20:30
@martinfleis martinfleis added the needs discussion Requires further discussion label May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs discussion Requires further discussion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use geo accessor for GeoSeries methods?
3 participants