ENH: register `GeoSeries` as the `pandas.Series.geo` accessor #3272

tswast · 2024-04-27T17:36:39Z

This allows geopandas to be used like other extension dtypes.

import pandas
import shapely
import geopandas

series = pandas.Series([shapely.Point(1, 2), shapely.Point(3, 4), shapely.Point(5, 6)], dtype=geopandas.array.GeometryDtype())

# Edit: Removed DataFrame accessor changes from this PR.

series.geo.x
# Out[5]: 
# 0    1.0
# 1    3.0
# 2    5.0
# dtype: float64

TODO

Tests
Docs

Fixes #166
Related #680

…me accessors, respectively

tswast · 2024-04-27T18:05:07Z

geopandas/geoseries.py

@@ -75,6 +76,7 @@ def _geoseries_expanddim(data=None, *args, **kwargs):
 return _expanddim_logic(df)


+@pandas.api.extensions.register_series_accessor("geo")


This is publicly documented here: https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.register_series_accessor.html#pandas.api.extensions.register_series_accessor

I don't think you should be registering GeoSeries as the accessor, but an accessor class (perhaps derived from GeoSeries so that all the methods exist). This seems a bit counter to the extension registration documentation above. Implemented like this, we don't conform to the documented expectation:

For consistency with pandas methods, you should raise an AttributeError if the data passed to your accessor has an incorrect dtype.

I think we should be doing this.

Also, with GeoSeries as the accessor, this would automatically coerce object dtypes, e.g.

s = pd.Series([Point(x, x) for x in range(3)]) x = s.geo.x # this works, even though s is not GeometryDtype

I would expect the accessor should only work for GeometryDtype Series.

Makes sense. I can look into providing a separate accessor class that delegates to GeoSeries and raises AttributeError for incorrect dtype.

Done in 42ea6c5

m-richards · 2024-04-28T01:19:24Z

Hi @tswast, thanks for putting this PR forward. Would you be able to share a little bit about the use case / goal you have in mind for this? I'd just like to understand this a little bit better before reviewing. The issues you have linked to are quite old, and have relatively little commentary from current maintainers of geopandas, so it'd be good to discuss what the aim is a bit before finessing the PR.

You may or may not have seen there is also #1947 and the corresponding #1952 PR which are related to this, but got a bit stalled, in part because they didn't have a super clear statement of intent.

Now I can try infer this from the documentation notebook you've added, but it would be good to hear directly from you, rather than me guessing.

tswast · 2024-04-28T14:43:25Z

#1947 is kinda the opposite of what I'm doing here. Those are about copying the implementation of the register accessor API from pandas to geopandas so folks can register geopandas-specific accessors.

This PR registers geopandas as a true extension dtype in pandas, making it possible to use geopandas consistently with other extension dtypes. In this PR, I am adding a "geo" accessor to pandas.Series and pandas.DataFrame, as proposed in #166. This builds on the work that has already been done here to make the Geometry extension dtype and extension array.

To be fully transparent, my primary project at work is BigQuery DataFrames. We currently return pandas objects with the existing geopandas extension dtype when folks download a table that contains a BigQuery GEOGRAPHY column.

My goal here is to make things consistent so that those downloaded objects can be much more useful and that if/when BigQuery DataFrames starts adding methods for working with those GEOGRAPHY columns we can do so without forcing users to convert things from DataFrame/Series to GeoDataFrame/GeoSeries.

m-richards · 2024-04-29T11:14:29Z

To be fully transparent, my primary project at work is BigQuery DataFrames. We currently return pandas objects with the existing geopandas extension dtype when folks download a table that contains a BigQuery GEOGRAPHY column.

My goal here is to make things consistent so that those downloaded objects can be much more useful and that if/when BigQuery DataFrames starts adding methods for working with those GEOGRAPHY columns we can do so without forcing users to convert things from DataFrame/Series to GeoDataFrame/GeoSeries.

Great! - I hoped you'd share something like this. I think there certainly is a use case here of supporting geometrydtype data in regular pandas series/ dataframes - even if it's not the core usage mode of geopandas (I don't anticipate adding the geo accessor to change the api of a GeoSeries/GeoDataFrame going forward, it'd be a massive breaking change and in my view the time window to do that has been and gone). However, I think we'd have to be a bit careful about whether it actually makes sense to do this for dataframes.

As I understand it, the accessor registration means that df.geo.covers(s) will apply the geodataframe constuctor to df and then call .covers on it. That works fine in the case there is only one geometry column, and it is named "geometry", but if say I updated your test dataframe to use another column name, df.geo.covers(s) will crash.

@pytest.fixture
def df():
    return pd.DataFrame(
        {
            "geometry2": pd.Series(
                [Point(x, x) for x in range(3)], dtype=GeometryDtype()
            ),
            "value1": np.arange(3, dtype="int64"),
            "value2": np.array([1, 2, 1], dtype="int64"),
        }
    )

(now in this case you could work around it by specifying df["geometry2"].geo.covers(s) but the point holds for e.g. sjoin which has no GeoSeries analogue)

The current way of going from a pandas dataframe to a geodataframe without using the constructor is via set_geometry (which is monkey patched onto pd.DataFrame) - I understand though that may not / does not work if you're trying to preserve functionality of the bigquery dataframe library.

Do your users expect to be able to use geodataframe exclusive methods like sjoin, clip, etc?

tswast · 2024-04-29T13:44:11Z

Do your users expect to be able to use geodataframe exclusive methods like sjoin, clip, etc?

At the moment, in BigQuery DataFrames we're actually just exposing the Series accessors. I would like to support spatial joins at some point, but I'm not sure if we'd do that via an analogue to the GeoDataFrame class or not.

That works fine in the case there is only one geometry column, and it is named "geometry", but if say I updated your test dataframe to use another column name, df.geo.covers(s) will crash.

Maybe what I can do in the meantime is update that notebook to have an sjoin example, where we have to rename a column to "geometry" first?

tswast · 2024-04-29T15:31:00Z

Regarding the failing test, it's passing for me:

pytest geopandas/tests/test_geodataframe.py::TestDataFrame::test_set_geometry_col
=============================== test session starts ================================
platform darwin -- Python 3.11.2, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/swast/src/github.com/geopandas/geopandas
configfile: pyproject.toml
collected 1 item                                                                   

geopandas/tests/test_geodataframe.py .                                       [100%]

================================ 1 passed in 0.29s =================================

martinfleis · 2024-04-29T19:30:17Z

I am personally not a big fan of this as I am mostly worried about the maintenance cost it will bring. Yes, simple cases work but in order to properly support this, we would need extensive test suite to catch all the corner cases and then ensure we squash all potential bugs that may lurk there in future. All that for very little gain.

The example with the geometry column name is one of the sort but there will undoubtedly be other. We struggle to keep up with maintenance and development and this feels like a lot of maintenance that brings near nothing.

@tswast is Google Cloud willing to support the maintenance in a long term, if you want to use this feature?

tswast · 2024-04-29T20:26:56Z

Is Google Cloud willing to support the maintenance in a long term, if you want to use this feature?

Unlikely. I made this change on my personal time over the weekend. It seemed to me to be aligned with the discussions I could find about making geopandas a proper extension dtype, but I may have misread the situation.

FWIW: https://pandas.pydata.org/docs/development/extending.html is a stable API now and has been around for a good few years (announcement blog post).

tswast · 2024-04-29T20:33:24Z

It seems to me that the GeoSeries change is less controversial. Would you like me to remove the GeoDataFrame accessor since it's more likely to cause trouble?

m-richards · 2024-04-30T10:56:10Z

It seems to me that the GeoSeries change is less controversial.

I think that's fair to say

Maybe what I can do in the meantime is update that notebook to have an sjoin example, where we have to rename a column to "geometry" first?

I don't think I'd support this - it's not intutive, the error is confusing and it's counter to geopandas generally trying to track this for you so this kind of thing doesn't happen - it's why we have a geodataframe subclass (historically this used to track other things like crs and sindex, but now _metadata = ["_geometry_column_name"] is the only extra metadata key we define).

Would you like me to remove the GeoDataFrame accessor since it's more likely to cause trouble?

I think in terms of trying to progress this in the short term that's a good idea. I'm a bit more optimistic than Martin in terms of potential edge cases (perhaps naively), especially for the GeoSeries case, as there should be no action at a distance - we're just attaching a way to indirectly convert GeometryDtype Series to a GeoSeries.

tswast · 2024-04-30T19:27:29Z

Would you like me to remove the GeoDataFrame accessor since it's more likely to cause trouble?

I think in terms of trying to progress this in the short term that's a good idea.

Thanks, done in the most recent 2 commits: https://github.com/geopandas/geopandas/pull/3272/files/17504e58fe3f00f680963949b80cdb8ede3e0a52..97dc3fb008a371c1d3a9f1c363173fa92b90f78b

I appreciate the feedback so far. I'm personally finding some satisfaction in making geopandas more consistent with other packages that extend pandas. Thankfully all the hard work you've all done so far made this change quite easy to make, so not much lost if not accepted in the end.

tswast added 3 commits April 27, 2024 12:29

ENH: register GeoSeries and GeoDataFrame as pandas Series and DataFra…

933a3c2

…me accessors, respectively

add tests

3593dc7

add sample notebook

c428156

tswast marked this pull request as ready for review April 27, 2024 17:58

lint

d0004c3

tswast commented Apr 27, 2024

View reviewed changes

tswast added 3 commits April 29, 2024 08:54

clarify that DataFrame.geo requires a column named geometry

cbb37f2

Merge remote-tracking branch 'upstream/main'

3f5b769

remove empty cell

17504e5

tswast added 2 commits April 29, 2024 16:26

revert changes to GeoDataFrame

306d425

revert more

97dc3fb

tswast changed the title ~~ENH: register GeoSeries and GeoDataFrame as pandas Series and DataFrame accessors, respectively~~ ENH: register GeoSeries as the pandas.Series.geo accessor Apr 29, 2024

tswast requested a review from m-richards May 8, 2024 20:30

martinfleis added the needs discussion Requires further discussion label May 8, 2024

tswast and others added 6 commits May 12, 2024 15:01

Merge remote-tracking branch 'upstream/main'

52ded0e

use a separate accessor class to proxy methods to GeoSeries

42ea6c5

lint error

3e8376e

use specific code for unused import

0edab76

clear whitespace

d2e8a61

more whitespace

1fa3d01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: register `GeoSeries` as the `pandas.Series.geo` accessor #3272

ENH: register `GeoSeries` as the `pandas.Series.geo` accessor #3272

tswast commented Apr 27, 2024 •

edited

tswast Apr 27, 2024

m-richards Apr 30, 2024

tswast May 8, 2024

tswast May 12, 2024

m-richards commented Apr 28, 2024

tswast commented Apr 28, 2024

m-richards commented Apr 29, 2024 •

edited

tswast commented Apr 29, 2024

tswast commented Apr 29, 2024

martinfleis commented Apr 29, 2024

tswast commented Apr 29, 2024

tswast commented Apr 29, 2024

m-richards commented Apr 30, 2024

tswast commented Apr 30, 2024

		@@ -75,6 +76,7 @@ def _geoseries_expanddim(data=None, args, *kwargs):
		return _expanddim_logic(df)


		@pandas.api.extensions.register_series_accessor("geo")

ENH: register GeoSeries as the pandas.Series.geo accessor #3272

Are you sure you want to change the base?

ENH: register GeoSeries as the pandas.Series.geo accessor #3272

Conversation

tswast commented Apr 27, 2024 • edited

tswast Apr 27, 2024

Choose a reason for hiding this comment

m-richards Apr 30, 2024

Choose a reason for hiding this comment

tswast May 8, 2024

Choose a reason for hiding this comment

tswast May 12, 2024

Choose a reason for hiding this comment

m-richards commented Apr 28, 2024

tswast commented Apr 28, 2024

m-richards commented Apr 29, 2024 • edited

tswast commented Apr 29, 2024

tswast commented Apr 29, 2024

martinfleis commented Apr 29, 2024

tswast commented Apr 29, 2024

tswast commented Apr 29, 2024

m-richards commented Apr 30, 2024

tswast commented Apr 30, 2024

ENH: register `GeoSeries` as the `pandas.Series.geo` accessor #3272

ENH: register `GeoSeries` as the `pandas.Series.geo` accessor #3272

tswast commented Apr 27, 2024 •

edited

m-richards commented Apr 29, 2024 •

edited