-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update CRS section of metadata specification #40
base: main
Are you sure you want to change the base?
Conversation
extension-types.md
Outdated
information about the CRS. Note that regardless of the axis | ||
order specified by the CRS, axis order will be interpreted | ||
according to the wording in the | ||
[GeoPackage WKB binary encoding](https://www.geopackage.org/spec130/index.html#gpb_format): | ||
axis order is always (longitude, latitude) and (easting, northing) | ||
regardless of the the axis order encoded in the CRS specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this "Note that regardless .." down outside the "One of" sublist, since this is relevant for all different ways you can specify the CRS.
extension-types.md
Outdated
This key can also be omitted if the producer does not have any | ||
information about the CRS. Note that regardless of the axis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "this key can also be omitted .. " will need to be updated.
+1 on adding the flexibility to allow any generic string repr (with the recommendation to use PROJJSON)
While I certainly agree the consistency with GeoParquet is nice, I do think that some of the arguments for making this choice in GeoParquet don't fully apply here. IIRC, for GeoParquet, one of the reasons was to make it easier to produce data without having a full-fledged geo stack available that supports projjson. But, in our case, since we don't require to use the json format, you can actually also easily put "EPSG:4326" if you know it's lon/lat. Also, for GeoArrow, one of the main use cases is to pass information along with the data from one system to another. I think we should very strongly recommend to always be explicit about the CRS in such a case (while I know for GeoParquet some argued for defaulting to have the CRS field not present as indication for lon/lat) |
Another option could be to say that the CRS metadata is not optional, so we force to be explicit (explicitly lon/lat or null, instead of one of those as the default). Not sure I like this myself though. |
Agreed on all points! I removed the "omitted == lon/lat" option...I think there is a real danger that accidentally omitting the CRS will result in spurious interpretations. |
Linking this to a bunch of previous discussion in #8 |
Thanks for all comments! If there are no objections, I'm planning to merge tomorrow afternoon 🙂 |
extension-types.md
Outdated
- A JSON object describing the coordinate reference system (CRS) | ||
using [PROJJSON](https://proj.org/specifications/projjson.html). | ||
- A string containing an undefined CRS representation. This option | ||
is intended as a fallback for producers (e.g., database drivers or | ||
file readers) that are provided a CRS in some form but do not have the | ||
means by which to convert it to PROJJSON. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's either JSON or a random string, how do you know whether to try to parse it as json?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to allow anything I think it would be a lot easier to have different keys for different value types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entire metadata is JSON, so at the point that it can be inspected it will already have been parsed (if using the JSON option). If using the "random string" option, it will remain an (unparsed) string.
- Full metadata with CRS encoded as JSON:
{"crs": {"key": "value"}}
- Full metadata with CRS encoded as random string:
{"crs": "EPSG:4326"}
I don't know of a language where differentiating between those two cases is not trival (e.g., in Python it could be isinstance(metadata["crs"], str)
; in R it could be is.character(metadata$crs)
; in C the differentiation happens during parsing...if the first non-whitespace character after :
is {
, it's a JSON CRS).
Is there a something here that I missed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I forgot that this was already inside a JSON object. I thought for a second that it was a top-level key in the Arrow extension metadata
Maybe it would be useful to consider flatgeobuf's approach of having various ways to represent the CRS: https://github.com/flatgeobuf/flatgeobuf/blob/b013ac326568914aee43bcf3aa3c77574d629aa5/src/fbs/header.fbs#L58-L65. Or STAC's PROJ extension, which allows any of
In this case what did you have in mind for the CRS value in geoarrow? If the application has PROJ on hand, it can convert the WKT to PROJJSON, but otherwise it dumps the WKT in the same field? Then you're trying to maybe parse as JSON and maybe parse as WKT? Sounds like a mess. |
Did this get changed back in the PR? |
Yes - the updated text now reads that a missing CRS indicates that no information about the CRS is known. |
I think the difference between the two linked specs is that the only representation available there are strings. Here, we have all of JSON available to us, so it can be an object, a string, a number, or a null; PROJJSON is already helpfully parsed by whatever is parsing the metadata. We could add a "crs_type" key alongside the "crs" to disambiguate the parsing of the string if this is a problem for implementors?
At the end of the day it's pretty much all just |
I was thinking this might be useful |
It very well may be! I think I would prefer to defer adding it for now and introduce it with a specific usage difficulty in mind. Adding the key later would be backward compatible (if absent, it would continue to mean that the producer does not know what the CRS representation is). Or if there's a current example of a usage difficulty we could tackle it now? I would hope that implementations only use the "random string" method if absolutely necessary (e.g., anything with access to PROJ should be writing a PROJJSON object, which will be true for anything coming from GeoPandas or the R spatial stack). All of the use-cases I have in mind for the "random string" method involve not knowing anything about the representation, so I can't immediately think of a time I would actually set the value. |
What about converting e.g. from FlatGeobuf to GeoArrow without access to PROJ, such as in the browser or Node. FlatGeobuf doesn't include PROJJSON, so I'd have to include one of the other FlatGeobuf CRS forms, but it seems like it would be useful to know that the CRS is e.g. WKT or an EPSG code |
It definitely might be useful but does come at a non-trivial implementation complexity. I'm game to add the change and do the implementation...I think my point here is just that it's a consideration that we can defer until one of us runs into an implementation problem (e.g., when it can be demonstrated that there is a performance cost to guessing, that there are cases where guessing is not reliable, there is an implementation that does not have the capability to guess). There are quite a lot of things to implement in the meantime! |
Being able to put anything in there just smells bad. Reading back through the discussion in #8... is I agree there are other things to implement in the meantime, but it feels like it's always easier to expand a spec than narrow it down. |
…s-a-string can of worms.
The sheer number of options is I think the reason why the specification should not enter into the discussion lightly. This change is intended to empower producers to be good citizens and make a clear path form them to pass on CRS information they have (the other options being to omit it or attempt to pass it along using a method not defined in this specification). It's a valid discussion whether or not there should be a "crs_type" key to provide even more information; however, its contents and how that contents constrains the contents of the "crs" key is a potentially large can of worms and has a much wider scope of what I'm trying to do with this PR. Even if there is a "crs_type" key, there still needs to be a fallback option for producers that just don't know: this PR enables that fallback option. I added some language to hopefully clarify the intent of allowing a string representation (and reiterate that if a producer needs any kind of guarantee about how a consumer will interpret the CRS, they need to use PROJJSON). |
With some delay, trying to wrap my head around this discussion (and sorry for the long post .. and more questions than answers) To start, I think there is agreement on allowing something beyond PROJJSON, right? (with typical use case: implementation of some reader that doesn't have access to PROJ to convert the format-specific CRS representation to PROJJSON) Then the next question is: what exactly do we allow? Do we want to be strict and clearly define a strict set of options (for example like the stac extension having epsg, wkt2 or projjson options)? Or do we want to be flexible and also allow essentially any, potentially even unknown, format (but always as a string, I suppose)?
Let's take the example of flatgeobuf: this format has a WKT field, but is that guaranteed to be WKT2? My understanding is that it is not (flatgeobuf/flatgeobuf@c1e9f89), and so if we require WKT2, that would still be a problem (although for this use case you could also code the auth:code way, as that will typically be available in the fgb file) Another question might also be: do we want to allow multiple formats at the same time? (i.e. in the same metadata) For example, STAC and FlatGeoBuf essentially do this, allowing to store multiple representations (and PostGIS spatial_ref_sys table in the end does the same as well) And a final question: if we want to allow multiple formats, how to specify the specific format that being used?
|
Thank you for the detailed summary! These are all good questions. The motivation for this PR is to solve a right now problem: it is currently not possible for non-spatial aware producers to include CRS metadata. From the previous discussion, I think there are two ways to solve that:
Neither of these options would change the case where the producer has no CRS information (the Given the content of this PR, I obviously feel that the first option is best. This is because I don't know of any situation in which the content of I am primarily working to embed this specification with engineers who don't know about GIS and need convincing that any geospatial type metadata is important to propagate (I'm billing "crs" as a timezone but, like, for map things). I am happy to advocate that We could also provide some guidance as documentation for engineers writing producers that might not have any background in GIS (e.g., on AUTH:CODE, WKT2 vs WKT, etc.); however, that change is more closely tied to adding a |
A couple of points since this is exactly the same discussion we had with GeoJSON and you know where we ended up with that 😏
It all boils down to "do I really need infinite possibilities of coordinate systems supported by my format?" The answer to that is almost always 'NO'. Now that I'm old, I've come to rely on one true thing... if GDAL's |
Thank you! One of the challenges with this spec is that we're not quite sure exactly how (or if?) it will be used. We sort of have three audiences in mind:
Not all of those audiences may end up using this spec...I suppose with this PR I am asking for some flexibility to prototype those low-level tools within the (draft) spec. If nobody uses them, I have no problem tightening the spec back up (just as GeoJSON did!). |
This is the reason why I've been implementing GeoParquet and Arrow (not GeoArrow, just XYZ columns of Arrow IPC) to PDAL. See PDAL/PDAL#4115 |
Hey me too! 😂 There's two of us!
Sorry, can you expand on why it's easier for you to implement |
Because point clouds are simply XYZ columns and arrow-cpp API has an easy to use IPC writer. |
GeoArrow is simply an XYZ struct with a CRS that doesn't get lost when you put it across an application boundary 🙂 |
Yeah, I wasn't clear but I meant to say: you're so close to supporting GeoArrow already! If you just wrap the three columns into a logical struct can give it the extension name |
So what happens when you compress structs? As individual columns, they compress decently due to not much change in row-to-row value within each dimension. As a struct, the number of bits changing row-to-row is substantial. |
The key is that in arrow the struct is logical. So each of the X, Y, Z columns (within the struct!) are still their own separate buffers. If you save those to parquet, the physical buffers are exactly the same as if you saved them as three different logical columns. There's no interleaving happening when you store a struct column (that only happens if you store a Here's an example: Setup: import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
x = np.full(100, 1)
y = np.full(100, 2)
z = np.full(100, 3) Creating a table with three separate columns: table_separate_columns = pa.table({
'x': x,
'y': y,
'z': z
})
pq.write_table(table_separate_columns, 'table_separate.parquet')
So in this case, they overlap. If we read the Arrow schema and the Parquet metadata, they both have three columns: pq.read_schema('table_separate.parquet')
# x: int64
# y: int64
# z: int64
pq.read_metadata('table_separate.parquet')
# <pyarrow._parquet.FileMetaData object at 0x16b424680>
# created_by: parquet-cpp-arrow version 12.0.0
# num_columns: 3
# num_rows: 100
# num_row_groups: 1
# format_version: 2.6
# serialized_size: 708 But consider what happens when we write a struct array: struct_column = pa.StructArray.from_arrays([x, y, z], ['x', 'y', 'z'])
table_one_column = pa.table({
'xyz': struct_column
})
pq.write_table(table_one_column, 'table_one_col.parquet') Now the Arrow schema has one logical column, while three columns were written to Parquet:
And if you look at the file size on disk, it's virtually the same for each:
And 86 of those extra bytes in the first file are from the larger size of the metadata stored in the Parquet file. That's Arrow's metadata ensuring that Arrow can accurately round trip the data to and from Parquet without any loss |
@kylebarron @jorisvandenbossche Is there any middle ground we can settle on here? I am intererested specifically in how to communicate a CRS whose encoding is unknown (in a way that does not prevent future specialization if required). Joris suggested a separate key (i.e., The obsolete GeoJSON spec uses something like We could do There was some talk of compatibility with the GeoParquet spec; however, I don't think that keeping full compatibility with the GeoParquet |
I agree that we should have some way to declare an unknown CRS, just also feel that we want to store the CRS in some structured way. I actually hit this in geoarrow/geoarrow-rs#196, where I implemented writing from GeoArrow to FlatGeobuf, and realized that I won't be able to create the I could be convinced but don't immediately love the It appears impossible to be able to write both FlatGeobuf and GeoParquet without access to PROJ, because FlatGeobuf doesn't support PROJJSON and GeoParquet requires PROJJSON. |
I agree that the separate key thing is probably not quite what we want. That leaves:
This PR as currently worded is option 2. I would also be happy to implement option 1! Which would you prefer?
This PR is entirely about being able to read data without PROJ, and I think it makes sense to have a |
Is there ever a case where you want to store multiple representations of the same CRS, so that a consumer can choose whichever is easiest for them to use? Like FlatGeobuf storing both EPSG:4326 and its WKT2 representation. I think variant 1 might make it easier to adopt specified CRS types in the future at least?
Even implemented PROJJSON -> WKT2 in https://github.com/rouault/projjson_to_wkt; I made an issue about the converse and he said it's possible, especially if restricting to |
I suppose that would be useful for your example of writing FlatGeoBuf? (because if the source data only has PROJJSON, you need PROJ access in the writer to convert PROJJSON to WKT (although fgb might also be fine with just an auth:code that you might be able to extract from PROJJSON?) |
I wouldn't call it "crs_unknown", but maybe rather something like "crs_source" (or "crs_text", or "crs_description"). Then you either have well specified PROJJSON in the To be clear, I don't know if this is better than the other two options that Dewey listed above. It was just a random thought to have something that is also backwards compatible with the existing "crs" key being PROJJSON (and compatible with the geoparquet spec). And it gives the guarantee that "crs" is always well specified, if present. |
At the GeoArrow level I think this would result in some pretty verbose language in the spec around which one you should pick if both are specified. It is perhaps worth discussing, but definitely not required for the initial version of the spec (how to represent an unknown CRS is, though!).
I am not sure. I think we should add |
There are several flavors of WKT2. WKT2 is actually not an official name at all. The first flavor of WKT2 is formally "Well-known text representation of coordinate reference systems v1.0" (OGC 12-063r5 : https://docs.ogc.org/is/12-063r5/12-063r5.html), also registered as ISO 19162:2015(E), that we dubbed WKT2:2015 in PROJ. And the latest one is "Well-known text representation of coordinate reference systems v2.1.11" (OGC 18-010r11 https://docs.ogc.org/is/18-010r11/18-010r11.pdf / ISO 19162:2019/Amd 1:2023), aka WKT2:2019 in PROJ. For WKT1, see https://proj.org/en/9.3/development/reference/cpp/cpp_general.html#general_doc_1WKT1 . Good luck to make that clear to non-geo folks (and even to geo ones :-)) |
Including me! I certainly don't feel qualified to come up with the language around what constraints should and should not be placed on the content of the I think my main point in this PR is that (1) it is useful for some producers to put something that isn't PROJJSON in the |
This PR updates the documentation for the
"crs"
metadata key to (1) better align with the GeoParquet specification and (2) ensure that producers have the option to provide whatever CRS information they have in available. Real-world examples of producers where this matters are:auth_name
,auth_srid
,srtext
, andproj4text
.srs_name
,srs_id
, anddefinition
.prj
sidecar file contains "the crs" but can contain it in various (unspecified) forms.Allowing flexibility in the CRS definition is a balance between making it easy for producers to participate and pass on information that is essential for interpreting coordinates and making it easy for CRS-aware consumers to consume GeoArrow arrays.
The other notable change here is the change that an omitted CRS defaults to longitude/latitude. This is to align with the GeoParquet specification and I recall that decision was arrived at after some very long debates. I don't particularly agree with this decision but I do see the value of this assumption in the data engineering world where there are a lot of practitioners without geospatial experience.