Generalization of compression for spatial targets with GDAL #37

Open
njtierney opened this issue Mar 16, 2024 · 4 comments
Comments

@njtierney
Owner

njtierney commented Mar 16, 2024

Just pulling this from #4 as I'm not sure we captured this as an issue?

Generalization of the "multiple file target compression" GDAL /vsizip/ approach to all backends and formats that support it

From @brownag

The /vsizip/ GDAL virtual file system functionality used in format_shapefile() is an example of something that can be generalized further with a focus on generic GDAL data source paths. I think the idea of being able to compress files that are in the target store (and keep them compressed) is attractive for spatial data which can be quite large--even if targets are not comprised of multiple files.

Since GDAL can read from the compressed target store efficiently, you get the benefit of less file size footprint while also being able to read the file without fully extracting it.
We should also consider some of the other archive file formats and virtual file system types, and provide interfaces in R to produce them, e.g. /vsigzip/ or /vsitar/ analogs to /vsizip/ + utils::zip().
Even without creating specific compressed archive files, there should be robust tools available for controlling the GDAL file compression options, supported by many drivers, that are used to write target objects.
The ZIP approach is useful for GeoTIFF files where category information is stored in the .tif.aux.xml sidecar file. Convenience methods for terra SpatRaster objects could automatically store a target as a ZIP file (and give warnings about target naming) if the input SpatRaster is categorical and output format is GeoTIFF.
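
As a rough illustration of the /vsizip/ idea described above, here is a minimal sketch in Python using only the standard library (the R analog would be utils::zip(), as mentioned). The function names `zip_target` and `vsizip_path` are hypothetical and not part of geotargets or GDAL; the only GDAL fact assumed is the documented `/vsizip/<archive>/<member>` path form.

```python
import zipfile
from pathlib import Path


def zip_target(files, archive):
    """Bundle the files that make up one target into a single ZIP archive.

    Hypothetical helper for illustration only.
    """
    archive = Path(archive)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            f = Path(f)
            # Store members flat, so the path inside the archive is just the file name.
            zf.write(f, arcname=f.name)
    return archive


def vsizip_path(archive, member=None):
    """Build a GDAL /vsizip/ data source path for an archive, optionally naming a member.

    Hypothetical helper for illustration only.
    """
    path = "/vsizip/" + Path(archive).as_posix()
    if member is not None:
        path += "/" + member
    return path
```

The resulting string (for example `/vsizip/store/roads.zip/roads.shp`) is what would be handed to a GDAL-based reader in place of a plain file path, so the target can stay compressed in the store.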

@brownag
Contributor

brownag commented Mar 16, 2024

Thanks for splitting this out. I wanted to make one after closing #4 but didn't get to it yet.

I have been tinkering with some implementations for this issue and will have a draft PR in the not too distant future.

@Aariq
Collaborator

Aariq commented Apr 17, 2024

Is there a reason not to just zip the outputs of all GDAL drivers, even ones that are a single file? Are there downsides to using /vsizip/? E.g. is it not available in some instances?

@mdsumner

mdsumner commented Apr 18, 2024

Having the extra zip layer is a bit weird for formats that are both single-file and include internal compression. And there's the zip layer to read through, so it's less efficient. Note that GDAL added SOZip capability, which cloud-i-fied storing file/s within zip and made it very fast (not all zips will be as efficient). I don't think you'd want logic that pivots on whether a GeoTIFF is already compressed; even that leads to an explosion of option combinations. I think these kinds of choices are out of scope for this project (but very keen to discuss).
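
For single-file formats with internal compression, the driver's own creation options are one place to control it rather than an outer zip layer. A minimal sketch, assuming the `gdal_translate` command-line tool and the GTiff `COMPRESS` creation option; the helper name `gdal_translate_args` is hypothetical, and the function only assembles the argument list without running anything.

```python
def gdal_translate_args(src, dst, driver="GTiff", creation_options=None):
    """Assemble a gdal_translate command line with driver creation options.

    Hypothetical helper for illustration only; it does not invoke GDAL.
    """
    args = ["gdal_translate", "-of", driver]
    for key, value in (creation_options or {}).items():
        # Each creation option is passed as "-co KEY=VALUE".
        args += ["-co", f"{key}={value}"]
    return args + [str(src), str(dst)]
```

Which options and values are accepted depends on the driver and on how GDAL was built, so any such list would need checking against the GDAL version in use.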

Its support is GDAL and build dependent, so on CRAN you are currently at the mercy of the Windows maintainer's efforts, mostly guided by Roger Bivand in the past, and similarly for Mac, and then the binary installers that align to Linux builds. That's probably a good level to track in order to specify 1) version/s and 2) capabilities, to make some boundaries.

There are a lot of other subtleties too, because files like GeoTIFF and GeoPackage can have sidecar files (for GeoTIFF, for example, that's how GDAL supports Raster Attribute Tables, RAT, for categorical rasters), and there are controls over whether sidecar files are searched for at URLs and in directories ... so, apologies, all I can think of are details and complications. I think generally it's not a good idea to add a zip or any other layer unless you really need to; it's better to move to, and advise, modern formats (GeoTIFF, (Geo)Parquet, FlatGeobuf, Zarr). But if you need to, the zip container can be a good solution (bundle up one or many shapefiles, or MapInfo files, or CSVs, or many other options). Note there is also virtual file system support for gzip, tar, Azure, AWS, Google storage, and so on, so I tend to suggest staying as close as possible to what GDAL can do without adding layers (but that's not a straightforward topic without putting some pretty tight boundaries on the scope).
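
To make the gzip and tar virtual file systems mentioned above concrete, here is a small sketch that picks a GDAL prefix from an archive's file extension. The function name `vsi_prefix` and the extension mapping are assumptions for illustration; GDAL itself does not require choosing the prefix this way.

```python
def vsi_prefix(filename):
    """Guess a GDAL virtual file system prefix from an archive's extension.

    Hypothetical helper for illustration only. Returns "" when no
    archive layer is recognised.
    """
    name = str(filename).lower()
    # Check tar variants first so ".tar.gz" is not mistaken for a plain gzip file.
    if name.endswith((".tar", ".tar.gz", ".tgz")):
        return "/vsitar/"
    if name.endswith(".gz"):
        return "/vsigzip/"
    if name.endswith(".zip"):
        return "/vsizip/"
    return ""
```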

@Aariq
Collaborator

Aariq commented Apr 18, 2024

Oof, yeah, I can foresee us having to do a lot of thinking around this, and it might be best for geotargets to be opinionated and only allow/recommend certain file formats that we are confident will work with targets across different GDAL versions and OSs. E.g., just today I noticed that the "COG" driver produces sidecar aux.xml files on my university HPC but not when the same code is run locally (different GDAL versions).
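
One way to notice the kind of discrepancy described above is to check whether a write left sidecar files next to the target. A minimal sketch, assuming sidecars are named by appending a suffix to the full file name (as with `.tif.aux.xml`); the helper name `sidecar_files` is hypothetical.

```python
from pathlib import Path


def sidecar_files(target):
    """List files beside `target` whose names are the target's name plus a suffix.

    Hypothetical helper for illustration only; e.g. "dem.tif" -> ["dem.tif.aux.xml"].
    """
    target = Path(target)
    # Match "<name>.<anything>" in the same directory; the target itself does not match.
    return sorted(target.parent.glob(target.name + ".*"))
```

A non-empty result after writing a single-file target would signal that the output is effectively multi-file on that system.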

4 participants