Skip to content

Latest commit

 

History

History
158 lines (122 loc) · 6.64 KB

DESIGN.rst

File metadata and controls

158 lines (122 loc) · 6.64 KB

Design Notes

Rasterio's design can be deduced from its code, but we can make it even more comprehensible by writing about it in simple language. That's what this document is about: describing the abstractions and design of the software to project developers.

Rasterio has low level abstractions and higher level abstractions. Let's be clear: none of them are as high as some users want. Rasterio has no zonal stats feature. No NDVI feature. No interactive mapping features. But it does provide low-level abstractions that can be used to build these features in other applications.

Interfaces

Rasterio has interfaces that are not yet described using abstract base classes or other formal interface system. The following subsections describe them briefly.

DataAccessor

This interface is involved with opening a dataset for access and is implemented by the DatasetReader and DatasetWriter classes. Their constructors take a str or os.PathLike object and, internally, attempt to adapt it to a rasterio.path.Path object.

A DataAccessor is in some ways analogous to a Python I/O stream. It has an access mode: "r", "r+", "w", or "w+". It can be in open or closed state. It is a context manager. It has methods that read or write unlabeled arrays of raster pixels to or from a dataset or optional windows (think slices) of a dataset. A DataAccessor has more attributes than a Python I/O steam. There's no "encoding" but there is a "crs" describing the coordinate reference system for the pixels and a "transform", "gcps", or "rcps" attribute describing how the array indices map to coordinates in that system.

Raster bands are not one of rasterio's abstractions. We don't read data from the band of a dataset. We read multi-dimensional data from a dataset via a DataAccessor.

Array

A DataAccessor trades in not-sparse (dense) unlabeled Numpy arrays with a minimum dimension of 2: row and column, or line and pixel. In the case of multichannel/multiband datasets, like RGB imagery, there can also be a third dimension corresponding to the channel or band. For these, the dimensions would be: band, row, and column, in that order.

Elements of these arrays generally represent values integrated over an area. Gridded, possibly sparse, point data can be handled, but it is not the default as it is with, for example, xarray.

rasterio.path.Path

GDAL's GDALOpenEx takes an array of UTF-8 encoded bytes as its primary argument. These bytes may contain a filename, a URL, an RDBMS connection string, XML, or JSON. Almost any kind of dataset address, really. GDAL puts no constraint on the content at all. A future format driver might use an array of emoji to address data and GDAL would be fine with that.

A rasterio.path.Path object contains a GDAL dataset address and has an as_vsi() method, the result of which can be UTF-8 encoded and given to GDALOpenEx.

This interface isn't meant for public consumption. We might make it private, to the extent that anything can be private in Python.

DataPath

By analogy to Python's pathlib.Path, a rasterio DataPath has an open() method that returns a DataAccessor.

rasterio.io.MemoryFile and rasterio.io.FilePath implement the DataPath interface.

Tools

The issue at #1300 describes rasterio's higher level tool abstraction. A tool is more or less the guts of a command line program, minus the argument and option parsing. It works on named datasets, not on arrays or Python objects.

The tool abstraction is: given names of input and output files and driver and environment configuration parameters, the tool transforms pixels quickly and efficiently, absorbing the complexity of lazy data loading and concurrency.

Opening a dataset

rasterio.open() accepts a variety of inputs and returns a DataAccessor.

If the input implements DataPath, open() delegates to the input object. If the input can be adapted to DataPath, open() delegates to the adapter. If the input is a str or os.PathLike, it is adapted to rasterio.path.Path and passed to a DataAccessor constructor.

Data types

Rasterio uses Numpy data types and translates these to GDAL types before calling GDAL methods.

GDAL context

GDAL relies on global state in the form of format drivers, a connection pool, an error stack, caches, and configuration for these and optional software features. Rasterio presents this context as a Python object: rasterio.env.local._env. The rasterio.env.Env context manager is rasterio's abstraction for configuration of the context. Importing rasterio creates the absolute minimum of GDAL global state. It is not until an instance of rasterio.env.Env is created and its context is entered, whether explicitly or implicitly (by calling rasterio.open), that format drivers are registered and rasterio.env.local._env becomes not None.

Many methods of rasterio require GDAL's context to be fully initialized. To make this easy to ensure, we can use decorators from the rasterio.env module. See for example the exists function in rasterio/shutil.pyx.

Errors and exceptions

GDAL maintains an error stack and a registry of handlers that are called when an error is pushed onto the stack. Rasterio registers a handler that routes GDAL error messages to Python's logger. We don't enable registration of other handlers. Instead, users and developers should work with Python's logger. Additionally, we check the error stack after calling GDAL functions from Cython extension code and raise a Python exception if the last error is of GDAL type >= 3. Several functions in rasterio._err exist to help: exc_wrap_int, exc_wrap_pointer, etc.

GDAL raster band cache

GDAL has a per-process in-memory LRU (least recently used) raster block cache. A DataAccessor's read method results in cached blocks. Subsequent reads from the same accessor may reuse those cached blocks. Calling a DataAccessor's write method will update cached blocks. Cached blocks are written to the dataset's storage when evicted from the cache or when the DataAccessor is closed, flushing all the dataset's cached blocks.

Rasterio has no abstraction for this cache.

Command line interface

Rasterio includes a command line program named "rio". It shares a set of options with the "fio" program from the Fiona project (the vector counterpart to rasterio). The rio program has one level of subcommands. The subcommands do different things, though there is a little bit of overlap so that users don't always have to call multiple commands to get a slightly different result. Raster operations don't compose as readily as line-oriented text operations do.