BUG: Return true minimum dtype when np.inf/nan present & handle empty arrays #2724

groutr · 2023-01-23T18:01:12Z

When get_minimum_dtype was called with np.inf or np.nan, it would return float64. This PR fixes it to only do the range check on the finite values.

Additionally, I return the smallest dtype when the containers are empty, though, it could be argued to let the exceptions bubble up. Another approach would be to return None (ie GDT_Unknown). What would be preferred approach? My opinion is that either the smallest dtype (bool) or None should be returned.

ping @sgillies

Split off of #2722

Return the smallest dtype.

snowman2 · 2023-01-25T03:47:13Z

Thanks for these changes. I think in the scenario when you don't have any data to go off of, it is safer to pick the largest dtype. I think that is why numpy uses float64 for an empty array.

>>> import numpy
>>> arr = numpy.array([])
>>> arr
array([], dtype=float64)

For that reason, I suggest we follow the logic numpy is using and instead default to float64 for empty arrays.

snowman2 · 2023-01-25T04:00:46Z

After thinking about it further, I think what you have makes sense for the "minimum" dtype. So, ignore my previous comment.

snowman2

Thanks @groutr 👍

groutr · 2023-01-25T19:03:19Z

@snowman2 After looking over this PR some more, I realized there is some other subtle issues with get_minimum_dtype.

Bool should be return if range [0, 1] (was returning uint8)
Never returns complex dtype.

I restructured get_minimum_dtype to handle both of those issues and be more comprehensible.

After thinking it over more, None might be a more consistent return (or maybe even raise a ValueError?) on an empty container. There are two ways of thought, and I'm not sure which is the preferred way: 1) No values mean an unknown dtype (GDT_Unknown) and therefore return None/raise an error or 2) No values mean any dtype is possible and therefore return the smallest/minimum dtype.
If get_minimum_dtype returned None or raised an error, it means every call would need to be checked and calling code would look like

# returning None
dtype = get_minimum_dtype([])
if dtype:
    np.can_cast(dtype, 'float32')
else:
    # no values, or some unrecognized dtype (not int/float/complex)

# raise an error
try:
    dtype = get_minimum_dtype([])
    np.can_cast(dtype, 'float32')
except ValueError:
    print("No values???")
else:
    print("Unknown dtype (not int/float/complex)

I could see the constant checks after calling get_minimum_dtype could be annoying.

snowman2 · 2023-01-26T03:17:36Z

Bool should be return if range [0, 1] (was returning uint8)

See #2657

Boolean is not a GDAL type in the mapping:

rasterio/rasterio/dtypes.py

Lines 32 to 45 in 6bb3e52

 dtype_fwd = { 

None, # GDT_Unknown 

ubyte, # GDT_Byte 

uint16, # GDT_UInt16 

int16, # GDT_Int16 

uint32, # GDT_UInt32 

int32, # GDT_Int32 

float32, # GDT_Float32 

float64, # GDT_Float64 

complex_int16, # GDT_CInt16 

complex64, # GDT_CInt32 

complex64, # GDT_CFloat32 

complex128, # GDT_CFloat64 

 }

In rasterize, it doesn't have the boolean dtype:

rasterio/rasterio/features.py

Lines 255 to 257 in 6bb3e52

 valid_dtypes = ( 

 'int16', 'int32', 'uint8', 'uint16', 'uint32', 'float32', 'float64' 

 )

Boolean appears to only be used for masks. From what I can tell, uint8 appears to be correct.

snowman2 · 2023-01-26T03:20:41Z

@groutr, I think you have raised some valid points. I think the inf/nan changes are good to go. However, the empty array logic appears to need more thought/discussion. Mind moving the empty array changes/discussion to a separate PR?

snowman2 · 2023-01-26T03:22:48Z

Never returns complex dtype.

Mind moving these changes into a separate PR as well? I thin this topic is "complex" enough to be on its own 😄

groutr · 2023-01-26T16:27:13Z

@groutr, I think you have raised some valid points. I think the inf/nan changes are good to go. However, the empty array logic appears to need more thought/discussion. Mind moving the empty array changes/discussion to a separate PR?

Let's get #2734 merged first. Having several different open PRs against the same function is really confusing for me. With that PR merged, it would remove the non-array path and this PR could be simplified.

Regarding the bool return type, I was going off the docstring for get_minimum_dtype where it says the return is a "rasterio dtype string". What that is exactly isn't clear, but I took it to mean the these:

bool_ = 'bool'
ubyte = uint8 = 'uint8'
sbyte = int8 = 'int8'
uint16 = 'uint16'
int16 = 'int16'
uint32 = 'uint32'
int32 = 'int32'
uint64 = 'uint64'
int64 = 'int64'
float32 = 'float32'
float64 = 'float64'
complex_ = 'complex'
complex64 = 'complex64'
complex128 = 'complex128'
complex_int16 = "complex_int16"

snowman2 · 2023-02-06T15:19:58Z

What that is exactly isn't clear, but I took it to mean the these

Your logic makes sense. However, since bool doesn't map to a GDAL dtype, this will cause troubles in the rasterize function it is is used in as it is looking for the rasterio dtype with a GDAL dtype equivalent.

To help make the function clearer, I am thinking that it may be a good idea to update the docstring.

groutr · 2023-05-02T21:56:28Z

@sgillies There are several questions in this PR that need guidance/clarification.

What is the proper return for empty values? I feel None is the appropriate return value. Bool is another possible return or we could raise an error.
Should get_minimum_dtype consider complex values? What about bool values?
What exactly is a rasterio dtype string? Are they numpy dtypes or GDAL types? Should we consider the mapping idea proposed here: rasterio.dtypes improvements #2763 (comment)?

sgillies · 2024-01-05T16:10:30Z

@groutr @snowman2 how about this?

get_minimum_dtype() should be calculated using np.nanmin() and np.nanmax().
operation on an empty array should give a ValueError like numpy does: ("ValueError: zero-size array to reduction operation minimum which has no identity"). But maybe a little more clear 😄

Removing implicit use of this function from rasterize() makes its ergonomics more important, yes? We're kind of asking users to call it in their own code instead. What would you think about changing the function def to take multiple values so that a developer can plug in their dataset's min and max values without creating a list? Behavior like Python's min() or max(). For example get_minimum_dtype(-9999, 15).

groutr added 2 commits January 23, 2023 10:13

Fix calling min/max on empty container.

9afbdf9

Return the smallest dtype.

Check range of only finite float values.

b07f891

snowman2 changed the title ~~Return true minimum dtype when np.inf/nan present~~ BUG: Return true minimum dtype when np.inf/nan present & handle empty arrays Jan 25, 2023

snowman2 added the bug label Jan 25, 2023

snowman2 approved these changes Jan 25, 2023

View reviewed changes

snowman2 requested review from sgillies and vincentsarago January 25, 2023 04:01

snowman2 mentioned this pull request Jan 25, 2023

PERF: Optimize can_cast_dtype. #2726

Closed

snowman2 added this to the 1.4.0 milestone Jan 25, 2023

groutr added 4 commits January 25, 2023 11:25

Only return float when dtype is float.

0b35205

Handle bool.

cee90bf

Handle float/complex.

d01a137

Restructure get_minimum_dtype for better clarity.

f29f7e2

groutr mentioned this pull request Feb 8, 2023

rasterio.dtypes improvements #2763

Open

sgillies removed this from the 1.4.0 milestone Dec 21, 2023

sgillies added this to the 1.4.0 milestone Jan 5, 2024

sgillies added the enhancement label Jan 5, 2024

sgillies removed this from the 1.4.0 milestone Apr 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Return true minimum dtype when np.inf/nan present & handle empty arrays #2724

BUG: Return true minimum dtype when np.inf/nan present & handle empty arrays #2724

groutr commented Jan 23, 2023 •

edited

snowman2 commented Jan 25, 2023

snowman2 commented Jan 25, 2023

snowman2 left a comment

groutr commented Jan 25, 2023 •

edited

snowman2 commented Jan 26, 2023

snowman2 commented Jan 26, 2023

snowman2 commented Jan 26, 2023

groutr commented Jan 26, 2023

snowman2 commented Feb 6, 2023

groutr commented May 2, 2023 •

edited

sgillies commented Jan 5, 2024

BUG: Return true minimum dtype when np.inf/nan present & handle empty arrays #2724

Are you sure you want to change the base?

BUG: Return true minimum dtype when np.inf/nan present & handle empty arrays #2724

Conversation

groutr commented Jan 23, 2023 • edited

snowman2 commented Jan 25, 2023

snowman2 commented Jan 25, 2023

snowman2 left a comment

Choose a reason for hiding this comment

groutr commented Jan 25, 2023 • edited

snowman2 commented Jan 26, 2023

snowman2 commented Jan 26, 2023

snowman2 commented Jan 26, 2023

groutr commented Jan 26, 2023

snowman2 commented Feb 6, 2023

groutr commented May 2, 2023 • edited

sgillies commented Jan 5, 2024

groutr commented Jan 23, 2023 •

edited

groutr commented Jan 25, 2023 •

edited

groutr commented May 2, 2023 •

edited