I tested forecast models with a catalog whose events sometimes occur exactly on a cell boundary.
I found that `region.get_index_of` generally does not assign them the expected spatial bin but the opposite one (i.e., the neighboring bin). Although this function considers a tolerance (`numpy.finfo(numpy.float64).eps`) to (supposedly) account for numerical precision, I found that it is not sufficient to get the desired binning behavior, even after the revision in commit 949d342c.
I believe the corresponding unit tests don't catch these literal "corner cases", as those tests are too constructed/synthetic and not based on real data such as a loaded CSEP region. Specifically, they do not consider issues due to double precision (float64).
## Demo & Reasoning

### 1. The problem
Let's do a quick self check:
```python
import csep

region = csep.core.regions.california_relm_region()
idx = 3
origin = region.origins()[idx]
print(idx, origin)

idx2 = region.get_index_of(*origin)
print(idx2, region.origins()[idx2])  # not the same as `idx` and `origin`
```
outputs:

```
3 [-125.4 40.4]
2 [-125.4 40.3]
```
Oops. Not even a bin's origin is considered to be inside its own bin, but rather the neighboring one!
Btw: the above output may represent the coordinates as more precise than they numerically are:
```python
print(idx, [x for x in origin])
```

gives:

```
3 [-125.39999999999999, 40.400000000000006]
```
which highlights that our unit tests are currently too optimistic/simplistic.
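The mechanism can be reproduced without any CSEP region. Here is a minimal standalone sketch (with illustrative values, not taken from an actual grid) of the floor-based binning that `get_index_of` ultimately relies on via `bin1d_vec`:

```python
import math

# a0: grid origin, h: bin width, p: an event exactly on the lower edge of bin 2
# (bins: [0.1, 0.2), [0.2, 0.3), [0.3, 0.4), ...)
a0, h = 0.1, 0.1
p = 0.3

ratio = (p - a0) / h
print(ratio)              # slightly below 2.0 due to float64 rounding
print(math.floor(ratio))  # lands in bin 1 instead of the expected bin 2
```

The subtraction `p - a0` does not yield exactly 0.2 in float64, so the ratio falls just short of 2.0 and `floor` assigns the neighboring bin.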
### 2. The correction

Only adding a tiny amount to the latitude and longitude of the event coordinates gives the desired behavior:
```python
tinyamount = 1e-12
idx3 = region.get_index_of(
    origin[0] + tinyamount,
    origin[1] + tinyamount
)
print(idx3, region.origins()[idx3])  # same as `idx` and `origin`
```

outputs:

```
3 [-125.4 40.4]
```
Now we can do this self check for all origins in the region:
```python
# tinyamount = 0  # uncomment for raising an Error
for i, origin in enumerate(region.origins()):
    assert i == region.get_index_of(
        origin[0] + tinyamount,
        origin[1] + tinyamount
    )
```
Only if at least `tinyamount = 1e-12` is added does the loop run without an error.
Side note: actually, if `tinyamount = 0`, it wouldn't raise an AssertionError but a ValueError in `get_index_of` for the first bin, because `get_index_of` tries to assign a bin that lies East of it (which doesn't exist and is outside the region). You can get an AssertionError like so:
### 3. The complexity
But this numerical issue gets slightly more complicated than that:
#### a) Influence of the region

The region influences (i) the `tinyamount` that is just sufficient to bin correctly, and (ii) which coordinate component(s) must be corrected. For instance, for Italy, this is sufficient:
```python
region = csep.core.regions.italy_csep_region()
tinyamount = 2e-13  # Note #1: smaller than for the California region
for i, origin in enumerate(region.origins()):
    assert i == region.get_index_of(
        origin[0],  # Note #2: adding `tinyamount` to the Longitude NOT needed (!)
        origin[1] + tinyamount
    )
```
I don't know why, but it's certainly related to numerical representations of values.
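One plausible (hedged) explanation for the region dependence: the absolute spacing between adjacent float64 values (the ulp) grows with magnitude, so coordinates around 125° (California longitudes) carry roughly an order of magnitude more absolute rounding slack than coordinates around 15° (Italy). A quick check with `np.spacing`:

```python
import numpy as np

# Distance to the next representable float64 at typical coordinate magnitudes:
print(np.spacing(np.float64(125.4)))  # ~1.4e-14, California-sized longitudes
print(np.spacing(np.float64(15.0)))   # ~1.8e-15, Italy-sized coordinates
```

The ratio of these spacings (~8) is close to the ratio of the just-sufficient `tinyamount` values found empirically (1.0e-12 vs. 1.4e-13, a factor of ~7), which would explain why the required offset differs between regions.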
#### b) Influence of the floating-point format/precision

The just-sufficient `tinyamount` also depends on the floating-point format (i.e., precision). For instance, assuming the events have single precision (float32), this is necessary:
```python
import numpy as np

tinyamount = 1.6e-6  # for the Italy region
tinyamount = 3.1e-6  # for the California region
for i, origin in enumerate(region.origins()):
    assert i == region.get_index_of(
        origin[0].astype(np.float32) + tinyamount,  # (adding `tinyamount` not necessary for the Italy region)
        origin[1].astype(np.float32) + tinyamount
    )
```
## Summary

I also tried to express the just-sufficient threshold as a function of the corresponding `.eps` and `.resolution` properties of `np.finfo` for the used precision (e.g., `np.finfo(np.float64).eps`):
*Table 1*

| Region | Precision | `tinyamount` | as multiple of `.resolution` | as multiple of `.eps` |
|---|---|---|---|---|
| California | float64 | 1.0e-12 | 1000.0 | ~4400.0 |
| Italy | float64 | 1.4e-13 | 140.0 | ~630.0 |
| California | float32 | 3.1e-6 | 3.1 | ~26.0 |
| Italy | float32 | 1.6e-6 | 1.6 | ~13.0 |
Apparently, tinyamount not only depends on the region, but it also doesn't scale with the precision.
## Suggested solution(s)

Some options (with increasing personal preference):

1. add the required `tinyamount` to event coordinates;
2. use `bin1d_vec`'s `tol` argument, which is not used by the `get_index_of` wrapper;
3. introduce a relative tolerance factor (`tol_fact`) in `bin1d_vec`;
4. simplify the tolerance section of `bin1d_vec` by only adding a `tinyamount`.

Option 1 is only a temporary (quick-)fix.

For Option 2, note that `tol = tinyamount` is not sufficient due to the follow-up 'absolute tolerance' section. It needs to be slightly higher, e.g., `tol = tinyamount * 1.1` in our cases.

For Option 3, note that compared to `tol`, `tol_fact` would be relative. I tried to find the necessary thresholds again; they also don't scale with the precision of the floating-point format, so I don't see the benefit of this option.

For Option 4, we could revert the whole current tolerance section that was introduced in commit 949d342c to a prior state and extend it (keeping the `tol` argument):

```diff
- a0_tol = numpy.abs(a0) * numpy.finfo(numpy.float64).eps
- h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
- p_tol = numpy.abs(p) * numpy.finfo(numpy.float64).eps
-
- # absolute tolerance
- if tol is None:
-     idx = numpy.floor((p + (p_tol + a0_tol) - a0) / (h - h_tol))
- else:
-     idx = numpy.floor((p + (tol + a0_tol) - a0) / (h - h_tol))
+ tol = tol or 0
+
+ dtype = numpy.asarray(p).dtype
+ if dtype.name in ('float64', 'float32'):
+     float_tol = numpy.finfo(dtype).resolution * {
+         'float64': 2000,
+         'float32': 10,
+     }[dtype.name]
+     tol = max(tol, float_tol)
+
+ idx = numpy.floor((p + tol - a0) / h)
```
I added a safety factor of 2-3 compared to Table 1 just to be more on the safe side.
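As a self-contained sanity check of this logic, here is a hypothetical reduced helper (the real `bin1d_vec` operates on a bin-edge array, not a bare `(a0, h)` pair, and handles more cases):

```python
import numpy as np

def bin_index(p, a0, h, tol=None):
    """Reduced sketch of the patched tolerance logic shown in the diff above."""
    tol = tol or 0
    dtype = np.asarray(p).dtype
    if dtype.name in ('float64', 'float32'):
        float_tol = np.finfo(dtype).resolution * {'float64': 2000,
                                                  'float32': 10}[dtype.name]
        tol = max(tol, float_tol)
    return int(np.floor((p + tol - a0) / h))

# An edge point that plain floor((p - a0) / h) mis-bins:
print(int(np.floor((0.7 - 0.4) / 0.1)))  # 2 (one bin too low)
print(bin_index(0.7, a0=0.4, h=0.1))     # 3 (the expected bin)
```

With `a0 = 0.4` and `h = 0.1`, the point `0.7` sits exactly on the lower edge of bin 3; the dtype-dependent tolerance nudges it back onto the correct side of the edge.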
Btw: we could even always add `numpy.finfo(numpy.float32).resolution * 10 = 1e-5`, independent of the floating-point format, which is on the geographical order of ~1 meter. So instead of the above change, this one:

```diff
+ float_min_tol = 1e-5  # conservative (minimum) tolerance
+ tol = max(tol or 0, float_min_tol)
+ idx = numpy.floor((p + tol - a0) / h)
```

This would be the same as (and would replace!) setting/fixing `tol=0.00001` when binning the magnitude across various places in the pyCSEP code.

## The implication
If a catalog contains such "corner case" events, fixing this issue will change the test statistic – but only slightly.
Nevertheless, it would make previous results irreproducible (e.g., in my case for IGPE, at roughly the 3rd to 5th decimal digit).
Exactly such a slight difference (compared to an independent binning implementation) was how this issue was spotted in the first place.
## A personal note
Considering that event locations are sometimes very rough, it might have been a better idea to define grid cells so that their centers lie on rough coordinates (e.g., 13.5°), rather than their boundaries (13.45° / 13.55°); i.e., shift the grid by half a bin size in each direction.
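A hedged sketch of that idea (illustrative helper, not pyCSEP API): with cell centers anchored on round coordinates, assignment can round to the nearest center, so an event reported at a center is half a bin away from any decision boundary and immune to tiny numerical noise:

```python
def centered_bin_index(p, c0, h):
    # c0: coordinate of the first cell center, h: cell size;
    # nearest-center assignment instead of floor over edges
    return round((p - c0) / h)

# An event on a "round" coordinate sits at a cell center; even a perturbation
# far above float noise cannot flip the assignment:
p = 13.5 + 1e-9
print(centered_bin_index(p, c0=13.1, h=0.1))  # bin 4 (center 13.5)
```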
## P.S.

@khawajasim: Could `QuadtreeGrid2D._find_location` be affected by the same issue?