Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spatial binning doesn't properly account for floating point precision #255

Open
mherrmann3 opened this issue May 31, 2024 · 0 comments
Open

Comments

@mherrmann3
Copy link
Contributor

mherrmann3 commented May 31, 2024

I tested forecast models with a catalog whose events sometimes occur exactly on a cell boundary.
I found that region.get_index_of generally does not assign them the spatial bin according to what is expected from

"inclusive on the lower bound and exclusive on the upper bound." [utils.calc.bin1d_vec docstring]

but the opposite (i.e., the neighboring bin). Although this function considers a tolerance (numpy.finfo(numpy.float64).eps) to (supposedly) account for numerical precision, I found that it is not sufficient to get the desired binning behavior, even after the revision in commit 949d342c.

I believe the corresponding unit tests don't catch these literal "corner cases", as those tests are too constructed/synthetic and not based on real data such as a loaded CSEP region. Specifically, they do not consider issues due to double precision (float64).

Demo & Reasoning

1. The problem

Let's do a quick self check:

region = csep.core.regions.california_relm_region()

idx = 3
origin = region.origins()[idx]
print(idx, origin)

idx2 = region.get_index_of(*origin)
print(idx2, region.origins()[idx2])  # not the same as `idx` and `origin`

outputs:

3 [-125.4   40.4]
2 [-125.4   40.3]

Ooops. Not even a bin's origin is considered to be inside its own bin (but the neighboring one)!

Btw: the above output may represent the coordinates as more precise than they numerically are:

print(idx, [x for x in origin])

gives:

3 [-125.39999999999999, 40.400000000000006]

which highlights that our unit tests are currently too optimistic/simplistic.

2. The correction

Only when adding a tiny amount to the Latitude and Longitude of the event coordinates gives the desired behavior:

tinyamount = 1e-12

idx3 = region.get_index_of(
    origin[0] + tinyamount,
    origin[1] + tinyamount
)
print(idx3, region.origins()[idx3])  # same as `idx` and `origin`

outputs:

3 [-125.4   40.4]

Now we can do this self check for all origins in the region:

# tinyamount = 0  # uncomment for raising an Error
for i, origin in enumerate(region.origins()):

    assert i == region.get_index_of(
        origin[0] + tinyamount,
        origin[1] + tinyamount
    )

Only if (at least) tinyamount = 1e-12 is added, the loop runs without Error.

Side note: Actually, if tinyamount = 0, it wouldn't raise an AssertionError, but this ValueError in get_index_of for the first bin, because get_index_of tries to assign a bin that lies East of it (which doesn't exist and is outside the region). You can get an AssertionError like so:

tinyamount = 0
lons = set()
for i, origin in enumerate(region.origins()):

    if origin[0] not in lons:
        lons.add(origin[0])
        continue
    
    assert i == region.get_index_of(
        origin[0] + tinyamount,
        origin[1] + tinyamount
    )

3. The complexity

But this numerical issue gets slightly more complicated than that:

a) Influence of the region

The region influences (i) the tinyamount that is just sufficient to bin correctly, and (ii) which coordinate component(s) must be corrected. For instance for Italy, this is sufficient:

region = csep.core.regions.italy_csep_region()
tinyamount = 2e-13  # Note #1: smaller than for California region
for i, origin in enumerate(region.origins()):

    assert i == region.get_index_of(
        origin[0],  # Note #2: adding `tinyamount` to the Longitude NOT needed (!)
        origin[1] + tinyamount
    )

I don't know why, but it's certainly related to numerical representations of values.

b) Influence of the floating-point format/precision

The just-sufficient tinyamount also depends on the floating-point format (i.e., precision). For instance, assuming the events have single precision (float32), this is necessary:

tinyamount = 1.6e-6  # for Italy region
tinyamount = 3.1e-6  # for California region
for i, origin in enumerate(region.origins()):

    assert i == region.get_index_of(
        origin[0].astype(np.float32) + tinyamount,  # (adding `tinyamount` not necessary for Italy region)
        origin[1].astype(np.float32) + tinyamount
    )

Summary

I tried to find out the just-sufficient threshold also as function of the corresponding .eps and .resolution property of np.finfo for the used precision (e.g., np.finfo(np.float64).eps):

Region Precision tinyamount .resolution * .eps *
California float64 1.0e-12 1000.0 ~4400.0
Italy float64 1.4e-13 140.0 ~630.0
California float32 3.1e-6 3.1 ~26.0
Italy float32 1.6e-6 1.6 ~13.0

Apparently, tinyamount not only depends on the region, but it also doesn't scale with the precision.

Suggested solution(s)

Some options (with increasing personal preference):

  1. add the required tinyamount to event coordinates;
  2. use bin1d_vec's tol argument, which is not used by the get_index_of wrapper;
  3. revise the tolerance section of bin1d_vec;
  4. simplify the tolerance section of bin1d_vec by only adding a tinyamount.

Option 1 is only a temporary (quick-)fix.

For Option 2, note that tol = tinyamount is not sufficient due to the follow-up 'absolute tolerance' section. It needs to be slightly higher, e.g., tol = tinyamount * 1.1 in our cases.

For Option 3, we could do:

-    a0_tol = numpy.abs(a0) * numpy.finfo(numpy.float64).eps
-    h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
-    p_tol = numpy.abs(p) * numpy.finfo(numpy.float64).eps
+#    tol_fact = numpy.finfo(numpy.float64).eps  # (current)
+    tol_fact = # NEW THRESHOLD NEEDED
+    a0_tol = numpy.abs(a0) * tol_fact
+    h_tol = numpy.abs(h) * tol_fact
+    p_tol = numpy.abs(p) * tol_fact

Compared to tol, tol_fact is relative; I tried to find out the necessary thresholds again:

Region Precision tol_fact .resolution * .eps *
California float64 4.14e-15 4.14 18.7
Italy float64 1.74e-15 1.74 7.9
California float32 2.38e-8 0.0238 0.2
Italy float32 2.35e-8 0.0235 0.2

It also doesn't scale with the precision of the floating-point format, so I don't see the benefit of this option.

For Option 4, we could revert the whole current tolerance section that was introduced in commit 949d342c to a prior state and extend it (keeping the tol argument):

-    a0_tol = numpy.abs(a0) * numpy.finfo(numpy.float64).eps
-    h_tol = numpy.abs(h) * numpy.finfo(numpy.float64).eps
-    p_tol = numpy.abs(p) * numpy.finfo(numpy.float64).eps
-
-
-    # absolute tolerance
-    if tol is None:
-        idx = numpy.floor((p + (p_tol + a0_tol) - a0) / (h - h_tol))
-    else:
-        idx = numpy.floor((p + (tol + a0_tol) - a0) / (h - h_tol))
+    tol = tol or 0
+
+    dtype = np.asarray(p).dtype
+    if dtype.name in ('float64', 'float32'):
+        float_tol = numpy.finfo(dtype).resolution * {
+            'float64': 2000,
+            'float32': 10,
+        }[dtype.name]
+        tol = max(tol, float_tol)
+
+    idx = numpy.floor((p + tol - a0) / h)

I added a safety factor of 2-3 compared to Table 1 just to be more on the safe side.

Btw: We could even always add numpy.finfo(numpy.float32).resolution * 10 = 1e-5 independent of the floating-point format, which is on the geographical order of ~1 meter. So instead of the above change, this one:

+    float_min_tol = 1e-5  # conservative (minimum) tolerance
+    tol = max(tol or 0, float_min_tol)
+    idx = numpy.floor((p + tol - a0) / h)

This would be the same as (and replace!) setting/fixing tol=0.00001 when binning the magnitude across various places in pyCSEP code.

The implication

If a catalog contains such "corner case" events, fixing this issue will change the test statistic – but only slightly.
Nevertheless, it would lead to irreproducible previous results (e.g., in my case for IGPE, ~at the 3 to 5th decimal digit).
Exactly such slight difference (compared to an independent binning implementation) was the reason for spotting this issue.



A personal note

Considering that event locations are sometimes very rough, it might have been a better idea to define grid cells so that their centers lie on rough coordinates (e.g., 13.5°), rather than their boundaries (13.45° / 13.55°); i.e., shift the grid by half a bin size in each direction.


P.S.

@khawajasim: Could QuadtreeGrid2D._find_location be affected by the same issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant