
Bug: Inconsistencies in Cosine Distance Calculation for Near-Zero Length Vectors #320

Open

tos-kamiya opened this issue Dec 12, 2023 · 16 comments
Labels: bug (Something isn't working), invalid (This doesn't seem right)

Comments

tos-kamiya commented Dec 12, 2023

Describe the bug

I have observed inconsistent cosine distance values when working with vectors of very small magnitudes. When adding vectors to the index and then performing a search, the cosine distances between a query vector and these near-zero vectors varied unpredictably, showing values of 0, 1, or infinity.

Environment Details:

  • Usearch Version: 2.8.14
  • Python Version: 3.10.12
  • OS: Ubuntu 22.04.3 LTS
  • CPU: Intel(R) Xeon(R) W-2235 CPU @ 3.80GHz

Steps to reproduce

  1. Create an index with 2-dimensional vectors and 'cos' metric.
  2. Add vectors of the form [1.0e-x, 0.0] for x ranging from 10 to 29 to the index.
  3. Perform a search with the query vector [1.0, 0.0].

Code Example:

import ast
import numpy as np
from usearch.index import Index, Matches

index = Index(ndim=2, metric='cos', dtype='f32')

kvs = {}
for k in range(10, 30):
    v = np.array([ast.literal_eval("1.0e-%d" % k), 0.0])
    kvs[k] = v
    index.add(k, v)

query = np.array([1.0, 0.0])

matches: Matches = index.search(query, 100)
for m in matches:
    print("distance %s to %s %s: %g" % (query, m.key, kvs[m.key], m.distance))

Execution Example:

distance [1. 0.] to 10 [1.e-10 0.e+00]: 0
distance [1. 0.] to 11 [1.e-11 0.e+00]: 0
...
distance [1. 0.] to 29 [1.e-29 0.e+00]: 1
distance [1. 0.] to 19 [1.e-19 0.e+00]: inf
...

Expected behavior

Consistent and predictable cosine distance values for vectors, regardless of their magnitude.

USearch version

2.8.14 (Python bindings)

Operating System

Ubuntu 22.04.3

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@tos-kamiya added the bug (Something isn't working) and invalid (This doesn't seem right) labels on Dec 12, 2023
@ashvardanian (Contributor)

@tos-kamiya, thanks for opening the issue! Can you please attach the output of print(index)? It will show details relevant for diagnostics.

I'd also recommend using double-precision math if you want to work with such values; simply try passing "f64" in the constructor 🤗

@tos-kamiya (Author)

Thank you for your prompt response to my previous message. Following your suggestion, I have conducted further tests and am providing the requested details below.

The output of print(index) using the original code (f32 data type) is as follows:

usearch.Index(ScalarKind.F32 x 2, MetricKind.Cos, connectivity: 16, expansion: 128 & 64, 20 vectors in 2 levels, avx512 hardware acceleration)

I modified the code to use double precision (f64), changing the vector magnitudes to range from 1.0e-140 down to 1.0e-179. The results were as follows:

  • With this higher precision, distances for vectors in the range 1.0e-140 to 1.0e-154 consistently returned 0.
  • Vectors from 1.0e-163 to 1.0e-179 yielded distances of 1.
  • Other vectors in between these ranges showed varying negative or positive distances close to 0.

Modified code:

import ast

import numpy as np
from usearch.index import Index, Matches

index = Index(
    ndim=2, # Define the number of dimensions in input vectors
    metric='cos', # Choose 'l2sq', 'haversine' or other metric, default = 'ip'
    dtype='f64', # Quantize to 'f16' or 'i8' if needed, default = 'f32'
)

kvs = {}
for k in range(140, 180):
    v = np.array([ast.literal_eval("1.0e-%d" % k), 0.0])
    kvs[k] = v
    index.add(k, v)

query = np.array([1.0, 0.0])

matches: Matches = index.search(query, 100)
for m in matches:
    print("distance %s to %s %s: %g" % (query, m.key, kvs[m.key], m.distance))

print(matches)

print(index)

Output:

distance [1. 0.] to 161 [1.e-161 0.e+000]: -0.00598771
distance [1. 0.] to 160 [1.e-160 0.e+000]: -5.56646e-06
distance [1. 0.] to 159 [1.e-159 0.e+000]: -6.25753e-07
distance [1. 0.] to 158 [1.e-158 0.e+000]: -8.17014e-09
distance [1. 0.] to 157 [1.e-157 0.e+000]: -1.80596e-11
distance [1. 0.] to 156 [1.e-156 0.e+000]: -7.67386e-13
distance [1. 0.] to 155 [1.e-155 0.e+000]: -1.55431e-15
distance [1. 0.] to 140 [1.e-140 0.e+000]: 0
distance [1. 0.] to 141 [1.e-141 0.e+000]: 0
...
distance [1. 0.] to 153 [1.e-153 0.e+000]: 0
distance [1. 0.] to 154 [1.e-154 0.e+000]: 0
distance [1. 0.] to 179 [1.e-179 0.e+000]: 1
distance [1. 0.] to 178 [1.e-178 0.e+000]: 1
...
distance [1. 0.] to 171 [1.e-171 0.e+000]: 1
distance [1. 0.] to 163 [1.e-163 0.e+000]: 1
usearch.Matches(40)
usearch.Index(ScalarKind.F64 x 2, MetricKind.Cos, connectivity: 16, expansion: 128 & 64, 40 vectors in 2 levels, avx512 hardware acceleration)

I hope this additional information helps in diagnosing the issue.

Thank you once again for your assistance.

@ashvardanian (Contributor)

@tos-kamiya thank you! Now the question is - how far do we want to go to achieve numerical stability.
It’s definitely possible, but may sacrifice accuracy. Coincidentally, a similar issue was raised today in SimSIMD. The issue originates here for f32 and here for f64.

Any chance you can help tune those epsilon params? That would benefit up to a million devices running those libraries 🤗

@ashvardanian self-assigned this on Dec 12, 2023
@tos-kamiya (Author)

I've been reflecting on the discussion about the behavior of cosine distances with small norm vectors. Here are some thoughts:

  • Inf Values in f32 Computations: It seems that the inf values are likely due to dividing by zero when normalizing small vectors. When the norm of a vector is essentially zero, the normalization process leads to division by zero, resulting in inf values.

  • Handling Zero-Norm Vectors: For vectors that effectively have a zero norm, it might be more practical to directly set the cosine value (inner product) to zero instead of performing normalization that leads to inf values. This approach could prevent the generation of vectors with inf values.

  • Behavior in f64 Computations: The jump from 0 to 1 in cosine distances as vectors get smaller in f64 seems consistent. As the distance increases, it implies less similarity, which aligns with the expected behavior in practical applications.

  • Documentation Suggestion: I recommend documenting that for indexes with a cos metric, it's advisable not to add vectors with small norms to avoid these issues.

  • Epsilon Parameter: While I'm not entirely clear about the epsilon parameter you mentioned, introducing an eps parameter in the Index constructor could be a solution. This parameter would set the cosine to zero for vectors with norms smaller than eps, ensuring stability in calculations.
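The zero-norm guard suggested in these bullets can be sketched as follows. This is a minimal illustration, not USearch's or SimSIMD's actual code, and the eps value of 1e-30 is a hypothetical threshold chosen only for the demo:

```python
import numpy as np

def cos_distance_guarded(a, b, eps=1e-30):
    """Cosine distance with a guard for near-zero norms.

    If either squared norm falls below the (hypothetical) `eps` threshold,
    the vector has no meaningful direction, so we report the maximum
    distance instead of dividing by (near-)zero.
    """
    a2 = float(np.dot(a, a))
    b2 = float(np.dot(b, b))
    if a2 < eps or b2 < eps:
        return 1.0  # zero-length vector: no angle to speak of
    return 1.0 - float(np.dot(a, b)) / (np.sqrt(a2) * np.sqrt(b2))

q = np.array([1.0, 0.0])
print(cos_distance_guarded(q, np.array([1e-20, 0.0])))  # 1.0 (guard fires: 1e-40 < eps)
print(cos_distance_guarded(q, np.array([1e-5, 0.0])))   # 0.0 (parallel vectors)
```

With the guard, results vary monotonically with magnitude instead of jumping between 0, 1, and infinity.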

I hope these observations and suggestions are helpful for addressing the issue.

@ashvardanian (Contributor)

Hi @tos-kamiya! Documenting the expected behavior is indeed important; I'll try to improve there. Introducing an additional runtime epsilon parameter to distance functions is, however, infeasible: all of the metrics must have identical signatures. We just need to find the smallest constant that results in a non-zero square root and use it as an epsilon.

@turian commented Dec 18, 2023

@ashvardanian would it be a problem to make the choice of epsilon data-dependent? And use strong heuristics to determine which particular epsilon to use? (This could perhaps be determined by doing some brute-force/fuzz-testing-style tests.)

@ashvardanian (Contributor)

Yes, @turian, that might be problematic. We try to perform all the computations in a single forward pass over the vectors; introducing a second pass to analyze the data would be too costly. I believe we only need to tune that constant a bit, reverse-engineering the minimum value that, after the rsqrt, won't result in inf.

@lifthrasiir
After some fiddling, I believe this bug has nothing to do with the epsilon in SimSIMD and is actually a bigger bug in disguise: most SimSIMD functions are not actually used, because the autovectorized functions can override them due to a wrong conditional. Therefore what's actually being called is metric_cos_gt::operator(), which lacks any guard against near-zero norms. This can also be verified by comparing the reported distance with direct outputs from the simsimd package. Fixing either one---metric_cos_gt or metric_punned_t---does give a reasonable result.

@ashvardanian (Contributor)

@lifthrasiir it depends on how you compile the package. In Python, you can use index.hardware_acceleration to check if SimSIMD is used. On most modern hardware that should be the case.

Are you compiling from source? What hardware are you using? Have you passed the compilation parameter to enable SimSIMD? Check CONTRIBUTING.md for details on those.

@lifthrasiir
My test environment was:

  • Xeon Gold 6348H @ 2.38GHz
  • Ubuntu 20.04.6 LTS
  • Python 3.10.13

Under this environment I confirmed that index.hardware_acceleration indeed equals "avx512", which should mean USEARCH_USE_SIMSIMD has been correctly set.

Of course this doesn't actually mean that the corresponding SimSIMD code is being used---I did confirm this by slightly altering the source code and running pip install -e . as suggested. (The current SimSIMD doesn't actually compile cleanly on Ubuntu 20.04, so I had to manually comment out some f16 functions. This shouldn't affect the outcomes for f32 or f64, though.)

@turian commented Dec 29, 2023

@ashvardanian I was talking about a batch, offline, developer-only step, not a second pass during inference. Almost like a fuzz-testing suite, but one that generates appropriate float heuristics for the codebase.

@ashvardanian (Contributor)

@turian, that's worth considering. Let me know if there is a PoC solution you can contribute 🤗

@lifthrasiir
I guess my comment was too succinct to illustrate the evident problem, so I'll elaborate:

    inline metric_punned_t(                                //
        std::size_t dimensions,                            //
        metric_kind_t metric_kind = metric_kind_t::l2sq_k, //
        scalar_kind_t scalar_kind = scalar_kind_t::f32_k) noexcept
        : raw_arg3_(dimensions), raw_arg4_(dimensions), dimensions_(dimensions), metric_kind_(metric_kind),
          scalar_kind_(scalar_kind) {

#if USEARCH_USE_SIMSIMD
        if (!configure_with_simsimd())
            configure_with_auto_vectorized();
#endif
        configure_with_auto_vectorized();

        if (scalar_kind == scalar_kind_t::b1x8_k)
            raw_arg3_ = raw_arg4_ = divide_round_up<CHAR_BIT>(dimensions_);
    }

I hope it's clear enough that configure_with_auto_vectorized can get called twice when USEARCH_USE_SIMSIMD is defined, and that the unconditional call will always overwrite raw_ptr_.

Even without knowing this fact beforehand (as when I started out), distance([1, 0], [1e-10, 0]) returning 0 should already have been suspicious. If the SimSIMD routine were indeed in use, the calculation would go like:

ab = 1 * 1e-10 = 1e-10
a2 = 1 * 1 = 1
b2 = 1e-10 * 1e-10 ≈ 1e-20
rsqrt_a2 = rsqrt14(1 + 1e-9) ≈ rsqrt14(1)
rsqrt_b2 = rsqrt14(1e-20 + 1e-9) ≈ rsqrt14(1e-9)
distance ≈ 1 - 1e-10 * rsqrt14(1) * rsqrt14(1e-9) ≈ 1 - 1e-10 * 3.16e4 ≈ 1

In other words, there is absolutely no reason for this to return 0 or infinity, even in the presence of floating-point errors, because the epsilon should have masked near-zero values out. I have also manually verified that the rsqrt14 approximation itself can't cause this, because its implementation is independent of the binary exponent (e.g. rsqrt14(1.5) and rsqrt14(6) would do identical calculations). That's why I came to consider the possibility that this code was not actually used at all.
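The arithmetic above can be checked in plain Python. This is a sketch of the same computation, with math.sqrt standing in for the rsqrt14 approximation (a substitution that cannot change the orders of magnitude involved) and 1e-9 standing in for the SimSIMD-style epsilon referenced above:

```python
import math

eps = 1e-9  # stand-in for the SimSIMD epsilon discussed above
a = [1.0, 0.0]
b = [1e-10, 0.0]

ab = sum(x * y for x, y in zip(a, b))  # 1e-10
a2 = sum(x * x for x in a)             # 1.0
b2 = sum(x * x for x in b)             # 1e-20

# With the epsilon under the square roots, the tiny norm is masked out
# and the result stays finite and close to 1.
distance = 1.0 - ab / (math.sqrt(a2 + eps) * math.sqrt(b2 + eps))
print(distance)  # ~0.9999968: near 1, never 0 or inf
```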

In comparison, the metric_cos_gt code would just return 1 - ab / (sqrt(a2) * sqrt(b2)) = 1 - 1e-10 / (sqrt(1) * sqrt(1e-20)) ≈ 1 - 1e-10 / 1e-10 = 0, as observed. It is even possible for sqrt(a2) * sqrt(b2) to underflow while neither a2 nor b2 is (yet) zero, especially when -ffast-math is set and FTZ/DAZ might be in effect, so it can also return infinities of either sign depending on the sign of ab. (In my opinion the whole zero-detection code is flawed for this reason.)
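Both failure modes of the unguarded formula are easy to reproduce in float32, matching the f32 index from the original report. This is a sketch mirroring 1 - ab / (sqrt(a2) * sqrt(b2)), not the actual metric_cos_gt code:

```python
import numpy as np

np.seterr(divide='ignore')  # we provoke a division by zero below

def cos_distance_unguarded(a, b):
    """Unguarded cosine distance, as in 1 - ab / (sqrt(a2) * sqrt(b2))."""
    ab = np.dot(a, b)
    a2 = np.dot(a, a)
    b2 = np.dot(b, b)
    return float(1.0 - ab / (np.sqrt(a2) * np.sqrt(b2)))

q = np.array([1.0, 0.0], dtype=np.float32)
d1 = cos_distance_unguarded(q, np.array([1e-10, 0.0], dtype=np.float32))
d2 = cos_distance_unguarded(q, np.array([1e-29, 0.0], dtype=np.float32))
print(d1)  # 0.0: (1e-10)^2 = 1e-20 still survives in float32, so ab cancels exactly
print(d2)  # an infinity: (1e-29)^2 underflows to 0 in float32, so ab / 0 blows up
```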

@ashvardanian (Contributor)

Oh, thank you, @lifthrasiir - I didn't realize you meant the macro-conditional, my bad 🤦‍♂️ Would you like to author a patch PR since you are the one who noticed that?

@lifthrasiir
@ashvardanian Glad it worked this time :-p I think the fix will have a performance impact, which may well be negative, in addition to obvious compatibility issues (for example, a saved database will probably no longer work, I guess?). I can point out evident bugs but can't make these decisions---I don't even use usearch myself---so it's probably better for you to fix this and make all the calls.

(Aside: did you check that -ffast-math is correctly localized to the usearch library proper, by the way? glibc apparently enables FTZ and DAZ as part of its C runtime, which can have a global effect AFAIK. And my cursory reading didn't find any complex calculation that hasn't already been rewritten and could benefit from -ffast-math.)

@ashvardanian (Contributor)

@lifthrasiir, there shouldn't be compatibility or any other issues. This class was significantly refactored in recent releases, which is when the macro condition broke. Prior to that, it definitely worked, and we have enough benchmark coverage in SimSIMD to suggest improvements over autovectorized code.

As for fast-math settings, I agree, that with SimSIMD back ON, there shouldn’t be anything left to gain from that flag 🤗

@ashvardanian mentioned this issue on Jan 9, 2024
ashvardanian added a commit that referenced this issue Jan 24, 2024
ashvardanian pushed a commit that referenced this issue Jan 24, 2024
## [2.8.16](v2.8.15...v2.8.16) (2024-01-24)

### Docs

* Downloads numbers ([13cc624](13cc624))

### Fix

* SimSIMD dispatch ([c8dc3b7](c8dc3b7)), closes [#320](#320)

### Make

* Fix Node version for SemVer ([a68fca6](a68fca6))
* Upgrade dependencies ([6ab2150](6ab2150))
* Upgrade Node environment ([f5b6750](f5b6750))
4 participants