Bug: Inconsistencies in Cosine Distance Calculation for Near-Zero Length Vectors #320
Comments
@tos-kamiya, thanks for opening the issue! Can you please attach the outputs of I'd also recommend to use double-precision math if you want to work with such values, simply try passing |
Thank you for your prompt response to my previous message. Following your suggestion, I have conducted further tests and am providing the requested details below. The output of print(index) using the original code (f32 data type) is as follows:
I modified the code to use double-precision (f64), changing the range of the vectors to 1.0e-140 to 1.0e-180. The results were as follows:
Modified code:

    import ast
    import numpy as np
    from usearch.index import Index, Matches

    index = Index(
        ndim=2,  # Define the number of dimensions in input vectors
        metric='cos',  # Choose 'l2sq', 'haversine' or other metric, default = 'ip'
        dtype='f64',  # Quantize to 'f16' or 'i8' if needed, default = 'f32'
    )

    kvs = {}
    for k in range(140, 180):
        v = np.array([ast.literal_eval("1.0e-%d" % k), 0.0])
        kvs[k] = v
        index.add(k, v)

    query = np.array([1.0, 0.0])
    matches: Matches = index.search(query, 100)
    for m in matches:
        print("distance %s to %s %s: %g" % (query, m.key, kvs[m.key], m.distance))

    print(matches)
    print(index)

Output:
I hope this additional information helps in diagnosing the issue. Thank you once again for your assistance. |
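For readers following along: one way tiny magnitudes can break a textbook cosine distance even in f64 is underflow of the squared norm. The sketch below is plain Python and independent of USearch; `naive_cosine_distance` and `rescaled_cosine_distance` are hypothetical helper names, not library functions, and returning NaN for a zero denominator is just one possible convention.

```python
import math


def naive_cosine_distance(a, b):
    """Textbook formula: 1 - dot / (|a| * |b|), computed from squared norms."""
    dot = sum(x * y for x, y in zip(a, b))
    a2 = sum(x * x for x in a)  # (1e-170)**2 underflows to 0.0 in f64
    b2 = sum(y * y for y in b)
    denom = math.sqrt(a2) * math.sqrt(b2)
    if denom == 0.0:
        return math.nan  # assumption: report "undefined" instead of dividing by zero
    return 1.0 - dot / denom


def rescaled_cosine_distance(a, b):
    """Divide each vector by its largest component first, so squares cannot underflow."""
    scale_a = max(abs(x) for x in a)
    scale_b = max(abs(y) for y in b)
    if scale_a == 0.0 or scale_b == 0.0:
        return math.nan  # a true zero vector has no direction
    a = [x / scale_a for x in a]
    b = [y / scale_b for y in b]
    return naive_cosine_distance(a, b)


if __name__ == "__main__":
    query = [1.0, 0.0]
    for k in (140, 170):
        v = [float("1.0e-%d" % k), 0.0]
        print(k, naive_cosine_distance(v, query), rescaled_cosine_distance(v, query))
```

The rescaling costs an extra pass over each vector, so it is shown only to make the underflow visible, not as a suggested change to the library.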
@tos-kamiya thank you! Now the question is: how far do we want to go to achieve numerical stability? Any chance you can help tune those epsilon params? That would help up to a million devices running those libraries 🤗
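To make the epsilon discussion concrete, here is a minimal sketch of an epsilon-guarded cosine distance in plain Python. The placement of the epsilon (added to each squared norm) and the value `1e-9` are assumptions chosen for illustration; the thread does not show the constants or the exact formula used by the library, and `guarded_cosine_distance` is a hypothetical name.

```python
import math


def guarded_cosine_distance(a, b, eps=1e-9):
    """Cosine distance with eps added to each squared norm (assumed form)."""
    dot = sum(x * y for x, y in zip(a, b))
    a2 = sum(x * x for x in a)
    b2 = sum(y * y for y in b)
    # The eps terms keep the denominator strictly positive, so the result is
    # always finite; vectors much shorter than sqrt(eps) are pushed towards 1.
    return 1.0 - dot / math.sqrt((a2 + eps) * (b2 + eps))


if __name__ == "__main__":
    query = [1.0, 0.0]
    print(guarded_cosine_distance([1.0, 0.0], query))     # ordinary vector
    print(guarded_cosine_distance([1e-170, 0.0], query))  # near-zero vector
    print(guarded_cosine_distance([0.0, 0.0], query))     # exact zero vector
```

With a guard of this shape the trade-off is visible directly: a larger epsilon is safer against division by zero but treats more short vectors as if they were zero.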
I've been reflecting on the discussion about the behavior of cosine distances with small norm vectors. Here are some thoughts:
I hope these observations and suggestions are helpful for addressing the issue. |
Hi @tos-kamiya! Documenting the expected behavior is indeed important, I'll try to improve there. Introducing an additional runtime |
@ashvardanian would it be a problem to make the choice of epsilon data-dependent? And use strong heuristics to determine which particular epsilon to use? (This could perhaps be determined by doing some brute-force/fuzz-testing-style tests.) |
Yes, @turian, that might be problematic. We try to perform all the computations in a single forward pass over the vectors. Introducing the second pass to analyze the data would be too costly. I believe we only need to tune that constant a bit, reverse-engineering the minimum value, that after the |
After some fiddling, I believe this bug has nothing to do with |
@lifthrasiir it depends on how you compile the package. In Python, you can use Are you compiling from source? What hardware are you using? Have you passed the compilation parameter to enable SimSIMD? Check CONTRIBUTING.md for details on those. |
My test environment was:
Under this environment I confirmed that Of course this doesn't actually mean that the corresponding SimSIMD code is being used---I did confirm this by slightly altering the source code and running |
@ashvardanian I was talking about a batch offline developer-only step, not a second pass during inference. Almost like a fuzz testing suite, but that generates appropriate float heuristics for the codebase. |
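A rough sketch of what such an offline, developer-only sweep could look like, in plain Python. Everything here is hypothetical: the epsilon-guarded formula is an assumed form rather than the library's kernel, and `first_failing_exponent`, the tolerance, and the candidate epsilons are illustrative choices only.

```python
import math


def guarded_cosine_distance(a, b, eps):
    """Cosine distance with eps added to each squared norm (assumed form)."""
    dot = sum(x * y for x, y in zip(a, b))
    a2 = sum(x * x for x in a)
    b2 = sum(y * y for y in b)
    return 1.0 - dot / math.sqrt((a2 + eps) * (b2 + eps))


def first_failing_exponent(eps, tol=1e-3, max_exponent=40):
    """Smallest k for which [10**-k, 0] vs [1, 0] is off by more than tol.

    Both vectors are parallel, so the true cosine distance is 0.
    Returns None if every exponent up to max_exponent stays within tol.
    """
    query = [1.0, 0.0]
    for k in range(max_exponent + 1):
        vector = [10.0 ** -k, 0.0]
        if abs(guarded_cosine_distance(vector, query, eps)) > tol:
            return k
    return None


if __name__ == "__main__":
    # Offline sweep: report where each candidate epsilon starts to distort results.
    for eps in (1e-6, 1e-9, 1e-12, 1e-20):
        print("eps=%g first failing exponent: %s" % (eps, first_failing_exponent(eps)))
```

A real version would presumably run against the compiled kernels for each scalar type and draw random directions instead of a single axis-aligned vector.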
@turian, that's worth considering. Let me know if there is a PoC solution you can contribute 🤗
I guess my comment was too succinct to illustrate the evident problem, so I'll elaborate:

    inline metric_punned_t( //
        std::size_t dimensions, //
        metric_kind_t metric_kind = metric_kind_t::l2sq_k, //
        scalar_kind_t scalar_kind = scalar_kind_t::f32_k) noexcept
        : raw_arg3_(dimensions), raw_arg4_(dimensions), dimensions_(dimensions), metric_kind_(metric_kind),
          scalar_kind_(scalar_kind) {
    #if USEARCH_USE_SIMSIMD
        if (!configure_with_simsimd())
            configure_with_auto_vectorized();
    #endif
        configure_with_auto_vectorized();
        if (scalar_kind == scalar_kind_t::b1x8_k)
            raw_arg3_ = raw_arg4_ = divide_round_up<CHAR_BIT>(dimensions_);
    }

I hope it's clear enough that Even when you don't have this fact beforehand (like, when I started out),
In other words, there is absolutely no reason for this to return 0 or infinity, even in the presence of floating-point errors, because the epsilon should have masked near-zero values out. I have also manually verified that the rsqrt14 approximation itself can't cause this, because its implementation is independent of the binary exponent (e.g. rsqrt14(1.5) and rsqrt14(6) would do identical calculations). That's why I came to consider the possibility that this code was not actually used at all. In comparison, the |
Oh, thank you, @lifthrasiir - I didn't realize you meant the macro-conditional, my bad 🤦‍♂️ Would you like to author a patch PR since you are the one who noticed that?
@ashvardanian Glad it worked this time :-p I think it will have a performance impact, which may well be negative, in addition to obvious incompatibility issues (for example, a saved database will probably no longer work, I guess?). I can point out evident bugs but can't decide what to do about them (I don't even use usearch myself), so it's probably better for you to fix this and make all the decisions. (Aside: did you check if |
@lifthrasiir, there shouldn’t be compatibility or any other issues. This class was significantly refactored in the last releases, which is when the macro condition was broken. Prior to this, it definitely worked, and we have enough benchmark coverage in SimSIMD to suggest improvements over autovectorized code. As for fast-math settings, I agree that with SimSIMD back ON, there shouldn’t be anything left to gain from that flag 🤗
## [2.8.16](v2.8.15...v2.8.16) (2024-01-24)

### Docs

* Downloads numbers ([13cc624](13cc624))

### Fix

* SimSIMD dispatch ([c8dc3b7](c8dc3b7)), closes [#320](#320)

### Make

* Fix Node version for SemVer ([a68fca6](a68fca6))
* Upgrade dependencies ([6ab2150](6ab2150))
* Upgrade Node environment ([f5b6750](f5b6750))
Describe the bug
I have observed inconsistent cosine distance values when working with vectors of very small magnitudes. When adding vectors to the index and then performing a search, the cosine distances between a query vector and these near-zero vectors varied unpredictably, showing values of 0, 1, or infinity.
Environment Details:
Steps to reproduce
Add vectors of the form [1.0e-x, 0.0] for x ranging from 10 to 29 to the index.
Search the index with the query vector [1.0, 0.0].
Code Example:
Execution Example:
Expected behavior
Consistent and predictable cosine distance values for vectors, regardless of their magnitude.
USearch version
2.8.14 (Python bindings)
Operating System
Ubuntu 22.04.3
Hardware architecture
x86
Which interface are you using?
Python bindings
Contact Details
[email protected]
Is there an existing issue for this?
Code of Conduct