
Implement distance metric selection #35

Open · pierd wants to merge 3 commits into main from the distance-metrics branch

Conversation

pierd commented on Feb 9, 2023

Addresses #20

Changes

  • adds a separate distance-metrics crate with Euclid and Cosine metrics
  • adds a distance_metric field to Config and a DistanceMetric enum
  • adds benchmarks for the metric implementations and for building an HNSW from Python
  • turns FloatArray into a generic struct taking the metric as a type parameter
  • introduces HnswWithMetric and HnswMapWithMetric intermediate enums to hide the generic parameters from the classes exposed to Python and to keep dispatch static internally (a rough sketch of the overall shape follows below)

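As a rough sketch of the overall shape of these changes (an illustration only, not the PR's actual code; the trait and method signatures below are assumptions):

    use std::marker::PhantomData;

    // Assumed trait shape: a metric knows how to compare two float slices.
    trait Metric {
        fn distance(a: &[f32], b: &[f32]) -> f32;
    }

    struct EuclidMetric;

    impl Metric for EuclidMetric {
        fn distance(a: &[f32], b: &[f32]) -> f32 {
            a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
        }
    }

    // The point type carries its metric as a zero-sized type parameter,
    // so the metric call is statically dispatched.
    struct FloatArray<M> {
        values: Vec<f32>,
        _metric: PhantomData<M>,
    }

    impl<M: Metric> FloatArray<M> {
        fn distance(&self, other: &Self) -> f32 {
            M::distance(&self.values, &other.values)
        }
    }

    // The Python-facing selection stays a plain enum stored on Config.
    enum DistanceMetric {
        Euclid,
        Cosine,
    }
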
Benchmark

The only noticeable change (a regression) comes from switching from the original SIMD implementation to the qdrant one. Time spent in the Python integration benchmark grows by ~12% (175 msec -> 196 msec), which roughly corresponds to the +19% growth in the distance-metrics microbenchmark (21 ns -> 25 ns).

Note that these benchmarks are very simple and most likely do not reflect real-world cases. It would be interesting to see the results for aarch64, since previously we didn't have a SIMD implementation for it.

Distance metrics

Distance metrics microbenchmark results (on AMD Ryzen 9 5900HS 3.30 GHz):

test legacy               ... bench:          21 ns/iter (+/- 0)
test non_simd             ... bench:         183 ns/iter (+/- 2)
test simple<CosineMetric> ... bench:          22 ns/iter (+/- 0)
test simple<EuclidMetric> ... bench:          25 ns/iter (+/- 0)

Python integration

| version | result |
| --- | --- |
| original (before any changes) | 10 loops, best of 5: 175 msec per loop |
| after refactoring for enum dispatch to different metrics (still using the same distance metric implementation) | 10 loops, best of 5: 175 msec per loop |
| after switching to qdrant's Euclid metric implementation | 10 loops, best of 5: 196 msec per loop |
| after adding the Cosine metric | 10 loops, best of 5: 196 msec per loop |

Known issues

  • Since HnswWithMetric and HnswMapWithMetric are introduced, the serialization format has changed. Objects serialized with the previous version won't work with this one.

distance-metrics/src/simple.rs: outdated review thread (resolved)
hartshorne requested a review from djc on February 9, 2023

djc (Collaborator) left a comment:

Hey, nice work! I did a first round of review, but there's a lot here to digest. I'm most curious about the benchmarking results, I'm guessing the interposition of Metric in its current state would cause performance to regress substantially.

FYI, we usually like to rebase-merge atomic commits. For example, the way I would submit this would probably be to do the refactoring first, then add the other SIMD implementations (one commit per architecture) for the euclid stuff, then add one new metric per commit. No need to do that now if you prefer to keep it like this (since doing it retroactively can be a bit of a pain), although it might still be nice to squash all the follow-up commits into the first one.

@@ -1,5 +1,5 @@
 [workspace]
-members = ["instant-distance", "instant-distance-py"]
+members = ["distance-metrics", "instant-distance", "instant-distance-py"]

djc (Collaborator):

Curious what your motivation for adding a new crate for this is?

pierd (Author):

It's separate mostly because I couldn't get the benchmark to access instant-distance-py internals, and the metrics felt too specific for the instant-distance crate.

pierd (Author):

I gave this another try. There are two things that prevent me from adding benches directly to instant-distance-py:

  1. the crate type is cdylib - that one is easy, we could have two crate types (cdylib and a regular lib),
  2. the crate name is instant-distance, just like the other crate - AFAIK that makes it impossible to use both crates at the same time.

djc (Collaborator):

Okay, let's just change the crate name to instant-distance-py in the Rust metadata (but should verify that the Python name will still be instant-distance), and then if we add the extra type we wouldn't need the extra crate anymore, right?

pierd (Author):

It should be possible if we add a tiny Python wrapper module; that's needed to force maturin to build a wheel for a mixed Rust/Python project. Without the wrapper, maturin uses lib.name as the top-level module name.

simple::<CosineMetric>
);

fn legacy(bench: &mut Bencher) {

djc (Collaborator):

Would be curious to see some benchmarking results in this PR for the legacy approach vs your new implementations!

pierd (Author):

The most recent results:

test legacy               ... bench:          21 ns/iter (+/- 0)
test non_simd             ... bench:         183 ns/iter (+/- 2)
test simple<CosineMetric> ... bench:          22 ns/iter (+/- 0)
test simple<EuclidMetric> ... bench:          25 ns/iter (+/- 0)

djc (Collaborator):

So I think legacy is euclid except it doesn't bother taking the square root, right? If it's still 4ns faster than Euclid, we should maybe expose it under a name other than EuclidMetric, as I think we'd want to stick to it (since the difference between the square root and the squared value doesn't usually matter for our use case?).

Did you run any tests to show that the platform-specific implementations have the same result (or close to it) as the non-SIMD implementation?

pierd (Author):

Yes, legacy is Euclid but without the square root. I believe we don't care about having squared values. If I drop the square-root calculation then they are almost the same (a 2-5% difference):

test legacy               ... bench:          54 ns/iter (+/- 6)
test metric<CosineMetric> ... bench:          56 ns/iter (+/- 11)
test metric<EuclidMetric> ... bench:          55 ns/iter (+/- 6)
test non_simd             ... bench:         441 ns/iter (+/- 40)

(the numbers are higher because I'm on battery and the CPU is throttled)

We could use that, since it doesn't have the dimension limitation. Another idea would be to keep the current metric (as, for example, Euclid300) for the 300-dimension use case.

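To make the square-root point concrete, here is a minimal non-SIMD sketch (my own illustration, not code from the PR): because sqrt is monotonic, ranking neighbours by squared distance gives the same order as ranking by the true Euclidean distance, so the root can be skipped on the hot path.

    fn squared_euclid(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    }

    fn euclid(a: &[f32], b: &[f32]) -> f32 {
        squared_euclid(a, b).sqrt()
    }

    fn main() {
        let q = [0.0_f32, 0.0];
        let p1 = [1.0_f32, 1.0];
        let p2 = [2.0_f32, 2.0];
        // Both variants agree on which point is closer to q.
        assert!(squared_euclid(&q, &p1) < squared_euclid(&q, &p2));
        assert!(euclid(&q, &p1) < euclid(&q, &p2));
    }
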
distance-metrics/src/legacy.rs: outdated review thread (resolved)
distance-metrics/src/lib.rs: outdated review thread (resolved)
distance-metrics/src/metric.rs: outdated review thread (resolved)
instant-distance-py/src/lib.rs: outdated review thread (resolved)
Comment on lines 47 to 57
#[derive(Deserialize, Serialize)]
enum HnswMapWithMetric {
    Euclid(instant_distance::HnswMap<FloatArray<EuclidMetric>, MapValue>),
    Cosine(instant_distance::HnswMap<FloatArray<CosineMetric>, MapValue>),
    DotProduct(instant_distance::HnswMap<FloatArray<DotProductMetric>, MapValue>),
}

// Helper macro for dispatching to inner implementation
macro_rules! impl_for_each_hnsw_with_metric {
    ($type:ident, $instance:expr, $inner:ident, $($tokens:tt)+) => {
        match $instance {
            $type::Euclid($inner) => {
                $($tokens)+
            }
            $type::Cosine($inner) => {
                $($tokens)+
            }
            $type::DotProduct($inner) => {
                $($tokens)+
            }
        }
    };

djc (Collaborator):

Style nit: let's move this down below the Hnsw.

pierd (Author):

Since it's a macro, it has to be defined before its first use (in HnswMap::search).

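For readers unfamiliar with this dispatch pattern, a self-contained toy version (not the PR's code) shows why the define-before-use constraint is the only awkward part: each match arm binds the inner value to a different concrete type, and the same token body is expanded once per variant, so dispatch stays static.

    enum Holder {
        Ints(Vec<i32>),
        Floats(Vec<f32>),
    }

    // Same shape as impl_for_each_hnsw_with_metric: expand one body per variant.
    macro_rules! for_each_holder {
        ($type:ident, $instance:expr, $inner:ident, $($tokens:tt)+) => {
            match $instance {
                $type::Ints($inner) => { $($tokens)+ }
                $type::Floats($inner) => { $($tokens)+ }
            }
        };
    }

    fn count(holder: &Holder) -> usize {
        // `inner` is a &Vec<i32> in one arm and a &Vec<f32> in the other,
        // but the single body compiles for both because both expose len().
        for_each_holder!(Holder, holder, inner, inner.len())
    }

    fn main() {
        assert_eq!(count(&Holder::Ints(vec![1, 2, 3])), 3);
        assert_eq!(count(&Holder::Floats(vec![0.5])), 1);
    }
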
instant-distance-py/src/lib.rs: review thread (resolved)
instant-distance-py/test/test.py: outdated review thread (resolved)
instant-distance-py/src/lib.rs: outdated review thread (resolved)

pierd (Author) commented on Feb 11, 2023:

> Hey, nice work! I did a first round of review, but there's a lot here to digest. I'm most curious about the benchmarking results, I'm guessing the interposition of Metric in its current state would cause performance to regress substantially.
>
> FYI, we usually like to rebase-merge atomic commits. For example, the way I would submit this would probably be to do the refactoring first, then add the other SIMD implementations (one commit per architecture) for the euclid stuff, then add one new metric per commit. No need to do that now if you prefer to keep it like this (since doing it retroactively can be a bit of a pain), although it might still be nice to squash all the follow-up commits into the first one.

Yeah, I think I'll do as you say and split it into more meaningful commits. I'm used to squash-merging the whole PR branch so it lands in main/master as a single commit.

One thing that makes this PR look bigger than it is: I tried to use as much of the qdrant implementation as possible with minimal changes, but on second thought it will fit better if I just take the SIMD implementations and rework the rest.

pierd force-pushed the distance-metrics branch 2 times, most recently from 264dd95 to dcb69c0, on February 15, 2023
Comment on lines +44 to +56
// Helper macro for dispatching to inner implementation
macro_rules! impl_for_each_hnsw_with_metric {
    ($type:ident, $instance:expr, $inner:ident, $($tokens:tt)+) => {
        match $instance {
            $type::Euclid($inner) => {
                $($tokens)+
            }
            $type::Cosine($inner) => {
                $($tokens)+
            }
        }
    };
}

djc (Collaborator):

Style nit: even though we have to define the macro here before using it, I'd like to move the DistanceMetric closer to the bottom of the module (probably below MapValue?), to fit it in with the top-down order.

inner: HnswMapWithMetric,
}

#[derive(Deserialize, Serialize)]

djc (Collaborator):

Style: in order to keep the struct HnswMap close to the impl below, I'd prefer to move the HnswMapWithMetric below the HnswMap impl block (above the Hnsw).

inner: HnswWithMetric,
}

#[derive(Deserialize, Serialize)]

djc (Collaborator):

Same here.

 #[derive(Clone, Deserialize, Serialize)]
-struct FloatArray(#[serde(with = "BigArray")] [f32; DIMENSIONS]);
+struct FloatArray<M> {

djc (Collaborator):

Feels to me like parametrizing over the metric is a separate change from moving from a fixed-size array to a Vec, so I'd like those changes to be in separate commits.

Also, once this is changed to hold a Vec, we should probably also change the name, since it's no longer actually an array? (Maybe FloatVector?)

And in particular, I'm curious about the performance impact of the additional allocations (and pointer indirection/cache effects) of using a Vec here over an array. For that we probably need more of a macro-benchmark; do you have a data set you've tried for this?

pierd (Author):

👍 I'll change the name.

I don't have any real dataset; I've been using random numbers so far. I could potentially use the dataset from the translate.py example, or maybe something from ann-benchmark (although from what I've seen those are usually quite big).

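In case it helps as a starting point, here is a rough sketch of the kind of random-data macro-benchmark discussed above (my own illustration, not part of the PR; it assumes instant-distance's public Builder::default().build(points, values) API and defines a throwaway Vector point type in place of the PR's FloatArray):

    use std::time::Instant;

    use instant_distance::Builder;

    // Hypothetical stand-in for the PR's point type.
    #[derive(Clone)]
    struct Vector(Vec<f32>);

    impl instant_distance::Point for Vector {
        fn distance(&self, other: &Self) -> f32 {
            self.0
                .iter()
                .zip(&other.0)
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f32>()
                .sqrt()
        }
    }

    // Deterministic xorshift floats so the benchmark needs no extra dependencies.
    fn random_vectors(count: usize, dims: usize) -> Vec<Vector> {
        let mut state = 0x2545_f491_4f6c_dd1d_u64;
        let mut next = move || {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state >> 40) as f32 / (1u64 << 24) as f32
        };
        (0..count)
            .map(|_| Vector((0..dims).map(|_| next()).collect()))
            .collect()
    }

    fn main() {
        let points = random_vectors(10_000, 300);
        let values: Vec<usize> = (0..points.len()).collect();

        let start = Instant::now();
        let map = Builder::default().build(points, values);
        println!("built HnswMap over random data in {:?}", start.elapsed());
        drop(map);
    }
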
.iter()?
.map(|val| val.and_then(|v| v.extract::<f32>()))
.collect::<Result<_, _>>()?;
Ok(Self::from(array))

djc (Collaborator):

If you ditch the array binding in favor of yielding Self::from(...collect()) directly, you don't need the type annotation.

pierd (Author):

There might be too many steps for rustc to figure this out. I still need the type annotation on collect:

Ok(Self::from(
    value
        .iter()?
        .map(|val| val.and_then(|v| v.extract::<f32>()))
        .collect::<Result<Vec<f32>, _>>()?,
))

If I remove even the Vec<f32> part, it fails to compile:

error[E0277]: the trait bound `FloatVector<M>: From<()>` is not satisfied
   --> instant-distance-py/src/lib.rs:431:13
    |
430 |           Ok(Self::from(
    |              ---------- required by a bound introduced by this call
431 | /             value
432 | |                 .iter()?
433 | |                 .map(|val| val.and_then(|v| v.extract::<f32>()))
434 | |                 .collect::<Result<_, _>>()?,
    | |___________________________________________^ the trait `From<()>` is not implemented for `FloatVector<M>`
    |
    = help: the trait `From<Vec<f32>>` is implemented for `FloatVector<M>`

error[E0277]: a value of type `()` cannot be built from an iterator over elements of type `f32`
    --> instant-distance-py/src/lib.rs:431:13
     |
431  | /             value
432  | |                 .iter()?
433  | |                 .map(|val| val.and_then(|v| v.extract::<f32>()))
     | |________________________________________________________________^ value of type `()` cannot be built from `std::iter::Iterator<Item=f32>`
434  |                   .collect::<Result<_, _>>()?,
     |                    ------- required by a bound introduced by this call
     |
     = help: the trait `FromIterator<f32>` is not implemented for `()`
     = help: the trait `FromIterator<()>` is implemented for `()`
     = note: required for `Result<(), PyErr>` to implement `FromIterator<Result<f32, PyErr>>`
note: required by a bound in `collect`
    --> /home/kuba/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:1832:19
     |
1832 |     fn collect<B: FromIterator<Self::Item>>(self) -> B
     |                   ^^^^^^^^^^^^^^^^^^^^^^^^ required by this bound in `collect`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `instant-distance-py` due to 2 previous errors

I get the same error if I remove the whole type annotation.

}
}

impl Point for FloatArray {

djc (Collaborator):

So this commit still mixes the adding/changing of a lot of code with moving it around. Usually I'd request a separate commit that just refactors the code so that the implementation here is moved into its new place (introducing the Metric trait) and the code around it is left more or less unchanged, so that it's easier to review these things separately. So then I'd probably end up with:

  • Implement Metric trait and parametrize FloatArray with it
  • Add new Metric types and impls
  • Change FloatArray to FloatVector

Comment on lines +146 to +183
#[cfg(test)]
mod tests {
    #[test]
    fn test_spaces_avx() {
        use super::*;
        use crate::*;

        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            let v1: Vec<f32> = vec![
                10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
                10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
                10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
                10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
                26., 27., 28., 29., 30., 31.,
            ];
            let v2: Vec<f32> = vec![
                40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55.,
                10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
                10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
                10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
                56., 57., 58., 59., 60., 61.,
            ];

            let euclid_simd = unsafe { euclid_distance_avx(&v1, &v2) };
            let euclid = euclid_distance(&v1, &v2);
            assert_eq!(euclid_simd, euclid);

            let dot_simd = unsafe { dot_similarity_avx(&v1, &v2) };
            let dot = dot_similarity(&v1, &v2);
            assert_eq!(dot_simd, dot);

            let mut v1 = v1;
            let mut v1_copy = v1.clone();
            unsafe { cosine_preprocess_avx(&mut v1) };
            cosine_preprocess(&mut v1_copy);
            assert_eq!(v1, v1_copy);
        } else {
            println!("avx test skipped");

djc (Collaborator):

So it would be nice to split the addition of test code between the commits that add the test code (or alternatively, add it separately at the end).

#[derive(Copy, Clone)]
enum DistanceMetric {
    Euclid,
    Cosine,

djc (Collaborator):

Given that we don't actually have a Cosine implementation at this point, we should not introduce this variant here until the commit where we add the backing implementation.
