Update sparse vector benchmarks #4163

Merged
merged 3 commits into from
May 7, 2024

Contributor

@xzfc commented May 3, 2024

This PR adds a new set of benchmarks to the segment and sparse crates, based on downloadable datasets.

These benchmarks will serve as a baseline for evaluating the performance of the new sparse index format (#4143).

The existing benchmark in the segment crate had a few problems:

  1. It was based on uniformly random data, whereas real data usually follows a Zipf's law distribution.
  2. The same query vector was reused across all benchmark iterations.
    We don't want CPU caches to favor less memory-efficient algorithms.
  3. The segment crate is slow to compile, which makes it hard to iterate on the benchmarks.
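
To illustrate problem 2, here is a minimal sketch of rotating through a pool of query vectors so that each iteration uses a different one. The `SparseVector` type, `dot`, and `run_iterations` are hypothetical stand-ins written for this example; they are not the types or functions used in the PR.

```rust
/// Hypothetical stand-in for a sparse vector: parallel index/value arrays,
/// with indices sorted in ascending order.
#[derive(Clone, Debug, PartialEq)]
struct SparseVector {
    indices: Vec<u32>,
    values: Vec<f32>,
}

/// Dot product of two sparse vectors via a sorted merge of their indices.
fn dot(a: &SparseVector, b: &SparseVector) -> f32 {
    let (mut i, mut j, mut sum) = (0usize, 0usize, 0.0f32);
    while i < a.indices.len() && j < b.indices.len() {
        match a.indices[i].cmp(&b.indices[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                sum += a.values[i] * b.values[j];
                i += 1;
                j += 1;
            }
        }
    }
    sum
}

/// Runs `iterations` scoring calls, taking the next query from the pool each
/// time (wrapping around) instead of reusing a single query vector.
fn run_iterations(queries: &[SparseVector], doc: &SparseVector, iterations: usize) -> Vec<f32> {
    let mut query_iter = queries.iter().cycle();
    (0..iterations)
        .map(|_| dot(query_iter.next().expect("queries must not be empty"), doc))
        .collect()
}
```

With a large enough pool, consecutive iterations touch different posting lists, so a hot cache is less likely to hide the cost of a memory-hungry algorithm.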

Changes

This PR contains the following changes:

  1. A new crate lib/common/dataset to download testing datasets during benchmarking.
    1. Downloaded datasets are stored in target/datasets directory.
    2. The total size of unpacked datasets is about 10 GB.
    3. Datasets are downloaded through Rust code.
      Maybe sh -c 'wget -O - "$URL" | gunzip > "$FILE"' would simplify everything, but I've tried to make it as hassle-free as possible (think of Windows users).
    4. Used the same datasets as in sparse-vectors-experiments and sparse-vectors-benchmark,
      i.e. ANN Challenge (based on MS MARCO) and Splade/Wiki Movie Plots.
  2. Code to load sparse datasets from files in JSONL and CSR formats.
    1. lib/sparse/src/index/loaders.rs
  3. A new set of benchmarks in the sparse crate.
    1. Datasets: random-50k, random-500k, ann-1M, ann-full-25pct, movies.
    2. Two kinds of query vectors are used: original and "hottest".
      The hottest vectors are short (4 or 5 elements) and contain the most frequent index in the dataset.
      Hottest vectors are the best-case scenario for pruning optimization.
    3. Since the sparse crate is tiny and has few dependencies, these benchmarks are fast to compile.
  4. Updated benchmark sparse_index_search in the segment crate.
    1. Datasets: random-50k, ann-1M.
    2. No further variations, since this benchmark already takes too long.
    3. The results below also contain random-500k, but I've dropped them from this PR since they're not so interesting.
  5. Adds tick_progress: impl FnMut() to segment::index::VectorIndex::build_index() and related places.
    1. It is used to display fancy progress bars during the benchmarks.
    2. In the production code, it should be a no-op: empty closures || () get optimized out.
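
As a rough illustration of the "hottest" query idea from item 3.2, the sketch below finds the most frequent dimension index in a dataset and then keeps only the short queries that contain it. The function names, the dataset representation (one `Vec<u32>` of indices per vector), and the selection rule are assumptions made for this example; the PR's actual code may select the vectors differently.

```rust
use std::collections::HashMap;

/// Returns `(hottest_id, its_count, average_count)` for a dataset given as
/// one list of dimension indices per vector, or `None` if there are no indices.
fn hottest_index(vectors: &[Vec<u32>]) -> Option<(u32, usize, f64)> {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for vector in vectors {
        for &index in vector {
            *counts.entry(index).or_insert(0) += 1;
        }
    }
    let total: usize = counts.values().sum();
    let average = total as f64 / counts.len() as f64;
    // Break ties by the smaller id so the result is deterministic.
    counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
        .map(|(id, count)| (id, count, average))
}

/// Keeps the queries that contain `hottest_id` and have at most `max_len`
/// elements (hypothetical selection rule).
fn hottest_queries(queries: &[Vec<u32>], hottest_id: u32, max_len: usize) -> Vec<&[u32]> {
    queries
        .iter()
        .filter(|q| q.len() <= max_len && q.contains(&hottest_id))
        .map(Vec::as_slice)
        .collect()
}
```

The count returned for the hottest id corresponds to the length of its posting list, which is what the "Hottest id: ... (elements: ...)" lines in the logs below appear to report.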

Results

I've used these benchmarks to check whether the pruning optimization still gives any performance benefit.
It turns out that it doesn't, at least for the tested datasets.
I think it is safe to drop this optimization.
(Note: this PR doesn't disable pruning.)

| segment benchmark | t | t, no pruning |
| --- | --- | --- |
| random-50k/mmap-inverted-index-search | 893.92 µs | 899.18 µs |
| random-50k/inverted-index-search | 898.14 µs | 897.75 µs |
| random-50k/inverted-index-filtered-plain | 760.75 ms | 755.98 ms |
| random-50k/inverted-index-filtered-payload-index | 925.15 µs | 915.60 µs |
| random-50k/plain-filtered-payload-index | 1.3514 s | 1.3971 s |
| random-500k/mmap-inverted-index-search | 8.3151 ms | 8.2603 ms |
| random-500k/inverted-index-search | 8.3801 ms | 8.3348 ms |
| random-500k/inverted-index-filtered-payload-index | 8.3642 ms | 8.3071 ms |
| ann-1M/mmap-inverted-index-search | 13.570 ms | 13.606 ms |
| ann-1M/inverted-index-search | 13.737 ms | 13.620 ms |
| ann-1M/inverted-index-filtered-payload-index | 13.048 ms | 13.004 ms |

| sparse benchmark | t | t, no pruning |
| --- | --- | --- |
| random-50k/basic | 847.29 µs | 844.31 µs |
| random-50k/hottest | 183.18 µs | 185.32 µs |
| random-500k/basic | 7.9571 ms | 7.8274 ms |
| random-500k/hottest | 1.5880 ms | 1.6142 ms |
| ann-1M/basic | 12.933 ms | 12.822 ms |
| ann-1M/hottest | 8.8609 ms | 8.7062 ms |
| ann-full-25pct/basic | 31.241 ms | 29.914 ms |
| ann-full-25pct/hottest | 19.734 ms | 19.273 ms |
| movies/basic | 4.0528 ms | 4.0185 ms |
| movies/hottest | 213.15 µs | 204.63 µs |
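
To put the two columns side by side, the following sketch computes the relative difference between the point estimates with and without pruning. The helper names are made up for this example, and only three rows from the summary are included. Criterion's own "change" lines in the logs below are estimated from the sample distributions, so they need not match these figures exactly.

```rust
/// Relative change from the run with pruning to the run without it, in percent.
/// Positive means the run without pruning was slower.
fn relative_change_pct(with_pruning: f64, without_pruning: f64) -> f64 {
    (without_pruning - with_pruning) / with_pruning * 100.0
}

/// A few rows from the summary above as (name, t, t without pruning);
/// both times in a row use the same unit.
fn sample_rows() -> Vec<(&'static str, f64, f64)> {
    vec![
        ("segment random-50k/mmap-inverted-index-search", 893.92, 899.18),
        ("sparse ann-1M/hottest", 8.8609, 8.7062),
        ("sparse movies/hottest", 213.15, 204.63),
    ]
}

/// Formats one line per row, e.g. for printing next to the summary.
fn report() -> Vec<String> {
    sample_rows()
        .into_iter()
        .map(|(name, t, t_no_pruning)| {
            format!("{name}: {:+.2}%", relative_change_pct(t, t_no_pruning))
        })
        .collect()
}
```

For these three rows the differences come out at roughly +0.59%, -1.75%, and -4.00%, all small relative to the run-to-run noise visible in the logs.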
cargo bench -p segment
sparse_vector_index_search/random-50k/mmap-inverted-index-search
                        time:   [891.59 µs 893.92 µs 896.44 µs]
                        change: [+3.6161% +4.6727% +5.6024%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
sparse_vector_index_search/random-50k/inverted-index-search
                        time:   [895.10 µs 898.14 µs 900.73 µs]
                        change: [+1.2692% +3.0670% +4.5270%] (p = 0.00 < 0.05)
                        Performance has regressed.
Benchmarking sparse_vector_index_search/random-50k/inverted-index-filtered-plain: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.4s.
sparse_vector_index_search/random-50k/inverted-index-filtered-plain
                        time:   [409.62 ms 760.75 ms 1.1357 s]
                        change: [-66.748% -36.982% +7.4210%] (p = 0.11 > 0.05)
                        No change in performance detected.
sparse_vector_index_search/random-50k/inverted-index-filtered-payload-index
                        time:   [921.65 µs 925.15 µs 929.20 µs]
                        change: [-1.0046% -0.1212% +0.8004%] (p = 0.81 > 0.05)
                        No change in performance detected.
Benchmarking sparse_vector_index_search/random-50k/plain-filtered-payload-index: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.7s.
sparse_vector_index_search/random-50k/plain-filtered-payload-index
                        time:   [1.1555 s 1.3514 s 1.5005 s]
                        change: [-19.308% +1.2513% +37.489%] (p = 0.93 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

sparse_vector_index_search/random-500k/mmap-inverted-index-search
                        time:   [8.1384 ms 8.3151 ms 8.4900 ms]
                        change: [-1.6193% +2.6919% +6.9615%] (p = 0.25 > 0.05)
                        No change in performance detected.
sparse_vector_index_search/random-500k/inverted-index-search
                        time:   [8.2388 ms 8.3801 ms 8.5017 ms]
                        change: [-0.8878% +1.9281% +4.7460%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
sparse_vector_index_search/random-500k/inverted-index-filtered-payload-index
                        time:   [8.2084 ms 8.3642 ms 8.4518 ms]
                        change: [-8.6079% -4.9434% -1.3346%] (p = 0.02 < 0.05)
                        Performance has improved.

sparse_vector_index_search/ann-1M/mmap-inverted-index-search
                        time:   [13.208 ms 13.570 ms 14.066 ms]
sparse_vector_index_search/ann-1M/inverted-index-search
                        time:   [12.836 ms 13.737 ms 14.939 ms]
sparse_vector_index_search/ann-1M/inverted-index-filtered-payload-index
                        time:   [12.586 ms 13.048 ms 13.545 ms]
cargo bench -p segment; with pruning disabled
Gnuplot not found, using plotters backend
sparse_vector_index_search/random-50k/mmap-inverted-index-search
                        time:   [896.94 µs 899.18 µs 901.82 µs]
                        change: [-0.1686% +0.9018% +2.2975%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
sparse_vector_index_search/random-50k/inverted-index-search
                        time:   [894.43 µs 897.75 µs 901.29 µs]
                        change: [-0.9302% -0.0494% +0.8305%] (p = 0.91 > 0.05)
                        No change in performance detected.
Benchmarking sparse_vector_index_search/random-50k/inverted-index-filtered-plain: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.4s.
sparse_vector_index_search/random-50k/inverted-index-filtered-plain
                        time:   [404.34 ms 755.98 ms 1.1322 s]
                        change: [-52.430% -0.6268% +116.57%] (p = 0.99 > 0.05)
                        No change in performance detected.
sparse_vector_index_search/random-50k/inverted-index-filtered-payload-index
                        time:   [911.78 µs 915.60 µs 919.76 µs]
                        change: [-1.7842% -1.0195% -0.2468%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Benchmarking sparse_vector_index_search/random-50k/plain-filtered-payload-index: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 11.1s.
sparse_vector_index_search/random-50k/plain-filtered-payload-index
                        time:   [1.1897 s 1.3971 s 1.5596 s]
                        change: [-13.558% +3.3816% +24.795%] (p = 0.75 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

sparse_vector_index_search/random-500k/mmap-inverted-index-search
                        time:   [8.0907 ms 8.2603 ms 8.4144 ms]
                        change: [-4.0694% -0.6649% +2.8346%] (p = 0.73 > 0.05)
                        No change in performance detected.
sparse_vector_index_search/random-500k/inverted-index-search
                        time:   [8.1963 ms 8.3348 ms 8.4549 ms]
                        change: [-3.2167% -0.5233% +2.5681%] (p = 0.74 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
sparse_vector_index_search/random-500k/inverted-index-filtered-payload-index
                        time:   [8.1441 ms 8.3071 ms 8.4003 ms]
                        change: [-3.0552% -0.7407% +1.5853%] (p = 0.56 > 0.05)
                        No change in performance detected.

sparse_vector_index_search/ann-1M/mmap-inverted-index-search
                        time:   [13.237 ms 13.606 ms 14.108 ms]
                        change: [-4.8955% -0.0218% +5.0834%] (p = 0.99 > 0.05)
                        No change in performance detected.
sparse_vector_index_search/ann-1M/inverted-index-search
                        time:   [12.714 ms 13.620 ms 14.832 ms]
                        change: [-8.3499% -0.9143% +6.8954%] (p = 0.82 > 0.05)
                        No change in performance detected.
sparse_vector_index_search/ann-1M/inverted-index-filtered-payload-index
                        time:   [12.545 ms 13.004 ms 13.495 ms]
                        change: [-4.4284% -0.3857% +4.0547%] (p = 0.87 > 0.05)
                        No change in performance detected.
cargo bench -p sparse
search/random-50k/basic time:   [844.18 µs 847.29 µs 850.39 µs]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
Hottest id: 482 (elements: 1091), average elements: 376.42551481667505
search/random-50k/hottest
                        time:   [183.05 µs 183.18 µs 183.33 µs]
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe

search/random-500k/basic
                        time:   [7.8235 ms 7.9571 ms 8.0918 ms]
Hottest id: 97 (elements: 10261), average elements: 3755.034260526931
Benchmarking search/random-500k/hottest: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.0s, enable flat sampling, or reduce sample count to 50.
search/random-500k/hottest
                        time:   [1.5854 ms 1.5880 ms 1.5905 ms]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

search/ann-1M/basic     time:   [12.654 ms 12.933 ms 13.215 ms]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
Hottest id: 1012 (elements: 660546), average elements: 4195.426417350294
search/ann-1M/hottest   time:   [8.8394 ms 8.8609 ms 8.8824 ms]

search/ann-full-25pct/basic
                        time:   [30.351 ms 31.241 ms 32.154 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Hottest id: 1012 (elements: 1456224), average elements: 9279.75874323292
search/ann-full-25pct/hottest
                        time:   [19.669 ms 19.734 ms 19.801 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

search/movies/basic     time:   [3.9899 ms 4.0528 ms 4.1151 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
Hottest id: 2839 (elements: 32440), average elements: 303.8177823300073
search/movies/hottest   time:   [212.78 µs 213.15 µs 213.54 µs]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe
cargo bench -p sparse; with pruning disabled
Gnuplot not found, using plotters backend
search/random-50k/basic time:   [839.37 µs 844.31 µs 849.08 µs]
                        change: [-2.1879% -1.0074% +0.2180%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low severe
  1 (1.00%) high severe
Hottest id: 482 (elements: 1091), average elements: 376.42551481667505
search/random-50k/hottest
                        time:   [184.76 µs 185.32 µs 186.15 µs]
                        change: [+0.8330% +1.7765% +2.9973%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

search/random-500k/basic
                        time:   [7.6951 ms 7.8274 ms 7.9593 ms]
                        change: [-3.7983% -1.6307% +0.9203%] (p = 0.18 > 0.05)
                        No change in performance detected.
Hottest id: 97 (elements: 10261), average elements: 3755.034260526931
Benchmarking search/random-500k/hottest: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
search/random-500k/hottest
                        time:   [1.6111 ms 1.6142 ms 1.6174 ms]
                        change: [+1.1951% +1.7077% +2.2274%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

search/ann-1M/basic     time:   [12.546 ms 12.822 ms 13.102 ms]
                        change: [-3.7814% -0.8543% +2.2980%] (p = 0.59 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
Hottest id: 1012 (elements: 660546), average elements: 4195.426417350294
search/ann-1M/hottest   time:   [8.6825 ms 8.7062 ms 8.7294 ms]
                        change: [-2.1049% -1.7463% -1.3910%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

search/ann-full-25pct/basic
                        time:   [29.121 ms 29.914 ms 30.715 ms]
                        change: [-8.0551% -4.2505% -0.7468%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Hottest id: 1012 (elements: 1456224), average elements: 9279.75874323292
search/ann-full-25pct/hottest
                        time:   [19.212 ms 19.273 ms 19.333 ms]
                        change: [-2.7636% -2.3398% -1.9002%] (p = 0.00 < 0.05)
                        Performance has improved.

search/movies/basic     time:   [3.9564 ms 4.0185 ms 4.0797 ms]
                        change: [-3.0176% -0.8470% +1.3052%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild
Hottest id: 2839 (elements: 32440), average elements: 303.8177823300073
search/movies/hottest   time:   [204.24 µs 204.63 µs 205.03 µs]
                        change: [-6.5819% -4.7335% -3.1812%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

parking_lot = { version = "0.12.2", features = ["deadlock_detection", "serde"] }
pprof = { version = "0.12", features = ["flamegraph", "prost-codec"] }
Member

Does this mean that it is a non-dev dependency?

Contributor Author

It's a workspace dependency, to be referenced like this: pprof = { workspace = true }.
In this PR, it's referenced only in the dev-dependencies of the collection, dataset, and segment crates.
So, no.

let (nrow, _ncol, nnz) = (*nrow as usize, *ncol as usize, *nnz as usize);

let indptr = Vec::from(transmute_from_u8_to_slice::<u64>(
&mmap.as_ref()[24..24 + 8 * (nrow + 1)],
Member

What do those numbers mean?

Contributor Author

It recreates the format being used in https://github.com/qdrant/sparse-vectors-benchmark/blob/master/src/sparse_matrix.py.
Added an explanation table in a doc comment.
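
For readers without the doc comment at hand, here is a minimal, self-contained sketch of how such a file could be parsed. The quoted code confirms only the first two parts: a 24-byte header of three 8-byte integers (nrow, ncol, nnz) followed by nrow + 1 `u64` row offsets. Everything else here is an assumption made for illustration: little-endian byte order, `u32` column indices followed by `f32` values after the offsets, and all type and function names.

```rust
/// CSR matrix parsed from the binary layout described above.
#[derive(Debug)]
struct Csr {
    nrow: usize,
    ncol: usize,
    indptr: Vec<u64>,
    indices: Vec<u32>,
    values: Vec<f32>,
}

impl Csr {
    /// Returns the (indices, values) of row `i`, or `None` if out of range.
    fn row(&self, i: usize) -> Option<(&[u32], &[f32])> {
        let start = *self.indptr.get(i)? as usize;
        let end = *self.indptr.get(i + 1)? as usize;
        Some((self.indices.get(start..end)?, self.values.get(start..end)?))
    }
}

/// Decodes a little-endian u64 from an exactly 8-byte chunk.
fn le_u64(chunk: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(chunk);
    u64::from_le_bytes(buf)
}

/// Copies an exactly 4-byte chunk into an array for `from_le_bytes`.
fn word(chunk: &[u8]) -> [u8; 4] {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(chunk);
    buf
}

/// Parses the whole file contents; returns `None` if the data is too short.
fn parse_csr(bytes: &[u8]) -> Option<Csr> {
    // Header: three u64 values (nrow, ncol, nnz) in bytes 0..24.
    let nrow = le_u64(bytes.get(0..8)?) as usize;
    let ncol = le_u64(bytes.get(8..16)?) as usize;
    let nnz = le_u64(bytes.get(16..24)?) as usize;

    // Section boundaries; checked arithmetic guards against corrupt headers.
    let indptr_end = nrow.checked_add(1)?.checked_mul(8)?.checked_add(24)?;
    let indices_end = nnz.checked_mul(4)?.checked_add(indptr_end)?;
    let values_end = nnz.checked_mul(4)?.checked_add(indices_end)?;

    // Row offsets: nrow + 1 u64 values in bytes 24..24 + 8 * (nrow + 1).
    let indptr = bytes.get(24..indptr_end)?.chunks_exact(8).map(le_u64).collect();
    // Assumed: nnz u32 column indices, then nnz f32 values.
    let indices = bytes
        .get(indptr_end..indices_end)?
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes(word(c)))
        .collect();
    let values = bytes
        .get(indices_end..values_end)?
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes(word(c)))
        .collect();
    Some(Csr { nrow, ncol, indptr, indices, values })
}
```

Unlike the quoted code, this sketch copies the data out of the byte slice instead of reinterpreting the memory-mapped bytes in place, which avoids alignment concerns at the cost of an extra pass.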

lib/sparse/src/common/sparse_vector.rs Outdated
let query_vectors =
loaders::load_csr_vecs(Dataset::AnnChallengeQueries.download().unwrap()).unwrap();

let index_1m = load_csr_index(Dataset::AnnChallenge1M.download().unwrap(), 1.0).unwrap();
Member

Are those MS MARCO?

Contributor Author

It's the same benchmark used in https://github.com/qdrant/sparse-vectors-benchmark. The embeddings are based on MS MARCO. Renamed to NeurIps2023, as that is a more accurate name.

&mut self,
permit: Arc<CpuPermit>,
stopped: &AtomicBool,
tick_progress: impl FnMut(),
Member

It looks like it doesn't need to be FnMut. If we use atomics to count (and it looks like we do), impl Fn should be enough.

Contributor Author

I'd prefer to accept the broadest possible type of closure, which is FnMut in this case.

Member

It might be that having FnMut can prevent some compiler-level optimizations
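
For context on the trade-off discussed in this thread, here is a small sketch of what an `impl FnMut()` parameter accepts. `build_index` below is a hypothetical stand-in, not the real `VectorIndex::build_index`. Since every `Fn` closure also implements `FnMut`, the same parameter takes a no-op closure, a closure that mutates a local counter, and a closure that only touches an atomic. The sketch says nothing about the optimization question raised above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for an index build step: it sums the items and
/// calls `tick_progress` once per processed item.
fn build_index(items: &[u32], mut tick_progress: impl FnMut()) -> u64 {
    let mut sum = 0u64;
    for &item in items {
        sum += u64::from(item);
        tick_progress();
    }
    sum
}

/// Calls `build_index` with three kinds of closures and returns the tick
/// counts observed by the mutable counter and by the atomic counter.
fn demo(items: &[u32]) -> (usize, usize) {
    // No-op closure, as production code might pass.
    build_index(items, || ());

    // FnMut closure: mutates a captured local variable.
    let mut ticks = 0usize;
    build_index(items, || ticks += 1);

    // Fn closure: needs only shared access, thanks to the atomic.
    let atomic_ticks = AtomicUsize::new(0);
    build_index(items, || {
        atomic_ticks.fetch_add(1, Ordering::Relaxed);
    });

    (ticks, atomic_ticks.load(Ordering::Relaxed))
}
```

With an `impl Fn()` parameter instead, the second call would not compile, because that closure needs mutable access to `ticks`.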

lib/common/dataset/Cargo.toml Outdated
}
}

impl<'a> Profiler for FlamegraphProfiler<'a> {
Member

@agourlay commented May 6, 2024

What alternatives do we have to avoid copy-pasting this file in the future?

Contributor Author

  1. Put it in a separate crate.
  2. Low-effort hack: keep it in one place and have multiple symlinks pointing to it.
  3. Get rid of it. Personally, I haven't found a use for it yet, as I'm using external profilers, e.g. perf+hotspot or vtune. I only added it because benches in other crates have it.

@xzfc merged commit 6f738ca into dev May 7, 2024
17 checks passed
@xzfc deleted the sparse-benchmarks branch May 7, 2024 20:34
timvisee pushed a commit that referenced this pull request May 9, 2024