Lums/sc 42599/implement ivf pq index #279

lums658 · 2024-03-13T04:49:00Z

This PR implements an IVF index with PQ encoding. It depends on the results from #265, so it is branched from there.

As the name implies, this index synthesizes the indexing from IVF with the vector encoding from PQ. Ultimately there should be a decomposition of indexes into index representation, vector representation, and distance computation. However, because there is still a need to gather sufficient implementation experience with the different variants of each (and their compositions), the current implementation is monolithic. The new story sc-43058 was created for this refactoring.

The current scheme for C++ indexes involves implementing a class for the metadata, a class for the storage group, and a class for the index itself. These live in separate files in the index subdirectory for each of the current indexes. (Note that flat_l2_index and flat_pq_index have not been refactored this way yet, though placeholder files for the metadata and group classes have been created.)

Group and Metadata

The metadata class ivf_pq_metadata essentially combines the metadata for IVF Flat and Flat PQ. The group ivf_pq_group contains the arrays that integrate IVF and PQ:

cluster_centroids -- the centroids used for encoding with PQ
flat_ivf_centroids -- the flat (unencoded) centroids used for IVF indexing. These are obtained by applying kmeans to the flat input vectors.
pq_ivf_centroids -- the PQ encoded centroids used for IVF indexing. These are currently obtained by PQ encoding the flat_ivf_centroids.
pq_ivf_vectors -- PQ encoded vectors from the input, partitioned and shuffled
ivf_index -- IVF index vector (same as in IVF Flat)
ivf_ids -- vector ids (same as in IVF Flat)
distance_tables_* -- distance tables, one table per subspace and one array per table. These should probably go into a single array at some point.
The data for pq_ivf_vectors, ivf_index, and ivf_ids are maintained as a PartitionedMatrix (or tdbPartitionedMatrix).
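
As a rough illustration of what one of the distance_tables_* arrays listed above might hold, the sketch below builds a table of pairwise squared L2 distances between the centroids of a single subspace. The helper name make_distance_table, the row-major layout, and the use of squared L2 are assumptions made for this example, not details taken from the PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the PR: centroids[c] holds centroid c of
// one PQ subspace. Returns a row-major n x n table where entry (i, j) is the
// squared L2 distance between centroid i and centroid j.
std::vector<float> make_distance_table(
    const std::vector<std::vector<float>>& centroids) {
  const std::size_t n = centroids.size();
  std::vector<float> table(n * n, 0.0f);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = i + 1; j < n; ++j) {
      float d = 0.0f;
      for (std::size_t k = 0; k < centroids[i].size(); ++k) {
        const float diff = centroids[i][k] - centroids[j][k];
        d += diff * diff;
      }
      // The table is symmetric, so fill both halves at once.
      table[i * n + j] = d;
      table[j * n + i] = d;
    }
  }
  return table;
}
```

With one such table per subspace, the distance between two encoded vectors reduces to one lookup per subspace.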

Ingestion

The first step in ingestion is train_pq, which generates the cluster_centroids using sub_kmeans, as in the previous PQ-related PR. (Note that all kmeans functions have been moved to the file kmeans.h.) Once we have the cluster_centroids, we can encode all of the input vectors. train_pq also generates the distance_tables.
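
To make the encoding step concrete, here is a minimal sketch of PQ-encoding a single vector once per-subspace centroids exist. The Codebook layout and the function name pq_encode_vector are hypothetical and chosen for this example only; the PR's own pq_encode operates on whole matrices and may be organized differently. The sketch also assumes at most 256 centroids per subspace, so that a code fits in one byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical layout: codebook[s][c] is centroid c of subspace s, stored as
// a vector of sub_dim floats.
using Codebook = std::vector<std::vector<std::vector<float>>>;

// Encode one flat vector as one centroid id per subspace.
// Assumes codebook is non-empty and v.size() == codebook.size() * sub_dim.
std::vector<std::uint8_t> pq_encode_vector(
    const std::vector<float>& v, const Codebook& codebook) {
  const std::size_t num_subspaces = codebook.size();
  const std::size_t sub_dim = v.size() / num_subspaces;
  std::vector<std::uint8_t> code(num_subspaces, 0);
  for (std::size_t s = 0; s < num_subspaces; ++s) {
    float best = std::numeric_limits<float>::max();
    for (std::size_t c = 0; c < codebook[s].size(); ++c) {
      // Squared L2 distance between sub-vector s of v and centroid c.
      float d = 0.0f;
      for (std::size_t k = 0; k < sub_dim; ++k) {
        const float diff = v[s * sub_dim + k] - codebook[s][c][k];
        d += diff * diff;
      }
      if (d < best) {
        best = d;
        code[s] = static_cast<std::uint8_t>(c);
      }
    }
  }
  return code;
}
```

For a 4-dimensional vector and two subspaces of dimension 2, the result is a 2-byte code: one centroid id for the first half of the vector and one for the second.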

The train_ivf function applies kmeans to create a set of centroids and partition the input vectors. There are two ways this could be accomplished: partition the flat vectors first to create a set of flat centroids, or encode the vectors first and then apply kmeans to the encoded vectors to create a set of encoded (or unencoded) centroids. We opted for the former in this PR, which lets us create the compressed pq_ivf_centroids from the flat_ivf_centroids.
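
For readers unfamiliar with the kmeans step on flat vectors, the following sketch shows a single Lloyd iteration: assign each vector to its nearest centroid, then move each centroid to the mean of its members. It is a textbook illustration under simplifying assumptions (squared L2, single-threaded, no convergence check); the names Matrix, nearest, and lloyd_step are hypothetical and are not the functions in kmeans.h.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // one row per vector

// Index of the centroid nearest to v under squared L2 distance.
std::size_t nearest(const std::vector<float>& v, const Matrix& centroids) {
  std::size_t best_idx = 0;
  float best = std::numeric_limits<float>::max();
  for (std::size_t c = 0; c < centroids.size(); ++c) {
    float d = 0.0f;
    for (std::size_t k = 0; k < v.size(); ++k) {
      const float diff = v[k] - centroids[c][k];
      d += diff * diff;
    }
    if (d < best) {
      best = d;
      best_idx = c;
    }
  }
  return best_idx;
}

// One Lloyd iteration. Updates centroids in place and returns the label
// (partition id) of every input vector. Assumes centroids is non-empty.
// Clusters that receive no vectors keep their previous centroid.
std::vector<std::size_t> lloyd_step(const Matrix& vectors, Matrix& centroids) {
  const std::size_t dim = centroids[0].size();
  std::vector<std::size_t> labels(vectors.size());
  Matrix sums(centroids.size(), std::vector<float>(dim, 0.0f));
  std::vector<std::size_t> counts(centroids.size(), 0);
  for (std::size_t i = 0; i < vectors.size(); ++i) {
    labels[i] = nearest(vectors[i], centroids);
    ++counts[labels[i]];
    for (std::size_t k = 0; k < dim; ++k) {
      sums[labels[i]][k] += vectors[i][k];
    }
  }
  for (std::size_t c = 0; c < centroids.size(); ++c) {
    if (counts[c] == 0) continue;
    for (std::size_t k = 0; k < dim; ++k) {
      centroids[c][k] = sums[c][k] / static_cast<float>(counts[c]);
    }
  }
  return labels;
}
```

Repeating this step until the labels stop changing yields the flat centroids. The labels from the final assignment are what a partitioning step would consume.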

Once we have the ivf_centroids (flat or pq), we can partition and shuffle the input vectors (unencoded or encoded). In our case, we use the flat centroids to create the partitioning and then apply that partitioning to shuffle the encoded vectors. Again, there are multiple variations to try in terms of whether to use encoded or unencoded quantities at various steps in the ingestion / indexing process.
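
The "partition and shuffle" step can be pictured as a counting sort on the partition labels. The sketch below is an assumption-laden illustration, not the PartitionedMatrix constructor from the PR: the struct Partitioned, the function shuffle_by_label, and the convention that the offsets array has num_parts + 1 entries are all hypothetical. It requires every label to be smaller than num_parts.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: index[p] .. index[p + 1] is the range of
// positions that belong to partition p, and ids[pos] is the original id of
// the vector stored at position pos after the shuffle.
struct Partitioned {
  std::vector<std::size_t> index;
  std::vector<std::uint64_t> ids;
};

Partitioned shuffle_by_label(
    const std::vector<std::size_t>& labels, std::size_t num_parts) {
  Partitioned out;
  out.index.assign(num_parts + 1, 0);
  // Count the members of each partition, shifted by one slot.
  for (std::size_t label : labels) {
    ++out.index[label + 1];
  }
  // Prefix sum turns the counts into starting offsets.
  for (std::size_t p = 0; p < num_parts; ++p) {
    out.index[p + 1] += out.index[p];
  }
  // Place each vector id at the next free slot of its partition.
  std::vector<std::size_t> cursor(out.index.begin(), out.index.end() - 1);
  out.ids.resize(labels.size());
  for (std::size_t i = 0; i < labels.size(); ++i) {
    out.ids[cursor[labels[i]]++] = static_cast<std::uint64_t>(i);
  }
  return out;
}
```

The encoded vectors would then be copied into their new positions in the same order as ids, so that each partition occupies one contiguous range.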

Right now everything happens in the add function:

    train_pq(training_set);   // Create cluster_centroids_, distance_tables_
    train_ivf(training_set);   // Create flat_ivf_centroids_
    unpartitioned_pq_vectors_ = pq_encode(training_set); // encode the input vectors
    pq_ivf_centroids_ = std::move(*pq_encode(flat_ivf_centroids_)); // encode the flat ivf centroids

    auto partition_labels = detail::flat::qv_partition(
        flat_ivf_centroids_, training_set, num_threads_, distance);  // partition

    partitioned_pq_vectors_ = std::make_unique<pq_storage_type>(
        *unpartitioned_pq_vectors_, partition_labels, num_unique_labels);  // shuffle

There is not currently a separate train function, though perhaps this add() function should be split into different parts. These pieces were kept together for the time being to allow experimentation with different ordering of encoding and so forth.

Note that when the create_default_impl function is called during ingestion, the metadata are set with information from the index at the very top of the function, as that information will be used later on in the function.

Queries

Queries are applied by calling the existing query functions to search the IVF index, using an appropriate distance function. If we pass in flat query vectors, we use the asymmetric distance function. If we pass in compressed query vectors, we use the symmetric distance function. Currently we do not compress the queries, so we use the asymmetric distance function. Per sc-43051, we should SIMDize these functions, as well as reorder the loops for encoding, etc. -- which should result in a 10-20X speedup. The ivf_pq_index class currently includes only query_finite and query_infinite (it does not have calls to the various qv, nuv, and so forth).
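
The difference between the two distance functions can be sketched as follows. This is a scalar illustration under assumed data layouts, not the PR's sub_distance_symmetric or its asymmetric counterpart: the Codebook type, the per-subspace row-major tables, and both function names are hypothetical, and squared L2 is assumed as the metric.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: codebook[s][c] is centroid c of subspace s.
using Codebook = std::vector<std::vector<std::vector<float>>>;

// Asymmetric: a flat (unencoded) query against a PQ code. Each sub-vector of
// the query is compared with the centroid that the code selects.
float asymmetric_distance(
    const std::vector<float>& query,
    const std::vector<std::uint8_t>& code,
    const Codebook& codebook) {
  const std::size_t sub_dim = query.size() / codebook.size();
  float sum = 0.0f;
  for (std::size_t s = 0; s < codebook.size(); ++s) {
    const std::vector<float>& centroid = codebook[s][code[s]];
    for (std::size_t k = 0; k < sub_dim; ++k) {
      const float diff = query[s * sub_dim + k] - centroid[k];
      sum += diff * diff;
    }
  }
  return sum;
}

// Symmetric: two PQ codes. tables[s] is a row-major
// num_centroids x num_centroids table of centroid-to-centroid distances for
// subspace s, so each subspace costs a single lookup.
float symmetric_distance(
    const std::vector<std::uint8_t>& a,
    const std::vector<std::uint8_t>& b,
    const std::vector<std::vector<float>>& tables,
    std::size_t num_centroids) {
  float sum = 0.0f;
  for (std::size_t s = 0; s < tables.size(); ++s) {
    sum += tables[s][a[s] * num_centroids + b[s]];
  }
  return sum;
}
```

The asymmetric form needs no query encoding but touches every dimension of the query. The symmetric form does less arithmetic per comparison but requires the query to be encoded first.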

Unit Testing

Unit testing performs most of the same tests as for flat_pq and for ivf_flat -- though currently not as thoroughly.

Distributed Computation

This should be as easy to distribute as IVF Flat, since partitioning and querying (etc) are oblivious to vector representation.

src/include/index/ivf_pq_index.h

NikolaosPapailiou · 2024-03-29T14:21:57Z

+ // that the distance functions are inlined.
+ // @todo Make this SIMD friendly -- do multiple subspaces at a time
+ // For each (i, j), distances should be stored contiguously
+ float sub_distance_symmetric(auto&& a, auto&& b) const {


We should make the difference between symmetric and asymmetric distances clearer.

From my understanding, symmetric means that both vectors are PQ encoded, and asymmetric means that b is encoded and a is not. We also only use the asymmetric distance at the moment; however, we still generate distance_tables_ in all cases, even though they are only used for symmetric distance computation.

lums658 · Apr 2, 2024

That is also my understanding of symmetric vs asymmetric distance. The index is created to be agnostic to which kind of query might be presented to it, so it computes the distance_tables regardless. It was also intended to have both symmetric and asymmetric queries, like flat_pq_index, but I decided to just put in queries that match the names/functionality of ivf_flat and picked asymmetric as the underlying query.

We should add support for symmetric queries. If we have the options of finite vs infinite (and maybe even finer grained selection of algorithm) as well as symmetric vs asymmetric we may want to decide on a clean API -- though query_{finite, infinite}_ram_{symmetric,asymmetric} is probably as good a naming scheme as any. I can make this change or we can leave it to another PR. Besides the functions we will also need additional unit tests.

We could conceivably only create (or load) the distance tables the first time a symmetric query is requested. But my own opinion is that we should have the index complete when it is constructed. If this change is to be made, it should probably be its own PR.

NikolaosPapailiou · Apr 5, 2024

Let's add some more comments on symmetric vs asymmetric in this PR, plus some TODOs and tasks to follow up on supporting symmetric queries. Curious, @lums658, which method do you expect to be the most performant? Shouldn't it be symmetric, using the distance table?

lums658 · Apr 12, 2024

For just the distance function, if you are given an already-encoded query, symmetric is probably faster. However, if you are given an unencoded query vector, encoding the query vector is quite expensive. I haven't done the experiments to see whether that cost would be amortized over an entire query.

OTOH, we can accelerate encoding by quite a bit by doing it with SIMD instructions, so that will definitely change the comparison. And of course, we can use SIMD instructions for both the symmetric and asymmetric queries.

We need to do the experiments to actually understand the performance / accuracy tradeoffs.

Story 45153 created for this.

jparismorgan · 2024-04-16T14:59:24Z

src/include/index/ivf_flat_group.h

@@ -179,7 +181,6 @@ class ivf_flat_index_group
 metadata_.base_sizes_ = {0};
 metadata_.partition_history_ = {0};
 metadata_.temp_size_ = 0;
- metadata_.dimension_ = this->get_dimension();


@lums658 could you help explain why we made this change and instead do this->set_dimension(this->cached_index_.get().dimension());? Is there a bug here?

jparismorgan · 2024-04-16T15:02:00Z

src/include/index/ivf_pq_group.h

+
+ public:
+ using index_group_metadata_type = ivf_pq_metadata;
+ // using index_type = Index;


@lums658 could you help explain this? Is this a TODO to change to this later? Or can I remove it?

jparismorgan · 2024-05-02T12:29:20Z

NOTE(paris): I'm currently waiting to merge this until Vamana is fully working end to end, as that will help keep the changes I need to do when fixing bugs in the group / metadata / etc. down. Once it's working I'll bring this in and apply all changes.

lums658 added 30 commits March 5, 2024 01:27

Squash merge from lums/sc-41756/avx2-l2-distance [skip ci]

f4433de

Added distance parameterization to ivf/qv.h [skip ci]

c9ce7bc

clang-format [skip ci]

15ad941

parameterize distance function for flatpq_index [skip ci]

4f6f783

Parameterized ivm flat with distance [skip ci]

4c1d9af

Parameterized vamana with distance [skip ci]

8983a7f

Add inner product function object [skip ci]

c72038f

Add tests to unit_flat_qv [skip ci]

c5a2875

Add tests to unit_ivf_qv [skip ci]

cbaae09

Update documentation, clang format

1a79f4d

Merge branch 'main' into lums/sc-40565/distance-parameterization

b12c230

# Conflicts: # src/include/detail/flat/qv.h # src/include/detail/ivf/qv.h # src/include/detail/scoring/README.md # src/include/index/flatpq_index.h # src/include/scoring.h

Fix duplicate symbol error

572f571

Apply comments from PR review

f99ae3e

Fix bug in unrolled inner product, add more unit tests

3f5ad4e

Clang format

e092fc4

Create initial files [skip ci]

1957f2d

Initial scaffolding, new kmeans.h

79387d2

Merge branch 'main' into lums/sc-42599/implement-ivf-pq-index

df0349f

kmeans.h completed, ivf_flat and flat_pq unit_tests passing [skip ci]

bf2e9d3

checkpoint [skip ci]

b02cbed

Initial rough out [skip ci]

dcec9e2

Initial rough out [skip ci]

affd1b3

Final rough out [skip ci]

daaafab

Add index member to index_group [skip ci]

01e349c

unit_pq_group compiles [skip ci]

a1f1931

unit_pq_group passes simple unit test [skip ci]

db2b19e

unit_pq_index compiles [skip ci]

ea93ebe

unit_api_flat_index and unit_ivf_flat_index compile and pass all unit tests [skip ci]

a48980c

unit_ivf_pq_index compiling and passing very simple tests [skip ci]

d432ea9

Merge remote-tracking branch 'origin/main' into lums/sc-42599/implement-ivf-pq-index

7d411eb

add back unit tests

3258106

NikolaosPapailiou reviewed Mar 29, 2024

lums658 and others added 2 commits April 12, 2024 13:30

Address PR comments

cd10d9f

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

6abd158

jparismorgan reviewed Apr 16, 2024

jparismorgan added 5 commits April 16, 2024 17:19

fix build

3d1d3fb

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

6e3c6d6

fix build

c759cf9

lint

3033e92

fix test

5aaddf5

jparismorgan added 15 commits May 13, 2024 15:23

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

5358b6c

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

e7bc3f3

fixes

308a91b

fixes

a136026

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

a5f4322

fix build

fbd7f9b

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

3ad1df8

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

74eb0c6

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

4ad3f07

fixes

1a4ff3f

fix build

6b625c4

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

4c2465c

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into lums/sc-42599/implement-ivf-pq-index

4cbf200

fix

fed174c

fix

857bbcf

jparismorgan approved these changes May 17, 2024

NikolaosPapailiou approved these changes May 20, 2024

jparismorgan merged commit c81fe46 into main May 20, 2024
6 checks passed

jparismorgan deleted the lums/sc-42599/implement-ivf-pq-index branch May 20, 2024 14:07

lums658 commented Mar 13, 2024

NikolaosPapailiou Mar 29, 2024 • edited

lums658 Apr 2, 2024

NikolaosPapailiou Apr 5, 2024

lums658 Apr 12, 2024 • edited

jparismorgan Apr 16, 2024

jparismorgan Apr 16, 2024

jparismorgan commented May 2, 2024
