Lums/sc 42599/implement ivf pq index #279

Merged
merged 67 commits into from
May 20, 2024

lums658
Collaborator

@lums658 lums658 commented Mar 13, 2024

This PR implements an IVF index with PQ encoding. It requires the results from #265, so it is branched from there.

As the name implies, this index synthesizes the indexing from IVF with the vector encoding from PQ. Ultimately there should be a decomposition of indexes into index representation, vector representation, and distance computation. However, because there is still a need to gather sufficient implementation experience with the different variants of each (and their compositions), the current implementation is monolithic. The new story sc-43058 was created for this refactoring.

The current scheme for C++ indexes includes implementing a class for metadata, for the storage group, and for the index itself. These exist as separate files in the index subdirectory for each of the current indexes. (Note that flat_l2_index and flat_pq_index have not been refactored this way yet, though placeholder files for the metadata and group classes have been created.)

Group and Metadata

The metadata class ivf_pq_metadata essentially includes the metadata for each of IVF Flat and Flat PQ. The group ivf_pq_group contains arrays that integrate IVF and PQ:

  • cluster_centroids -- the centroids used for encoding with PQ
  • flat_ivf_centroids -- the flat (unencoded) centroids used for IVF indexing. These are obtained by applying kmeans to the flat input vectors.
  • pq_ivf_centroids -- the PQ encoded centroids used for IVF indexing. These are currently obtained by PQ encoding the flat_ivf_centroids.
  • pq_ivf_vectors -- PQ encoded vectors from the input, partitioned and shuffled
  • ivf_index -- IVF index vector (same as in IVF Flat)
  • ivf_ids -- vector ids (same as in IVF Flat)
  • distance_tables_* -- distance tables, one table per subspace and one array per table. These should probably go into a single array at some point.

The data for pq_ivf_vectors, ivf_index, and ivf_ids are maintained as a PartitionedMatrix (or tdbPartitionedMatrix).
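
To make the partitioned layout concrete, here is a minimal sketch of how the three arrays could fit together. It assumes that ivf_index holds partition start offsets (one more entry than there are partitions), as is common for IVF layouts. The struct `PartitionedCodes` and its members are hypothetical names for illustration, not the actual PartitionedMatrix API.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the partitioned layout (not the real PartitionedMatrix).
struct PartitionedCodes {
  std::size_t num_subspaces;        // bytes per PQ-encoded vector (assumption: 1 byte per subspace)
  std::vector<std::uint8_t> codes;  // pq_ivf_vectors, stored partition by partition
  std::vector<std::uint64_t> ids;   // ivf_ids: original id of each stored vector
  std::vector<std::size_t> index;   // ivf_index: partition p spans [index[p], index[p + 1])

  std::size_t num_partitions() const { return index.size() - 1; }

  // Half-open range of stored positions that belong to partition p.
  std::pair<std::size_t, std::size_t> partition_range(std::size_t p) const {
    return {index[p], index[p + 1]};
  }

  // Pointer to the PQ code of the vector stored at position i.
  const std::uint8_t* code(std::size_t i) const {
    return codes.data() + i * num_subspaces;
  }
};
```

With this layout, searching a partition only requires walking one contiguous range of codes and looking up the matching entries in ids.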

Ingestion

The first step in ingestion is train_pq, which generates the cluster_centroids using sub_kmeans, as in the previous PQ-related PR. (Note that all kmeans functions have been moved to the file kmeans.h.) Once we have the cluster_centroids, we can encode all of the input vectors. train_pq also generates the distance_tables.
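
As an illustration of what encoding with the cluster_centroids means, here is a minimal sketch of PQ-encoding a single vector. It is not the PR's pq_encode: the function name `pq_encode_one` and the nested-vector centroid layout are hypothetical, and it assumes squared L2 distance, a dimension divisible by the number of subspaces, and at most 256 centroids per subspace.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical layout: centroids[s][k] is centroid k of subspace s (sub_dim floats).
using SubspaceCentroids = std::vector<std::vector<std::vector<float>>>;

// Encode one flat vector: for each subspace, store the index of the nearest centroid.
std::vector<std::uint8_t> pq_encode_one(const std::vector<float>& v,
                                        const SubspaceCentroids& centroids) {
  const std::size_t num_subspaces = centroids.size();
  const std::size_t sub_dim = v.size() / num_subspaces;
  std::vector<std::uint8_t> code(num_subspaces);
  for (std::size_t s = 0; s < num_subspaces; ++s) {
    float best = std::numeric_limits<float>::max();
    for (std::size_t k = 0; k < centroids[s].size(); ++k) {
      // Squared L2 distance between the s-th slice of v and centroid k.
      float d = 0.0f;
      for (std::size_t j = 0; j < sub_dim; ++j) {
        const float diff = v[s * sub_dim + j] - centroids[s][k][j];
        d += diff * diff;
      }
      if (d < best) {
        best = d;
        code[s] = static_cast<std::uint8_t>(k);
      }
    }
  }
  return code;
}
```

Each input vector shrinks to one byte per subspace, which is why encoding the query itself is a nontrivial cost: it needs a nearest-centroid search in every subspace.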

The train_ivf function applies kmeans to create a set of centroids and partition the input vectors. There are two ways this could be accomplished: (1) partition the flat vectors first to create a set of flat centroids, or (2) encode the vectors first and then apply kmeans to the encoded vectors to create a set of encoded (or unencoded) centroids. We opted for the former in this PR, which lets us then create the compressed pq_ivf_centroids from the flat_ivf_centroids.

Once we have the ivf_centroids (flat or pq), we can partition and shuffle the input vectors (unencoded or encoded). In our case, we use the flat centroids to create the partitioning and then apply that partitioning to shuffle the encoded vectors. Again, there are multiple variations to try in terms of whether to use encoded or unencoded quantities at various steps in the ingestion / indexing process.
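
The partition-then-shuffle step can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's qv_partition or pq_storage_type: the names `partition_labels`, `Shuffled`, and `shuffle_by_label` are hypothetical, vectors are plain nested `std::vector`s, distance is squared L2, and the shuffle is a single-threaded counting sort.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Label each flat vector with the index of its nearest flat centroid.
std::vector<std::size_t> partition_labels(
    const std::vector<std::vector<float>>& centroids,
    const std::vector<std::vector<float>>& vectors) {
  std::vector<std::size_t> labels(vectors.size(), 0);
  for (std::size_t i = 0; i < vectors.size(); ++i) {
    float best = std::numeric_limits<float>::max();
    for (std::size_t c = 0; c < centroids.size(); ++c) {
      float d = 0.0f;
      for (std::size_t j = 0; j < vectors[i].size(); ++j) {
        const float diff = vectors[i][j] - centroids[c][j];
        d += diff * diff;
      }
      if (d < best) {
        best = d;
        labels[i] = c;
      }
    }
  }
  return labels;
}

// Hypothetical result type: encoded vectors grouped by partition.
struct Shuffled {
  std::vector<std::vector<std::uint8_t>> codes;  // encoded vectors, partition by partition
  std::vector<std::size_t> ids;                  // original position of each stored vector
  std::vector<std::size_t> index;                // partition p spans [index[p], index[p + 1])
};

// Apply labels computed on the flat vectors to reorder the encoded vectors.
Shuffled shuffle_by_label(const std::vector<std::vector<std::uint8_t>>& codes,
                          const std::vector<std::size_t>& labels,
                          std::size_t num_partitions) {
  Shuffled out;
  out.index.assign(num_partitions + 1, 0);
  for (std::size_t l : labels) ++out.index[l + 1];  // count partition sizes
  for (std::size_t p = 0; p < num_partitions; ++p) {
    out.index[p + 1] += out.index[p];               // prefix sum gives start offsets
  }
  out.codes.resize(codes.size());
  out.ids.resize(codes.size());
  std::vector<std::size_t> next(out.index.begin(), out.index.end() - 1);
  for (std::size_t i = 0; i < codes.size(); ++i) {
    const std::size_t pos = next[labels[i]]++;      // next free slot in that partition
    out.codes[pos] = codes[i];
    out.ids[pos] = i;
  }
  return out;
}
```

Because the labels depend only on the flat vectors and flat centroids, the same labels could equally be used to shuffle unencoded vectors, which is what makes the other orderings mentioned above easy to experiment with.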

Right now everything happens in the add function:

    train_pq(training_set);   // Create cluster_centroids_, distance_tables_
    train_ivf(training_set);   // Create flat_ivf_centroids_
    unpartitioned_pq_vectors_ = pq_encode(training_set); // encode the input vectors
    pq_ivf_centroids_ = std::move(*pq_encode(flat_ivf_centroids_)); // encode the flat ivf centroids

    auto partition_labels = detail::flat::qv_partition(
        flat_ivf_centroids_, training_set, num_threads_, distance);  // partition

    partitioned_pq_vectors_ = std::make_unique<pq_storage_type>(
        *unpartitioned_pq_vectors_, partition_labels, num_unique_labels);  // shuffle

There is not currently a separate train function, though perhaps this add() function should be split into different parts. These pieces were kept together for the time being to allow experimentation with different ordering of encoding and so forth.

Note that when the create_default_impl function is called during ingestion, the metadata are set with information from the index at the very top of the function, as that information will be used later on in the function.

Queries

Queries are applied by calling the existing query functions to search the IVF index, using an appropriate distance function. If we pass in flat query vectors, we use the asymmetric distance function. If we pass in compressed query vectors, we use the symmetric distance function. Currently we do not compress the queries, so we use the asymmetric distance function. Per sc-43051, we should SIMDize these functions, as well as reorder the loops for encoding, etc. -- which should result in a 10-20X speedup. The ivf_pq_index class currently includes only query_finite and query_infinite (it does not have calls to the various qv, nuv, and so forth).
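
The difference between the two distance functions can be sketched as follows. This is an illustrative sketch rather than the PR's sub_distance_symmetric or its asymmetric counterpart: all names and the nested-vector layouts are hypothetical, and it assumes squared L2 distance with one byte per subspace code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: centroids[s][k] is centroid k of subspace s.
using SubspaceCentroids = std::vector<std::vector<std::vector<float>>>;
// tables[s][i][j] is the squared L2 distance between centroids i and j of subspace s.
using DistanceTables = std::vector<std::vector<std::vector<float>>>;

// Squared L2 distance between two sub-vectors of length n.
inline float sub_l2(const float* a, const float* b, std::size_t n) {
  float d = 0.0f;
  for (std::size_t j = 0; j < n; ++j) {
    const float diff = a[j] - b[j];
    d += diff * diff;
  }
  return d;
}

// Precompute centroid-to-centroid distances, one table per subspace.
DistanceTables make_distance_tables(const SubspaceCentroids& c) {
  DistanceTables t(c.size());
  for (std::size_t s = 0; s < c.size(); ++s) {
    const std::size_t k = c[s].size();
    t[s].assign(k, std::vector<float>(k, 0.0f));
    for (std::size_t i = 0; i < k; ++i) {
      for (std::size_t j = 0; j < k; ++j) {
        t[s][i][j] = sub_l2(c[s][i].data(), c[s][j].data(), c[s][i].size());
      }
    }
  }
  return t;
}

// Asymmetric: flat query against an encoded vector; decodes via the centroids.
float asymmetric_distance(const std::vector<float>& query,
                          const std::vector<std::uint8_t>& code,
                          const SubspaceCentroids& c) {
  const std::size_t sub_dim = query.size() / c.size();
  float d = 0.0f;
  for (std::size_t s = 0; s < c.size(); ++s) {
    d += sub_l2(query.data() + s * sub_dim, c[s][code[s]].data(), sub_dim);
  }
  return d;
}

// Symmetric: both operands encoded; one table lookup per subspace.
float symmetric_distance(const std::vector<std::uint8_t>& a,
                         const std::vector<std::uint8_t>& b,
                         const DistanceTables& t) {
  float d = 0.0f;
  for (std::size_t s = 0; s < t.size(); ++s) d += t[s][a[s]][b[s]];
  return d;
}
```

In this sketch the symmetric path does no floating-point work beyond summing table entries, but it only applies once the query has been encoded; the asymmetric path skips query encoding at the cost of computing a sub-vector distance per subspace.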

Unit Testing

Unit testing performs most of the same tests as for flat_pq and for ivf_flat -- though currently not as thoroughly.

Distributed Computation

This should be as easy to distribute as IVF Flat, since partitioning and querying (etc) are oblivious to vector representation.

# Conflicts:
#	src/include/detail/flat/qv.h
#	src/include/detail/ivf/qv.h
#	src/include/detail/scoring/README.md
#	src/include/index/flatpq_index.h
#	src/include/scoring.h
src/include/index/ivf_pq_index.h
// that the distance functions are inlined.
// @todo Make this SIMD friendly -- do multiple subspaces at a time
// For each (i, j), distances should be stored contiguously
float sub_distance_symmetric(auto&& a, auto&& b) const {
Collaborator

We should make the difference between symmetric and asymmetric distances more clear.

From my understanding, symmetric means that both vectors are PQ-encoded, and asymmetric means that b is encoded and a is not. We also only use the asymmetric distance at the moment; however, we still generate distance_tables_ in all cases, even though they are only used for symmetric distance computation.

Collaborator Author

That is also my understanding of symmetric vs asymmetric distance. The index is created to be agnostic to which kind of query might be presented to it, so it computes the distance_tables regardless. It was also intended to have both symmetric and asymmetric queries, like flat_pq_index, but I decided to just put in queries that match the names/functionality from ivf_flat and picked asymmetric as the underlying query.

We should add support for symmetric queries. If we have the options of finite vs infinite (and maybe even finer grained selection of algorithm) as well as symmetric vs asymmetric we may want to decide on a clean API -- though query_{finite, infinite}_ram_{symmetric,asymmetric} is probably as good a naming scheme as any. I can make this change or we can leave it to another PR. Besides the functions we will also need additional unit tests.

We could conceivably only create (or load) the distance tables the first time a symmetric query is requested. But my own opinion is that we should have the index complete when it is constructed. If this change is to be made, it should probably be its own PR.

Collaborator

Let's add some more comments on symmetric vs asymmetric in this PR, plus some TODOs and tasks to follow up on supporting symmetric queries. Curious, @lums658: which method do you expect to be the most performant? Shouldn't it be symmetric, since it uses the distance table?

Collaborator Author

@lums658 lums658 Apr 12, 2024

For just the distance function, if you are given an already-encoded query, symmetric is probably faster. However, if you are given an unencoded query vector, encoding the query vector is quite expensive. I haven't done the experiments to see if that would be amortized over an entire query or not.

OTOH, we can accelerate encoding by quite a bit by doing it with SIMD instructions, so that will definitely change the comparison. And of course, we can use SIMD instructions for both the symmetric and asymmetric queries.

We need to do the experiments to actually understand the performance / accuracy tradeoffs.

Story 45153 created for this.

src/include/index/ivf_pq_group.h
@@ -179,7 +181,6 @@ class ivf_flat_index_group
metadata_.base_sizes_ = {0};
metadata_.partition_history_ = {0};
metadata_.temp_size_ = 0;
metadata_.dimension_ = this->get_dimension();
Contributor

@lums658 could you help explain why we made this change and instead do this->set_dimension(this->cached_index_.get().dimension());? Is there a bug here?


public:
using index_group_metadata_type = ivf_pq_metadata;
// using index_type = Index;
Contributor

@lums658 could you help explain this? Is this a TODO to change to this later? Or can I remove it?

@jparismorgan
Contributor

NOTE(paris): I'm currently waiting to merge this until Vamana is fully working end to end, as that will help keep the changes I need to do when fixing bugs in the group / metadata / etc. down. Once it's working I'll bring this in and apply all changes.

@jparismorgan jparismorgan merged commit c81fe46 into main May 20, 2024
6 checks passed
@jparismorgan jparismorgan deleted the lums/sc-42599/implement-ivf-pq-index branch May 20, 2024 14:07