Lums/sc 42599/implement ivf pq index #279
# Conflicts: # src/include/detail/flat/qv.h # src/include/detail/ivf/qv.h # src/include/detail/scoring/README.md # src/include/index/flatpq_index.h # src/include/scoring.h
// that the distance functions are inlined.
// @todo Make this SIMD friendly -- do multiple subspaces at a time
// For each (i, j), distances should be stored contiguously
float sub_distance_symmetric(auto&& a, auto&& b) const {
We should make the difference between `symmetric` and `asymmetric` distances more clear. From my understanding, `symmetric` means that both arguments are PQ encoded, and `asymmetric` means that `b` is encoded and `a` is not. We also only use the `asymmetric` distance at the moment; however, we still generate `distance_tables_` in all cases, even though they are only used for `symmetric` distance computation.
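To make the distinction concrete, here is a minimal sketch of the two distance functions. The `pq_codebook` struct, its flat array layout, and the function names are hypothetical and chosen for illustration only; they are not the types used in `ivf_pq_index`. It assumes squared L2 distance and at most 256 clusters per subspace.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified codebook (not the layout used in the PR).
struct pq_codebook {
  std::size_t num_subspaces;
  std::size_t sub_dim;
  std::size_t num_clusters;
  // centroids[(s * num_clusters + c) * sub_dim + d]
  std::vector<float> centroids;
  // tables[(s * num_clusters + i) * num_clusters + j] holds the squared L2
  // distance between centroids i and j of subspace s.
  std::vector<float> tables;

  const float* centroid(std::size_t s, std::size_t c) const {
    return centroids.data() + (s * num_clusters + c) * sub_dim;
  }
};

// Squared L2 distance between two sub-vectors of length n.
inline float sub_l2(const float* x, const float* y, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t d = 0; d < n; ++d) {
    const float diff = x[d] - y[d];
    sum += diff * diff;
  }
  return sum;
}

// Precompute the centroid-to-centroid tables needed by the symmetric distance.
inline void build_tables(pq_codebook& cb) {
  cb.tables.assign(cb.num_subspaces * cb.num_clusters * cb.num_clusters, 0.0f);
  for (std::size_t s = 0; s < cb.num_subspaces; ++s) {
    for (std::size_t i = 0; i < cb.num_clusters; ++i) {
      for (std::size_t j = 0; j < cb.num_clusters; ++j) {
        cb.tables[(s * cb.num_clusters + i) * cb.num_clusters + j] =
            sub_l2(cb.centroid(s, i), cb.centroid(s, j), cb.sub_dim);
      }
    }
  }
}

// Asymmetric: a is a flat (unencoded) vector, b is PQ encoded.
inline float distance_asymmetric(const pq_codebook& cb,
                                 const std::vector<float>& a,
                                 const std::vector<std::uint8_t>& b) {
  assert(a.size() == cb.num_subspaces * cb.sub_dim);
  assert(b.size() == cb.num_subspaces);
  float sum = 0.0f;
  for (std::size_t s = 0; s < cb.num_subspaces; ++s) {
    sum += sub_l2(a.data() + s * cb.sub_dim, cb.centroid(s, b[s]), cb.sub_dim);
  }
  return sum;
}

// Symmetric: both a and b are PQ encoded, so each subspace is a table lookup.
inline float distance_symmetric(const pq_codebook& cb,
                                const std::vector<std::uint8_t>& a,
                                const std::vector<std::uint8_t>& b) {
  assert(a.size() == cb.num_subspaces && b.size() == cb.num_subspaces);
  assert(cb.tables.size() ==
         cb.num_subspaces * cb.num_clusters * cb.num_clusters);
  float sum = 0.0f;
  for (std::size_t s = 0; s < cb.num_subspaces; ++s) {
    sum += cb.tables[(s * cb.num_clusters + a[s]) * cb.num_clusters + b[s]];
  }
  return sum;
}
```

In this sketch only `distance_symmetric` reads `tables`, which matches the observation that the distance tables are unused on the asymmetric path.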
That is also my understanding of `symmetric` vs `asymmetric` distance. The index is created to be agnostic to which query might be presented to it, so it computes the `distance_tables` regardless. It was also intended to have both `symmetric` and `asymmetric` queries, like `flat_pq_index`, but I decided to just put in queries to match names/functionality from `ivf_flat` and picked `asymmetric` as the underlying query.

We should add support for `symmetric` queries. If we have the options of finite vs infinite (and maybe even finer grained selection of algorithm) as well as `symmetric` vs `asymmetric`, we may want to decide on a clean API -- though `query_{finite, infinite}_ram_{symmetric,asymmetric}` is probably as good a naming scheme as any. I can make this change or we can leave it to another PR. Besides the functions we will also need additional unit tests.
We could conceivably only create (or load) the distance tables the first time a symmetric query is requested. But my own opinion is that we should have the index complete when it is constructed. If this change is to be made, it should probably be its own PR.
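For reference, the lazy alternative could look roughly like the sketch below. The class name `lazy_distance_tables` and its interface are hypothetical, and the builder callback stands in for whatever would compute or load the tables. This version is not thread-safe; concurrent queries would need additional synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical holder that defers building (or loading) the distance tables
// until the first symmetric query asks for them. Not thread-safe.
class lazy_distance_tables {
 public:
  using table_type = std::vector<float>;
  using builder_type = std::function<table_type()>;

  explicit lazy_distance_tables(builder_type builder)
      : builder_(std::move(builder)) {
  }

  // Builds the tables on first use, then returns the cached copy.
  const table_type& get() {
    if (!tables_) {
      tables_ = builder_();
    }
    return *tables_;
  }

  bool is_built() const {
    return tables_.has_value();
  }

 private:
  builder_type builder_;
  std::optional<table_type> tables_;
};
```

The trade-off is the one described above: the index is no longer complete at construction time, and the first symmetric query pays the cost of building the tables.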
Let's add some more comments for symmetric vs asymmetric in this PR, plus some TODOs and tasks to follow up on supporting `symmetric` queries. Curious, @lums658: which method do you expect to be the most performant? Shouldn't it be `symmetric`, using the distance table?
For just the distance function, if you are given an already-encoded query, symmetric is probably faster. However, if you are given an unencoded query vector, encoding the query vector is quite expensive. I haven't done the experiments to see if that would be amortized over an entire query or not.
OTOH, we can accelerate encoding by quite a bit by doing it with SIMD instructions, so that will definitely change the comparison. And of course, we can use SIMD instructions for both the symmetric and asymmetric queries.
We need to do the experiments to actually understand the performance / accuracy tradeoffs.
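To illustrate where the encoding cost comes from, here is a minimal scalar sketch of PQ encoding a single query vector. The function name `pq_encode` and the flat `[subspace][cluster][sub_dim]` centroid layout are assumptions for this example, not the PR's actual code. Each subspace scans every centroid, so encoding one query costs on the order of `num_clusters * dimension` multiply-adds, roughly the work of `num_clusters` asymmetric distance evaluations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical scalar PQ encoder. centroids is laid out as
// centroids[(s * num_clusters + c) * sub_dim + d].
inline std::vector<std::uint8_t> pq_encode(const std::vector<float>& v,
                                           const std::vector<float>& centroids,
                                           std::size_t num_subspaces,
                                           std::size_t num_clusters) {
  assert(num_subspaces > 0 && v.size() % num_subspaces == 0);
  assert(num_clusters > 0 && num_clusters <= 256);
  const std::size_t sub_dim = v.size() / num_subspaces;
  assert(centroids.size() == num_subspaces * num_clusters * sub_dim);

  std::vector<std::uint8_t> code(num_subspaces);
  for (std::size_t s = 0; s < num_subspaces; ++s) {
    float best = std::numeric_limits<float>::max();
    std::size_t best_c = 0;
    // Exhaustive search for the nearest centroid in this subspace.
    for (std::size_t c = 0; c < num_clusters; ++c) {
      const float* centroid =
          centroids.data() + (s * num_clusters + c) * sub_dim;
      float dist = 0.0f;
      for (std::size_t d = 0; d < sub_dim; ++d) {
        const float diff = v[s * sub_dim + d] - centroid[d];
        dist += diff * diff;
      }
      if (dist < best) {
        best = dist;
        best_c = c;
      }
    }
    code[s] = static_cast<std::uint8_t>(best_c);
  }
  return code;
}
```

Whether this cost is amortized over a whole query depends on how many encoded vectors the query ends up being compared against, which is what the proposed experiments would measure.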
Story 45153 created for this.
src/include/index/ivf_flat_group.h
@@ -179,7 +181,6 @@ class ivf_flat_index_group
metadata_.base_sizes_ = {0};
metadata_.partition_history_ = {0};
metadata_.temp_size_ = 0;
metadata_.dimension_ = this->get_dimension();
@lums658 could you help explain why we made this change and instead do `this->set_dimension(this->cached_index_.get().dimension());`? Is there a bug here?
src/include/index/ivf_pq_group.h
public:
using index_group_metadata_type = ivf_pq_metadata;
// using index_type = Index;
@lums658 could you help explain this? Is this a TODO to change to this later? Or can I remove it?
NOTE(paris): I'm currently waiting to merge this until Vamana is fully working end to end, as that will help keep the changes I need to do when fixing bugs in the group / metadata / etc. down. Once it's working I'll bring this in and apply all changes.
This PR implements IVF index with PQ encoding. It requires the results from #265 so this is a branch from there.
As the name implies, this index synthesizes the indexing from IVF with the vector encoding from PQ. Ultimately there should be a decomposition of indexes into index representation, vector representation, and distance computation. However, because there is still a need to gather sufficient implementation experience with the different variants of each (and their compositions), the current implementation is monolithic. The new story sc-43058 was created for this refactoring.
The current scheme for C++ indexes includes implementing a class for metadata, for the storage group, and for the index itself. These exist as separate files in the `index` subdirectory for each of the current indexes. (Note that `flat_l2_index` and `flat_pq_index` have not been refactored this way yet, though placeholder files for the metadata and group classes have been created.)

Group and Metadata
The metadata `ivf_pq_metadata` essentially includes the metadata for each of IVF Flat and Flat PQ. The group `ivf_pq_group` contains arrays that are the integration of IVF and PQ:

- `cluster_centroids` -- the centroids used for encoding with PQ
- `flat_ivf_centroids` -- the flat (unencoded) centroids used for IVF indexing. These are obtained by applying kmeans to the flat input vectors.
- `pq_ivf_centroids` -- the PQ encoded centroids used for IVF indexing. These are currently obtained by PQ encoding the `flat_ivf_centroids`.
- `pq_ivf_vectors` -- PQ encoded vectors from the input, partitioned and shuffled
- `ivf_index` -- IVF index vector (same as in IVF Flat)
- `ivf_ids` -- vector ids (same as in IVF Flat)
- `distance_tables_*` -- distance tables, one table per subspace and one array per table. These should probably go into a single array at some point.

The data for `pq_ivf_vectors`, `ivf_index`, and `ivf_ids` are maintained as a `PartitionedMatrix` (or `tdbPartitionedMatrix`).

Ingestion
The first step in ingestion is in `train_pq`, which generates the `cluster_centroids` -- using `sub_kmeans` as was used in the previous PQ related PR. (Note that all kmeans functions have been moved to the file kmeans.h.) Once we have the `cluster_centroids` we have the ability to encode all of the input vectors. `train_pq` also generates the `distance_tables`.

The `train_ivf` function applies kmeans to create a set of centroids and partition the input vectors. There are two ways this could be accomplished: partition the flat vectors first to create a set of flat centroids, or encode the vectors first and apply kmeans to the encoded vectors to create a set of encoded (or unencoded) centroids. We opted for the former in this PR. With the former, we can then create compressed `pq_ivf_centroids` from the `flat_ivf_centroids`.

Once we have the `ivf_centroids` (`flat` or `pq`), we can partition and shuffle the input vectors (unencoded or encoded). In our case, we use the flat centroids to create the partitioning and then apply that partitioning to shuffle the encoded vectors. Again, there are multiple variations to try in terms of whether to use encoded or unencoded quantities at various steps in the ingestion / indexing process.

Right now everything happens in the `add` function. There is not currently a separate `train` function, though perhaps this `add()` function should be split into different parts. These pieces were kept together for the time being to allow experimentation with different ordering of encoding and so forth.

Note that when the `create_default_impl` function is called during ingestion, the metadata are set with information from the index at the very top of the function, as that information will be used later on in the function.

Queries
Queries are applied by calling the existing query functions to search the ivf index, using an appropriate distance function. If we pass in flat query vectors, we use the asymmetric distance function. If we pass in compressed query vectors, we use the symmetric distance function. Currently we do not compress the queries, so we use the asymmetric distance function. Per sc-43051, we should SIMDize these functions, as well as reorder the loops for encoding, etc -- which should result in 10-20X speedup. The `ivf_pq_index` class currently includes only `query_finite` and `query_infinite` (it does not have calls to the various qv, nut, and so forth).

Unit Testing
Unit testing performs most of the same tests as for `flat_pq` and for `ivf_flat` -- though currently not as thoroughly.

Distributed Computation
This should be as easy to distribute as IVF Flat, since partitioning and querying (etc) are oblivious to vector representation.
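As a rough illustration of why partitioning is oblivious to vector representation, the sketch below shuffles a set of vectors into contiguous partitions given only one partition label per vector. The `partitioned` struct and `shuffle_by_partition` function are hypothetical names for this example and are not the `PartitionedMatrix` API. The labels could come from the flat centroids, as in this PR, while the shuffled elements can be flat vectors or PQ codes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type, loosely mirroring the roles of the shuffled
// vectors, ivf_ids, and ivf_index described in the PR.
template <class Vector>
struct partitioned {
  std::vector<Vector> vectors;     // shuffled so each partition is contiguous
  std::vector<std::size_t> ids;    // original position of each shuffled vector
  std::vector<std::size_t> index;  // partition p is [index[p], index[p + 1])
};

// Counting-sort style shuffle. Vector can be any copyable type, for example
// a flat std::vector<float> or an encoded std::vector<unsigned char>.
template <class Vector>
partitioned<Vector> shuffle_by_partition(const std::vector<Vector>& vectors,
                                         const std::vector<std::size_t>& labels,
                                         std::size_t num_partitions) {
  assert(vectors.size() == labels.size());
  partitioned<Vector> out;

  // Count the size of each partition, then prefix-sum into start offsets.
  out.index.assign(num_partitions + 1, 0);
  for (std::size_t label : labels) {
    assert(label < num_partitions);
    ++out.index[label + 1];
  }
  for (std::size_t p = 0; p < num_partitions; ++p) {
    out.index[p + 1] += out.index[p];
  }

  // Place each vector at the next free slot of its partition.
  out.vectors.resize(vectors.size());
  out.ids.resize(vectors.size());
  std::vector<std::size_t> next(out.index.begin(), out.index.end() - 1);
  for (std::size_t i = 0; i < vectors.size(); ++i) {
    const std::size_t pos = next[labels[i]]++;
    out.vectors[pos] = vectors[i];
    out.ids[pos] = i;
  }
  return out;
}
```

Nothing in the shuffle inspects the contents of a vector, so the same routine applies unchanged to either representation.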