Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate COSINE VectorSimilarity function #13308

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Pulkitg64
Copy link
Contributor

Description

Tries to solve #13281 by deprecating COSINE VectorSimilarity function for Vector Search.

@Pulkitg64
Copy link
Contributor Author

Pulkitg64 commented Apr 15, 2024

TestInt8HnswBackwardsCompatibility.java and TestBasicBackwardsCompatibility test cases failing for Lucene versions older than LUCENE_10_0_0. For older Lucene version, we try to read Index directory from resources folder. For example for Lucene-9.10.0, the directory path is located at lucene/backward-codecs/src/test/org/apache/lucene/backward_index/index.9.10.0-cfs.zip.

As per my understanding the segment present at above directory already has KNNVectorField with MAXIMUM_INNER_PRODUCT as vector similarity function and in above test we try to change similarity function to DOT_PRODUCT due to which there is conflict happening when adding more docs to it and hence these test cases failing.

@jpountz
Copy link
Contributor

jpountz commented Apr 17, 2024

Oh okay, that means we need to remove support for COSINE just from indexing side not from searching side?

Correct. Practically, this means keeping the enum constant but marking it as deprecated, and modifying IndexingChain to fail using COSINE on 10.x Lucene indexes.

It would be nice to also add a new file format under lucene/codecs that does cosine similarity, like we just added for hamming distance. This could help with the migration for some users.

@Pulkitg64
Copy link
Contributor Author

Thanks @jpountz for confirming and suggestions.

@@ -302,7 +301,6 @@ private static VectorSimilarityFunction getDistFunc(IndexInput input, byte b) th
List.of(
VectorSimilarityFunction.EUCLIDEAN,
VectorSimilarityFunction.DOT_PRODUCT,
VectorSimilarityFunction.COSINE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is exactly the problem. There must be a null entry, so the ordinals in index format stay alive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ The reason for the list here is exactly to support and allow such evolution and separation from the enum ordinals. A null here trivially need to a change of List implementation type. And tests need to be carefully updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, null does not work with default lists :-) Also downstream the code needs to check for a null value.

In any case we can later also go back to a Map<Integer,VectorSimilarity> or to an array.

Copy link
Contributor Author

@Pulkitg64 Pulkitg64 Apr 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @uschindler @ChrisHegarty for the pointers, but IIUC, we cannot add a null entry to VectorSimilarityFunction map for COSINE in these files because we still want to use COSINE similarity function for Lucene versions < 10.0, right? Instead we need to create a separate file for new versions. Please correct me if I am wrong here.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a larger review.

I think it would be prudent to first attempt to deprecate and handle internal usage, etc. After that is merged and backported, another PR can be made to fully remove it from main only on writes.

At least, that would make reviewing & testing this simpler :)

@@ -98,8 +99,14 @@ public class TestBasicBackwardsCompatibility extends BackwardsCompatibilityTestB
private static final int KNN_VECTOR_MIN_SUPPORTED_VERSION = LUCENE_9_0_0.major;
private static final String KNN_VECTOR_FIELD = "knn_field";
private static final FieldType KNN_VECTOR_FIELD_TYPE =
KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE);
KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.DOT_PRODUCT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure this will work as simply as this. We will index NEW fields into an index created with the OLD code (see where we call addDoc), meaning you will be trying to index a dot_product knn field into an index with a cosine field.

You will have to handle the index versions from before a certain version (thus on cosine) to ones created after (and thus dot-product).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense.

private static final float[] KNN_VECTOR = {0.2f, -0.1f, 0.1f};
private static final float[] NORMALIZED_KNN_VECTOR = new float[KNN_VECTOR.length];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same sort of comment as above on versioning and that we index fields into an old index.

@@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) {
}
},

/**
* Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to deprecate this first, get all the internal paths updated to not use it.

Then the deprecation can be merged and backported. Then work on removing it from main.

Mostly because the deprecation path and internal code usage will all have to occur anyways :)

@@ -30,23 +28,6 @@
public class ScalarQuantizedRandomVectorScorer
extends RandomVectorScorer.AbstractRandomVectorScorer<byte[]> {

public static float quantizeQuery(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in Lucene 10, we will need to support quantizing a query sent to a Lucene index that uses Cosine.

return (1 + cosine(v1, v2)) / 2;
}
},

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will get tricky. If you remove this ordinal (I think it would be wise to deprecate first, work through the warnings, etc. get everything passing, etc.), field info & knn vector format readers will need to somehow distinguish between "Max inner product" (whose ordinal decreases to cosine's) and "cosine but in an old index version".

Maybe on write, we adjust the ordinal dynamically, knowing what they are (thus correcting MIP's ordinal on write, allowing for readers to distinguish).

@Pulkitg64
Copy link
Contributor Author

Thanks @benwtrent for all the feedback.
I have raised another revision just to mark the COSINE function as deprecated. I didn't find any internal usages of this function.
In a followup PR, I will remove COSINE from writes for LUCENE version >=10.0.0

@Pulkitg64 Pulkitg64 requested a review from benwtrent May 7, 2024 17:26
@benwtrent
Copy link
Member

I didn't find any internal usages of this function.

That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE.

@Pulkitg64
Copy link
Contributor Author

I didn't find any internal usages of this function.

That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE.

Sorry I think I am missing something, I think we cannot remove COSINE from those places, since we require them in LUCENE version < 10.0. They are mostly being used in FieldInfosFormat, VectorReader, ScalarQuantizedVectorWriter classes.

I think the main task is to first get rid of dependency on enum for VectorSimilarityFunction and instead have some interface to get those functions and support.

@benwtrent
Copy link
Member

@Pulkitg64 could you deprecate also VectorUtil.cosine? That is technically a public API.

Comment on lines 105 to 106
* GITHUB#13281: Mark COSINE VectorSimilarityFunction as deprecated. (Pulkit Gupta)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be under 9.11 as we will backport the deprecation as a way to warn users that it is going away in 10.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Make sense. Will change in next revision
Thanks for mentioning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though you it would be good get another reviewers approval too.

@Pulkitg64 Pulkitg64 requested a review from benwtrent May 16, 2024 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants