New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate COSINE VectorSimilarity function #13308
base: main
Are you sure you want to change the base?
Conversation
As per my understanding the segment present at above directory already has |
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java
Outdated
Show resolved
Hide resolved
Correct. Practically, this means keeping the enum constant but marking it as deprecated, and modifying IndexingChain to fail using COSINE on 10.x Lucene indexes. It would be nice to also add a new file format under lucene/codecs that does cosine similarity, like we just added for hamming distance. This could help with the migration for some users. |
Thanks @jpountz for confirming and suggestions. |
@@ -302,7 +301,6 @@ private static VectorSimilarityFunction getDistFunc(IndexInput input, byte b) th | |||
List.of( | |||
VectorSimilarityFunction.EUCLIDEAN, | |||
VectorSimilarityFunction.DOT_PRODUCT, | |||
VectorSimilarityFunction.COSINE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is exactly the problem. There must be a null entry, so the ordinals in index format stay alive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ The reason for the list here is exactly to support and allow such evolution and separation from the enum ordinals. A null here trivially need to a change of List implementation type. And tests need to be carefully updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, null does not work with default lists :-) Also downstream the code needs to check for a null value.
In any case we can later also go back to a Map<Integer,VectorSimilarity> or to an array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @uschindler @ChrisHegarty for the pointers, but IIUC, we cannot add a null entry to VectorSimilarityFunction map for COSINE in these files because we still want to use COSINE similarity function for Lucene versions < 10.0, right? Instead we need to create a separate file for new versions. Please correct me if I am wrong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a larger review.
I think it would be prudent to first attempt to deprecate and handle internal usage, etc. After that is merged and backported, another PR can be made to fully remove it from main only on writes.
At least, that would make reviewing & testing this simpler :)
@@ -98,8 +99,14 @@ public class TestBasicBackwardsCompatibility extends BackwardsCompatibilityTestB | |||
private static final int KNN_VECTOR_MIN_SUPPORTED_VERSION = LUCENE_9_0_0.major; | |||
private static final String KNN_VECTOR_FIELD = "knn_field"; | |||
private static final FieldType KNN_VECTOR_FIELD_TYPE = | |||
KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE); | |||
KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.DOT_PRODUCT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure this will work as simply as this. We will index NEW fields into an index created with the OLD code (see where we call addDoc
), meaning you will be trying to index a dot_product
knn field into an index with a cosine
field.
You will have to handle the index versions from before a certain version (thus on cosine) to ones created after (and thus dot-product).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense.
private static final float[] KNN_VECTOR = {0.2f, -0.1f, 0.1f}; | ||
private static final float[] NORMALIZED_KNN_VECTOR = new float[KNN_VECTOR.length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same sort of comment as above on versioning and that we index fields into an old index.
@@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) { | |||
} | |||
}, | |||
|
|||
/** | |||
* Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to deprecate this first, get all the internal paths updated to not use it.
Then the deprecation can be merged and backported. Then work on removing it from main.
Mostly because the deprecation path and internal code usage will all have to occur anyways :)
@@ -30,23 +28,6 @@ | |||
public class ScalarQuantizedRandomVectorScorer | |||
extends RandomVectorScorer.AbstractRandomVectorScorer<byte[]> { | |||
|
|||
public static float quantizeQuery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in Lucene 10, we will need to support quantizing a query sent to a Lucene index that uses Cosine.
return (1 + cosine(v1, v2)) / 2; | ||
} | ||
}, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will get tricky. If you remove this ordinal (I think it would be wise to deprecate first, work through the warnings, etc. get everything passing, etc.), field info & knn vector format readers will need to somehow distinguish between "Max inner product" (whose ordinal decreases to cosine's) and "cosine but in an old index version".
Maybe on write, we adjust the ordinal dynamically, knowing what they are (thus correcting MIP's ordinal on write, allowing for readers to distinguish).
d500c89
to
589d1e2
Compare
Thanks @benwtrent for all the feedback. |
That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE. |
Sorry I think I am missing something, I think we cannot remove COSINE from those places, since we require them in LUCENE version < 10.0. They are mostly being used in FieldInfosFormat, VectorReader, ScalarQuantizedVectorWriter classes. I think the main task is to first get rid of dependency on enum for VectorSimilarityFunction and instead have some interface to get those functions and support. |
@Pulkitg64 could you deprecate also |
lucene/CHANGES.txt
Outdated
* GITHUB#13281: Mark COSINE VectorSimilarityFunction as deprecated. (Pulkit Gupta) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be under 9.11 as we will backport the deprecation as a way to warn users that it is going away in 10.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Make sense. Will change in next revision
Thanks for mentioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, though you it would be good get another reviewers approval too.
Description
Tries to solve #13281 by deprecating
COSINE
VectorSimilarity function for Vector Search.