[Bug]: Incorrect cosine similarity returned in the search results #33072

Closed
1 task done
jpowie01 opened this issue May 15, 2024 · 13 comments
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jpowie01

jpowie01 commented May 15, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.1
- Deployment mode(standalone or cluster): Milvus Lite 2.4.1
- MQ type(rocksmq, pulsar or kafka): -
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.1
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: n1-standard-64 (GCP VM) / 64 vCPU Skylake / 240 GB RAM
- GPU: -
- Others: -

Current Behavior

It looks like the distance value returned when searching a collection indexed with the cosine similarity metric is wrong. It is the negation of what it should be, currently giving us:

  • -1 for proportional vectors,
  • 0 for orthogonal vectors,
  • 1 for opposite vectors.
[Screenshot: 2024-05-15 at 12:47:37]

Expected Behavior

I would expect it to return the value of the cosine similarity, according to the definition, which is:

  • 1 for proportional vectors,
  • 0 for orthogonal vectors,
  • -1 for opposite vectors.
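
For reference, the definition can be spelled out in plain Python. This is only a sketch of the formula dot(a, b) / (|a| * |b|); the helper name `cosine_similarity` and the sample vectors are illustrative assumptions and are not part of pymilvus.

```python
import math


def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [-1, -1, -1, -1]
print(cosine_similarity(query, [-2, -2, -2, -2]))  # proportional -> 1.0
print(cosine_similarity(query, [1, 1, -1, -1]))    # orthogonal   -> 0.0
print(cosine_similarity(query, [1, 1, 1, 1]))      # opposite     -> -1.0
```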

Steps To Reproduce

Here is a full reproduction script:

import json

from pymilvus import DataType, MilvusClient, Collection
from milvus_lite.server_manager import server_manager_instance

local_uri = server_manager_instance.start_and_get_uri("./reproduction.db")

client = MilvusClient(uri=local_uri)
client.drop_collection("points")

schema = MilvusClient.create_schema(
    auto_id=True,
    enable_dynamic_field=False,
)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=4)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding", 
    index_type="HNSW",
    metric_type="COSINE",
    params={ "M": 256, "efConstruction": 512 }
)

client.create_collection(
    collection_name="points",
    schema=schema,
    index_params=index_params
)

client.insert(
    collection_name="points",
    data=[
        {"embedding": [1, 1, 1, 1]},
        {"embedding": [1, 1, -1, -1]},
        {"embedding": [-1, -1, -1, -1]},
    ],
)

search_output = client.search(
    collection_name="points",
    data=[[-1, -1, -1, -1]],
    limit=5,
    output_fields=["embedding"],
)
result = json.dumps(search_output, indent=4)
print(result)

My environment:

$ python --version
Python 3.10.14

$ pip --version
pip 24.0 from /<redacted>/lib/python3.10/site-packages/pip (python 3.10)

$ pip list | grep milvus
milvus-lite                              2.4.1
pymilvus                                 2.4.1

Milvus Log

No response

Anything else?

No response

@jpowie01 jpowie01 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 15, 2024
@jpowie01 jpowie01 changed the title [Bug]: Incorrect cosine distance returned in the search results [Bug]: Incorrect cosine similarity returned in the search results May 15, 2024
@yanliang567
Contributor

/assign @liliu-z
please take a look
/unassign

@sre-ci-robot sre-ci-robot assigned liliu-z and unassigned yanliang567 May 15, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 15, 2024
@xiaofan-luan
Contributor

I don't quite understand.
What is the current problem?
The distance result seems to be correct.

@yanliang567
Contributor

I think searching for a vector with itself as the query should return 0 as the distance, but it returns -1 with cosine.

@xiaofan-luan
Contributor

> I think searching for a vector with itself as the query should return 0 as the distance, but it returns -1 with cosine.

An exact match should give 1; 0 means not related at all.

@xiaofan-luan
Contributor

If it's the opposite direction, then the distance is -1.

@jpowie01
Author

Yes, that's exactly the problem. The values I get back from the search query are incorrect.

If you look closely at the example I provided above, I added three vectors to the database ([1, 1, 1, 1], [1, 1, -1, -1], [-1, -1, -1, -1]) and then searched for the vector [-1, -1, -1, -1]. As a result, I got the following distance values:

  • [1, 1, 1, 1] -> 1.0
  • [1, 1, -1, -1] -> 0.0
  • [-1, -1, -1, -1] -> -1.0 (exact search)

Those values are incorrect for both cosine similarity and cosine distance. From the behaviour I'm seeing, they represent cosine similarity multiplied by -1. But... why?

Here is a snippet of code using scipy & scikit-learn computing those metrics on the same vectors:

>>> from scipy.spatial import distance
>>> distance.cosine([-1, -1, -1, -1], [1, 1, 1, 1])
2.0
>>> distance.cosine([-1, -1, -1, -1], [1, 1, -1, -1])
1.0
>>> distance.cosine([-1, -1, -1, -1], [-1, -1, -1, -1])
0.0

>>> from sklearn.metrics.pairwise import cosine_similarity
>>> cosine_similarity([[-1, -1, -1, -1]], [[1, 1, 1, 1]])
array([[-1.]])
>>> cosine_similarity([[-1, -1, -1, -1]], [[1, 1, -1, -1]])
array([[0.]])
>>> cosine_similarity([[-1, -1, -1, -1]], [[-1, -1, -1, -1]])
array([[1.]])
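
To make the comparison explicit, the reported search values can be checked against each candidate interpretation. The sketch below hard-codes the similarities from the sklearn output above and the distances returned by the search on milvus-lite 2.4.1; the names `rows` and `matches` are illustrative assumptions.

```python
import math

# Each row: stored vector, sklearn cosine similarity to the query
# [-1, -1, -1, -1], and the distance reported by the search.
rows = [
    ([1, 1, 1, 1], -1.0, 1.0),
    ([1, 1, -1, -1], 0.0, 0.0),
    ([-1, -1, -1, -1], 1.0, -1.0),
]


def matches(transform):
    """True if transform(similarity) reproduces every reported distance."""
    return all(
        math.isclose(transform(sim), reported, abs_tol=1e-9)
        for _, sim, reported in rows
    )


print("cosine similarity:  ", matches(lambda s: s))      # False
print("cosine distance:    ", matches(lambda s: 1 - s))  # False
print("negated similarity: ", matches(lambda s: -s))     # True
```

Only the negated similarity reproduces all three reported values.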

@xiaofan-luan
Contributor

/assign @liliu-z

@liliu-z
Member

liliu-z commented May 21, 2024

Tried on the Knowhere and Milvus side; cannot reproduce this. Will try milvus-lite.

@liliu-z
Member

liliu-z commented May 21, 2024

[Screenshot: 2024-05-21 at 9:15:46 PM]

Tried the same script with Milvus-lite; still cannot reproduce.

@jpowie01
Author

What versions of Milvus packages did you use for your reproduction? I'll try to bump my environment to the latest version and see if it's still there.

@jpowie01
Author

I've just reinstalled my environment from scratch with milvus-lite==2.4.1 & pymilvus==2.4.1 and can confirm the reproduction still works for me (on those specific versions).

However, I've just upgraded both of these packages to the latest available versions:

$ pip list | grep milvus
milvus-lite                           2.4.4
pymilvus                              2.4.3

And it looks like that has fixed it! 🙌

[
    [
        {
            "id": 449917763324477442,
            "distance": 1.0,
            "entity": {
                "embedding": [
                    -1.0,
                    -1.0,
                    -1.0,
                    -1.0
                ]
            }
        },
        {
            "id": 449917763324477441,
            "distance": 0.0,
            "entity": {
                "embedding": [
                    1.0,
                    1.0,
                    -1.0,
                    -1.0
                ]
            }
        },
        {
            "id": 449917763324477440,
            "distance": -1.0,
            "entity": {
                "embedding": [
                    1.0,
                    1.0,
                    1.0,
                    1.0
                ]
            }
        }
    ]
]

Has anything changed there? Or is it just the combination of packages I've used before?

@liliu-z
Member

liliu-z commented May 22, 2024

This bug was fixed in milvus-lite 2.4.4.
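
A script that depends on the sign of the returned distance could guard against the affected release. This is a minimal sketch, assuming 2.4.4 is the first fixed milvus-lite version as stated in this thread; versions between 2.4.1 and 2.4.4 were not tested here, and the helper names `parse_version` and `has_cosine_fix` are hypothetical.

```python
import itertools
from importlib.metadata import PackageNotFoundError, version

# Assumption: first milvus-lite release reported as fixed in this thread.
FIXED_IN = (2, 4, 4)


def parse_version(text):
    """Turn '2.4.4' into (2, 4, 4), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_cosine_fix(installed):
    """True if the given version string is at or above FIXED_IN."""
    return parse_version(installed) >= FIXED_IN


try:
    installed = version("milvus-lite")
except PackageNotFoundError:
    print("milvus-lite is not installed")
else:
    status = "fixed" if has_cosine_fix(installed) else "possibly affected, consider upgrading"
    print(f"milvus-lite {installed}: {status}")
```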

@liliu-z liliu-z removed their assignment May 22, 2024
@jpowie01
Author

That's great! Thank you for your help! 🙌
