[Bug]: INVERTED scalar filter has low precision in query/search #32717

Closed
1 task done
ghallsimpsons opened this issue Apr 29, 2024 · 7 comments
Labels: kind/bug, triage/accepted

@ghallsimpsons

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.0
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: indexnode: 4x(2 CPU, 2GB); querynode: 2x(8CPU, 32GB)
- GPU: No
- Others:

Current Behavior

When running client.query(..., expr="my_ind == 1"), where my_ind is an integer field (tested with int16 and int32) and the index is INVERTED, only a small (though statistically significant) fraction of the returned rows satisfy the condition. Typical precision is 20-40% against an underlying density of 10%. With STL_SORT or with no index, precision is 100%.
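
For clarity, "precision" here means the fraction of returned rows that actually satisfy the filter. A minimal sketch of that measurement, independent of Milvus, is below. The `precision` helper and the sample rows are illustrative assumptions and are not part of pymilvus.

```python
from typing import Callable, Iterable, Mapping


def precision(rows: Iterable[Mapping], predicate: Callable[[Mapping], bool]) -> float:
    """Fraction of rows for which predicate(row) is true.

    An empty result set is treated as precision 1.0, since it contains
    no row that violates the filter (a convention chosen for this sketch).
    """
    rows = list(rows)
    if not rows:
        return 1.0
    hits = sum(1 for row in rows if predicate(row))
    return hits / len(rows)


# Hypothetical rows, shaped like the dicts returned by a query.
sample_rows = [{"my_ind": 1}, {"my_ind": 7}, {"my_ind": 1}, {"my_ind": 3}]
p = precision(sample_rows, lambda row: row["my_ind"] == 1)
print(f"precision = {p:.2f}")
```

With a correct filter, every returned row matches and this value is 1.0 regardless of how dense the matching rows are in the collection.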

Expected Behavior

Either query(..., expr="my_ind == 1") should have 100% precision, or the documentation should be updated to describe the expected behavior.

Steps To Reproduce

from pymilvus import FieldSchema, CollectionSchema, DataType, MilvusClient
import numpy as np

idx = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128)
no_index = FieldSchema(name="no_index", dtype=DataType.INT16)
default_index = FieldSchema(name="default_index", dtype=DataType.INT16)
inv_index = FieldSchema(name="inv_index", dtype=DataType.INT16)
stl_index = FieldSchema(name="stl_index", dtype=DataType.INT16)
schema = CollectionSchema(fields=[idx, vector, no_index, default_index, inv_index, stl_index], auto_id=True)
client = MilvusClient()
client.drop_collection("index_test")
client.create_collection("index_test", schema=schema)

# Create (or remove) indices
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="default_index",
    index_name="default_index"
)
index_params.add_index(
    field_name="inv_index",
    index_type="INVERTED",
    index_name="inv_index"
)
index_params.add_index(
    field_name="stl_index",
    index_type="STL_SORT",
    index_name="stl_index"
)
index_params.add_index(
    field_name="vector",
    index_type="IVF_SQ8",
    metric_type="L2",
    params={"nlist": 128},
)
client.create_index(
  collection_name="index_test",
  index_params=index_params
)
client.drop_index("index_test", "no_index")

# Make the collection large enough that the indexes are used
for _ in range(10000):
    data = []
    for _ in range(100):
        data.append(
            {
                "vector": np.random.rand(128),
                "no_index": np.random.randint(1000),
                "default_index": np.random.randint(1000),
                "inv_index": np.random.randint(1000),
                "stl_index": np.random.randint(1000),
            }
        )
    client.insert(
        "index_test",
        data=data,
    )

for key in ["no_index", "default_index", "stl_index", "inv_index"]:
    filt = f"{key} in {[i for i in range(1, 1000, 10)]}"
    client.load_collection("index_test")
    all_rows = client.query(
        "index_test",
        limit=128,
        output_fields=["no_index", "default_index", "inv_index", "stl_index"],
        filter=filt,
    )
    # Count the returned rows that actually satisfy the filter (values ending in 1).
    correct_rows = [row[key] for row in all_rows if row[key] % 10 == 1]
    print(f"Index {key}: Total of {len(all_rows)} rows")
    print(f"Index {key}: Total of {len(correct_rows)} correct rows")


Milvus Log

No response

Anything else?

Based on these results, I believe this documentation is also wrong, and that the default scalar index for v2.4 is `INVERTED`: https://milvus.io/docs/scalar_index.md#Default-indexing
ghallsimpsons added the kind/bug and needs-triage labels Apr 29, 2024
@yanliang567
Contributor

/assign @longjiquan
Please help take a look. Meanwhile, I will try to reproduce it in house.

yanliang567 added this to the 2.4.1 milestone Apr 30, 2024
yanliang567 added the triage/accepted label and removed the needs-triage label Apr 30, 2024
@xiaofan-luan
Contributor

INVERTED

@ghallsimpsons should you use the same random numbers for the different fields?
Otherwise, how did you specify your ground truth?
Both indexes should have 100% recall.

@ghallsimpsons
Author

> @ghallsimpsons should you use the same random numbers for the different fields? Otherwise, how did you specify your ground truth? Both indexes should have 100% recall.

Hi ~xiaofan-luan, thanks for helping look into this. There is no ground truth here per se, except for what I am requesting via the query. That is, if I perform a search and add the filter inv_index == 1, I would expect every returned row to have inv_index == 1. This is true of the STL index and the no-index case, but not for the inverted index.
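
The expectation above can be checked entirely on the client: the filter itself is the ground truth, so any returned row whose field value is not in the requested set is a violation. A minimal sketch follows. The `split_by_filter` helper and the sample rows are hypothetical and only illustrate the check; they are not a pymilvus API.

```python
from typing import Iterable, List, Mapping, Tuple


def split_by_filter(
    rows: Iterable[Mapping], field: str, allowed_values: Iterable[int]
) -> Tuple[List[Mapping], List[Mapping]]:
    """Partition rows into (matching, violating) for `field in allowed_values`."""
    allowed = set(allowed_values)
    matching: List[Mapping] = []
    violating: List[Mapping] = []
    for row in rows:
        (matching if row[field] in allowed else violating).append(row)
    return matching, violating


# Hypothetical rows as they might come back from a query with inv_index == 1.
returned = [{"inv_index": 1}, {"inv_index": 42}, {"inv_index": 1}]
matching, violating = split_by_filter(returned, "inv_index", [1])
print(f"{len(matching)} matching, {len(violating)} violating")
```

A non-empty `violating` list means the server returned rows that the filter should have excluded, which is the behavior reported here for the INVERTED index.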

@xiaofan-luan
Contributor

Could you share your code and the results you get?

@yanliang567
Contributor

I have reproduced the issue in house with the code above.

Index no_index: Total of 128 rows
Index no_index: Total of 128 correct rows
Index default_index: Total of 128 rows
Index default_index: Total of 49 correct rows
Index stl_index: Total of 128 rows
Index stl_index: Total of 128 correct rows
Index inv_index: Total of 128 rows
Index inv_index: Total of 49 correct rows

We can see that when filtering on the inverted-index field, the query returns some results that are not in the filter list, e.g.:
[image]
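
As a worked check of the numbers above, the reported output can be turned into a per-index precision. This is only a sketch: the parsing regex and variable names are assumptions made for illustration, and the counts are copied from the output in this comment.

```python
import re

# Output copied verbatim from the reproduction above.
reported_output = """\
Index no_index: Total of 128 rows
Index no_index: Total of 128 correct rows
Index default_index: Total of 128 rows
Index default_index: Total of 49 correct rows
Index stl_index: Total of 128 rows
Index stl_index: Total of 128 correct rows
Index inv_index: Total of 128 rows
Index inv_index: Total of 49 correct rows
"""

pattern = re.compile(r"Index (\w+): Total of (\d+) (correct )?rows")
totals, correct = {}, {}
for line in reported_output.splitlines():
    match = pattern.fullmatch(line.strip())
    if match is None:
        continue
    name, count, is_correct = match.group(1), int(match.group(2)), match.group(3)
    # Lines containing "correct" give the numerator, the others the denominator.
    (correct if is_correct else totals)[name] = count

precision_by_index = {name: correct[name] / totals[name] for name in totals}
for name, value in precision_by_index.items():
    print(f"{name}: {value:.1%}")
```

The no_index and stl_index fields come out at 100%, while default_index and inv_index both come out at 49/128, roughly 38%. That the two figures are identical is consistent with the reporter's suggestion that the default scalar index in 2.4 is INVERTED.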

@longjiquan
Contributor

Thanks for reporting the bug, @ghallsimpsons. It is already fixed in #32858.

sre-ci-robot pushed a commit that referenced this issue May 8, 2024
@ghallsimpsons
Author

Very nice, thanks for the quick fix! I'll give it a go again when 2.4.2 is released.

longjiquan added a commit to longjiquan/milvus that referenced this issue May 9, 2024
sre-ci-robot pushed a commit that referenced this issue May 9, 2024