[Bug]: INVERTED scalar filter has low precision in query/search #32717

ghallsimpsons · 2024-04-29T17:52:06Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version: 2.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.0
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: indexnode: 4x(2 CPU, 2GB); querynode: 2x(8CPU, 32GB)
- GPU: No
- Others:

Current Behavior

When running client.query(..., expr="my_ind == 1") where my_ind is of int type (tested w/ int16 and int32) and the index is INVERTED, only a small (though statistically significant) fraction of the results satisfy the condition. Typical precision is 20-40% (with a 10% underlying density). STL_SORT and no index both have 100% precision.
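The 10% density quoted above appears to match the filter used in the reproduction script further down; that link is an assumption based on the script, where each field is drawn from `np.random.randint(1000)` and the filter lists every tenth value. A small sketch of the arithmetic, with illustrative variable names:

```python
# Values matched by the filter in the reproduction script: 1, 11, 21, ..., 991.
matching_values = list(range(1, 1000, 10))

# np.random.randint(1000) draws uniformly from the integers 0..999.
domain_size = 1000

# Fraction of all rows expected to satisfy the filter.
density = len(matching_values) / domain_size
print(f"underlying density: {density:.0%}")
```

If the filter were ignored entirely, precision would be expected to sit near this density, so a precision of 20-40% suggests the filter is applied only partially.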

Expected Behavior

Either query(..., expr="my_ind == 1") should have 100% precision, or the documentation should be updated to describe the expected behavior.

Steps To Reproduce

from pymilvus import FieldSchema, CollectionSchema, DataType, MilvusClient
import numpy as np

idx = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128)
no_index = FieldSchema(name="no_index", dtype=DataType.INT16)
default_index = FieldSchema(name="default_index", dtype=DataType.INT16)
inv_index = FieldSchema(name="inv_index", dtype=DataType.INT16)
stl_index = FieldSchema(name="stl_index", dtype=DataType.INT16)
schema = CollectionSchema(fields=[idx, vector, no_index, default_index, inv_index, stl_index], auto_id=True)
client = MilvusClient()
client.drop_collection("index_test")
client.create_collection("index_test", schema=schema)

# Create (or remove) indices
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="default_index",
    index_name="default_index"
)
index_params.add_index(
    field_name="inv_index",
    index_type="INVERTED",
    index_name="inv_index"
)
index_params.add_index(
    field_name="stl_index",
    index_type="STL_SORT",
    index_name="stl_index"
)
index_params.add_index(
    field_name="vector",
    index_type="IVF_SQ8",
    metric_type="L2",
    params={"nlist": 128},
)
client.create_index(
  collection_name="index_test",
  index_params=index_params
)
client.drop_index("index_test", "no_index")

# Make the collection large enough that the indexes are used
for _ in range(10000):
    data = []
    for _ in range(100):
        data.append(
            {
                "vector": np.random.rand(128),
                "no_index": np.random.randint(1000),
                "default_index": np.random.randint(1000),
                "inv_index": np.random.randint(1000),
                "stl_index": np.random.randint(1000),
            }
        )
    client.insert(
        "index_test",
        data=data,
    )

for key in ["no_index", "default_index", "stl_index", "inv_index"]:
    filt = f"{key} in {[i for i in range(1, 1000, 10)]}"
    client.load_collection("index_test")
    all_rows = client.query(
        "index_test",
        limit=128,
        output_fields=["no_index", "default_index", "inv_index", "stl_index"],
        filter=filt,
    )
    # Count the returned rows that actually satisfy the filter for this field
    correct_rows = [row[key] for row in all_rows if row[key] % 10 == 1]
    print(f"Index {key}: Total of {len(all_rows)} rows")
    print(f"Index {key}: Total of {len(correct_rows)} correct rows")



### Milvus Log

_No response_

### Anything else?

Based on these results, I believe this documentation is also wrong, and that the default scalar index for v2.4 is `INVERTED`: https://milvus.io/docs/scalar_index.md#Default-indexing


yanliang567 · 2024-04-30T09:22:54Z

/assign @longjiquan
Please help take a look. Meanwhile, I will try to reproduce it in house.

xiaofan-luan · 2024-05-05T03:24:27Z


@ghallsimpsons should you use the same random numbers for the different fields?
Otherwise, how did you specify your ground truth?
Both indexes should have 100% recall.

ghallsimpsons · 2024-05-06T18:24:05Z

> @ghallsimpsons should you use same random number for different fields? otherwise how did you specify your ground truth? both index should have 100% recall.

Hi @xiaofan-luan, thanks for helping look into this. There is no ground truth here per se, except for what I am requesting via the query. That is, if I perform a search and add the filter inv_index == 1, I would expect every returned row to have inv_index == 1. This is true of the STL index and the no-index case, but not for the inverted index.
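The check described here treats the filter itself as the oracle. Below is a minimal sketch of that idea, assuming the result is a list of dicts like the rows `client.query` returns in the reproduction script. `filter_violations` and `precision` are hypothetical helper names, not part of pymilvus, and the sample rows are hand-written rather than real Milvus output.

```python
from typing import Any, Callable, Dict, List


def filter_violations(
    rows: List[Dict[str, Any]],
    field: str,
    predicate: Callable[[Any], bool],
) -> List[Dict[str, Any]]:
    """Return the rows whose `field` value does not satisfy `predicate`."""
    return [row for row in rows if not predicate(row[field])]


def precision(
    rows: List[Dict[str, Any]],
    field: str,
    predicate: Callable[[Any], bool],
) -> float:
    """Fraction of returned rows that satisfy the filter (1.0 for an empty result)."""
    if not rows:
        return 1.0
    bad = filter_violations(rows, field, predicate)
    return (len(rows) - len(bad)) / len(rows)


# Hand-written sample rows standing in for a query result.
sample = [
    {"inv_index": 1},
    {"inv_index": 7},
    {"inv_index": 1},
    {"inv_index": 42},
]

bad_rows = filter_violations(sample, "inv_index", lambda v: v == 1)
print(bad_rows)
print(precision(sample, "inv_index", lambda v: v == 1))
```

Any non-empty `bad_rows` list indicates rows that the filter should have excluded, regardless of how the data was generated.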

xiaofan-luan · 2024-05-07T02:09:45Z

Could you share your code and the results you get?

yanliang567 · 2024-05-07T03:36:48Z

I have reproduced the issue in house with the code above.

Index no_index: Total of 128 rows
Index no_index: Total of 128 correct rows
Index default_index: Total of 128 rows
Index default_index: Total of 49 correct rows
Index stl_index: Total of 128 rows
Index stl_index: Total of 128 correct rows
Index inv_index: Total of 128 rows
Index inv_index: Total of 49 correct rows
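For reference, the counts above can be turned into a precision per field. This is plain arithmetic on the numbers printed in this comment; the name `reported` is only illustrative.

```python
# (total rows returned, rows that satisfy the filter), copied from the output above.
reported = {
    "no_index": (128, 128),
    "default_index": (128, 49),
    "stl_index": (128, 128),
    "inv_index": (128, 49),
}

# Precision = correct rows / returned rows, per field.
precision_by_field = {
    field: correct / total for field, (total, correct) in reported.items()
}

for field, value in precision_by_field.items():
    print(f"{field}: {value:.1%}")
```

The 49 correct rows out of 128 come to roughly 38%, inside the 20-40% range given in the original report. The identical counts for `default_index` and `inv_index` are consistent with the reporter's suggestion that the default index behaves like `INVERTED`, though matching counts alone do not prove it.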

We can see that when filtering on the inverted-index field, the query returns some results that are not in the filter list, e.g.

longjiquan · 2024-05-08T08:51:02Z

Thanks for reporting the bug, @ghallsimpsons. It is already fixed in #32858.


ghallsimpsons · 2024-05-08T17:25:12Z

Very nice, thanks for the quick fix! I'll give it a go again when 2.4.2 is released.


ghallsimpsons added the kind/bug and needs-triage labels Apr 29, 2024

ghallsimpsons assigned yanliang567 Apr 29, 2024

sre-ci-robot assigned longjiquan Apr 30, 2024

yanliang567 added this to the 2.4.1 milestone Apr 30, 2024

yanliang567 added the triage/accepted label and removed the needs-triage label Apr 30, 2024

yanliang567 removed their assignment May 7, 2024

yanliang567 modified the milestones: 2.4.1, 2.4.2 May 7, 2024

longjiquan mentioned this issue May 8, 2024

fix: make sure inverted index has only one segment #32858

Merged

sre-ci-robot pushed a commit that referenced this issue May 8, 2024

fix: make sure inverted index has only one segment (#32858)

035a508

issue: #32717 --------- Signed-off-by: longjiquan <[email protected]>

ghallsimpsons closed this as completed May 8, 2024

longjiquan added a commit to longjiquan/milvus that referenced this issue May 9, 2024

fix: make sure inverted index has only one segment (milvus-io#32858)

ea64d36

issue: milvus-io#32717 --------- Signed-off-by: longjiquan <[email protected]>

longjiquan mentioned this issue May 9, 2024

fix: make sure inverted index has only one segment (#32858) #32877

Merged

sre-ci-robot pushed a commit that referenced this issue May 9, 2024

fix: make sure inverted index has only one segment (#32858) (#32877)

7a00bce

issue: #32717 pr: #32858 --------- Signed-off-by: longjiquan <[email protected]>
