[Bug]: Load collection is failing for collection with sparse and dense vectors in 2.4 #32757

Open

kdabbir opened this issue May 3, 2024 · 7 comments

kdabbir commented May 3, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.0-latest
- Deployment mode(standalone or cluster): Milvus cluster deployed in kubernetes
- MQ type(rocksmq, pulsar or kafka):  External Kafka (AWS MSK)
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.4
- CPU/Memory:
Worker nodes: data nodes - 15 CPU / 54 GB memory, index nodes - 15 CPU / 25 GB memory, query nodes - 14 CPU / 118 GB memory.
Referred to the Milvus sizing tool and scaled up to 8 index nodes, 8 data nodes, 6 proxies, and 25 query nodes.
Coordinator nodes: 8 CPU / 16 GB memory, based on c5a.2xlarge.

- Also increased etcd ETCD_MAX_TXN_OPS to 10000 to support more operations

Current Behavior

After inserting over 400 rows into a collection with sparse and dense vectors, the collection load was getting stuck and the hybrid search query was not working.
Below are the errors I observed in the data worker:

[2024/05/02 17:23:14.783 +00:00] [WARN] [querynodev2/services.go:311] ["failed to load growing segments"] [traceID=4176ad1fefe68560ce2d35431a7d1bca] [collectionID=449477877573946718] [channel=by-dev-milvus24tests-dml_0_449477877573946718v0] [currentNodeID=5] [error="At LoadSegment: std::exception"] [errorVerbose="At LoadSegment: std::exception\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).Load.func4\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment_loader.go:680\n | github.com/milvus-io/milvus/pkg/util/funcutil.ProcessFuncParallel.func3\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/funcutil/parallel.go:86\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) At LoadSegment\nWraps: (3) std::exception\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]

[2024/05/02 17:23:14.782 +00:00] [WARN] [delegator/delegator_data.go:359] ["failed to load growing segment"] [traceID=4176ad1fefe68560ce2d35431a7d1bca] [collectionID=449477877573946718] [channel=by-dev-milvus24tests-dml_0_449477877573946718v0] [replicaID=449477880396120068] [error="At LoadSegment: std::exception"] [errorVerbose="At LoadSegment: std::exception\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).Load.func4\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment_loader.go:680\n | github.com/milvus-io/milvus/pkg/util/funcutil.ProcessFuncParallel.func3\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/funcutil/parallel.go:86\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) At LoadSegment\nWraps: (3) std::exception\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]

[2024/05/02 17:23:14.776 +00:00] [WARN] [segments/segment_loader.go:612] ["release new segment created due to load failure"] [traceID=4176ad1fefe68560ce2d35431a7d1bca] [collectionID=449477877573946718] [segmentType=Growing] [requestSegments="[449477877574146799]"] [preparedSegments="[449477877574146799]"] [segmentID=449477877574146799] [error="At LoadSegment: std::exception"] [errorVerbose="At LoadSegment: std::exception\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).Load.func4\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment_loader.go:680\n | github.com/milvus-io/milvus/pkg/util/funcutil.ProcessFuncParallel.func3\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/funcutil/parallel.go:86\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) At LoadSegment\nWraps: (3) std::exception\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]

[2024/05/02 17:23:14.776 +00:00] [WARN] [segments/segment_loader.go:705] ["failed to load some segments"] [traceID=4176ad1fefe68560ce2d35431a7d1bca] [collectionID=449477877573946718] [segmentType=Growing] [requestSegments="[449477877574146799]"] [preparedSegments="[449477877574146799]"] [error="At LoadSegment: std::exception"] [errorVerbose="At LoadSegment: std::exception\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).Load.func4\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment_loader.go:680\n | github.com/milvus-io/milvus/pkg/util/funcutil.ProcessFuncParallel.func3\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/funcutil/parallel.go:86\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) At LoadSegment\nWraps: (3) std::exception\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]

[2024/05/02 17:23:14.776 +00:00] [DEBUG] [funcutil/parallel.go:54] [loadSegmentFunc] [total=1] ["time cost"=18.390614ms]

[2024/05/02 17:23:14.776 +00:00] [ERROR] [funcutil/parallel.go:88] [loadSegmentFunc] [error="At LoadSegment: std::exception"] [errorVerbose="At LoadSegment: std::exception\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).Load.func4\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment_loader.go:680\n | github.com/milvus-io/milvus/pkg/util/funcutil.ProcessFuncParallel.func3\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/funcutil/parallel.go:86\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) At LoadSegment\nWraps: (3) std::exception\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"] [idx=0] [stack="github.com/milvus-io/milvus/pkg/util/funcutil.ProcessFuncParallel.func3\n\t/go/src/github.com/milvus-io/milvus/pkg/util/funcutil/parallel.go:88"]

[2024/05/02 17:23:14.776 +00:00] [INFO] [segments/segment_loader.go:672] ["load segment done"] [traceID=4176ad1fefe68560ce2d35431a7d1bca] [collectionID=449477877573946718] [segmentType=Growing] [requestSegments="[449477877574146799]"] [preparedSegments="[449477877574146799]"] [partitionID=449477877573946719] [segmentID=449477877574146799] [segmentType=Legacy]

[2024/05/02 17:23:14.776 +00:00] [WARN] [segments/segment_loader.go:670] ["load segment failed when load data into memory"] [traceID=4176ad1fefe68560ce2d35431a7d1bca] [collectionID=449477877573946718] [segmentType=Growing] [requestSegments="[449477877574146799]"] [preparedSegments="[449477877574146799]"] [partitionID=449477877573946719] [segmentID=449477877574146799] [segmentType=Legacy] [error="At LoadSegment: std::exception"] [errorVerbose="At LoadSegment: std::exception\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).Load.func4\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment_loader.go:680\n | github.com/milvus-io/milvus/pkg/util/funcutil.ProcessFuncParallel.func3\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/funcutil/parallel.go:86\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) At LoadSegment\nWraps: (3) std::exception\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]

Expected Behavior

Collection load should succeed and hybrid search should work.

Steps To Reproduce

I have attached the test script I am using to reproduce this issue in the attachment section below; it can be used to reproduce the problem.
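For reference, here is a minimal sketch of the kind of schema and hybrid search the attached script exercises. This is not the attached script: the field names, dimension, and index parameters below are assumptions chosen for illustration only.

```python
# Minimal sketch, not the attached script: field names, dimension, and index
# parameters below are assumptions chosen for illustration only.
import random
from pymilvus import (
    connections, FieldSchema, CollectionSchema, DataType, Collection,
    AnnSearchRequest, RRFRanker,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
]
schema = CollectionSchema(fields, description="sparse + dense hybrid search test")
col = Collection("hybrid_demo", schema)

# Insert ~500 rows; a sparse vector is a dict of {dimension_index: value}.
dense_vectors = [[random.random() for _ in range(768)] for _ in range(500)]
sparse_vectors = [
    {random.randint(0, 30000): random.random() for _ in range(20)} for _ in range(500)
]
col.insert([dense_vectors, sparse_vectors])
col.flush()

col.create_index("dense_vector", {"index_type": "HNSW", "metric_type": "IP",
                                  "params": {"M": 16, "efConstruction": 200}})
col.create_index("sparse_vector", {"index_type": "SPARSE_INVERTED_INDEX",
                                   "metric_type": "IP"})
col.load()  # in the report above, this is the step that hangs / fails

# Hybrid search: one dense and one sparse request fused with RRF.
dense_req = AnnSearchRequest(data=[dense_vectors[0]], anns_field="dense_vector",
                             param={"metric_type": "IP"}, limit=5)
sparse_req = AnnSearchRequest(data=[sparse_vectors[0]], anns_field="sparse_vector",
                              param={"metric_type": "IP"}, limit=5)
results = col.hybrid_search([dense_req, sparse_req], rerank=RRFRanker(), limit=5)
print(results)
```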

Milvus Log

Attached an export of the Milvus logs in the zip below:
milvus-log-6.zip

Attached the test script in the zip below:
milvus_hybridsearch_test_script.py.zip

Anything else?

No response

kdabbir added the kind/bug and needs-triage labels on May 3, 2024

kdabbir commented May 3, 2024

Just to cross-check whether there is an issue with the Milvus cluster itself, I ran hello_milvus.py, which returns correct results and loads records correctly. This seems to be a bug in the sparse vector flow.

[Screenshot 2024-05-03 at 12:37:48 PM]

@yanliang567
Contributor

/assign @liliu-z
/unassign

sre-ci-robot assigned liliu-z and unassigned yanliang567 on May 4, 2024
yanliang567 added the triage/accepted label and removed the needs-triage label on May 4, 2024
yanliang567 added this to the 2.4.1 milestone on May 4, 2024

liliu-z commented May 6, 2024

/assign @zhengbuqian

@zhengbuqian
Collaborator

Hi @kdabbir, can you try running https://github.com/milvus-io/pymilvus/blob/master/examples/hello_sparse.py and https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py and see if any error occurs?

I tried to reproduce the error by running your script against my Milvus standalone deployment and got no error. In the query worker logs I see many entries like the following, which could be related:

I20240502 17:29:38.770249 128 Utils.cpp:825] [SERVER][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=perf2-uswest2-cdp1-dpc-milvus, object=milvus-storage/milvus2379test/insert_log/449477877573946718/449477877573946719/449477877574146799/0/449477877575149847]

@congqixia
Contributor

@kdabbir did your instance share the object storage with another Milvus instance?

@zhengbuqian
Collaborator

[Image: img_v3_02al_2232580e-8eb2-43f1-8308-8feadb6c544g]

To supplement @congqixia's point: the error log basically indicates that the segment object file is not found (code 404) in the S3 bucket, which causes the load to fail.
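As a hedged diagnostic sketch (not taken from the thread), one way to confirm the 404 independently of Milvus is to head the object from the failing log line directly. This assumes boto3 is installed and AWS credentials with access to the bucket are already configured; the bucket and key are copied from the error log above.

```python
# Hedged diagnostic sketch: check whether the object from the failing log line
# exists in S3. Assumes boto3 is installed and credentials are configured.
import boto3
from botocore.exceptions import ClientError

bucket = "perf2-uswest2-cdp1-dpc-milvus"  # bucket name from the error log above
key = ("milvus-storage/milvus2379test/insert_log/449477877573946718/"
       "449477877573946719/449477877574146799/0/449477877575149847")  # object key from the log

s3 = boto3.client("s3")
try:
    meta = s3.head_object(Bucket=bucket, Key=key)
    print("object exists, size:", meta["ContentLength"])
except ClientError as e:
    # A 404 here matches the GetObjectSize error the query node reports.
    print("head_object failed:", e.response["Error"]["Code"])
```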

yanliang567 modified the milestone from 2.4.1 to 2.4.2 on May 7, 2024

kdabbir commented May 8, 2024

Thanks for your inputs. I'll see if I can spin up a new cluster and test this functionality out. What is weird is that the standard operations were working.

The object store is shared, but I've specified a different root path. I need to check whether there is something wrong with the config.
