[Bug]: Milvus Cluster encounters rpc error: code = DeadlineExceeded desc = context deadline exceeded probably when loading partitions #32763

Open · ganderaj opened this issue May 5, 2024 · 34 comments
Labels: kind/bug (Issues or changes related to a bug), triage/accepted (Indicates an issue or PR is ready to be actively worked on)

@ganderaj commented May 5, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.3.1
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): Pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): v2.3.6 or above
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Our Milvus cluster, which runs on AWS EKS, has not received any changes, and neither has the client application. However, we have lately noticed a sporadic issue where an application trying to load a partition hangs and fails with the following error message:

ERROR:pymilvus.decorators:grpc RpcError: [load_partitions], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-04-29 14:43:46.221321', 'gRPC error': '2024-04-29 14:44:46.222777'}>
ERROR:app.milvus_helper:Error in get_context Milvus:  <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-04-29T14:44:46.222455187+00:00", grpc_status:4}"

We have also observed Milvus logging a similar error on its QueryCoord:

[ERROR] [querycoordv2/server.go:186] ["QueryCoord init failed"] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:485 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:499 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/rootcoord/client/client.go:120 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/rootcoord/client/client.go:196 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).DescribeCollection\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:76 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetCollectionSchema\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/collection_manager.go:133 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CollectionManager).Recover\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/server.go:311 github.com/milvus-io/milvus/internal/querycoordv2.(*Server).initMeta\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/server.go:229 github.com/milvus-io/milvus/internal/querycoordv2.(*Server).initQueryCoord\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/server.go:185 github.com/milvus-io/milvus/internal/querycoordv2.(*Server).Init.func1: attempt #0: rpc error: code = DeadlineExceeded desc = context deadline exceeded: context done during sleep after run#0: context deadline exceeded"] [stack="github.com/milvus-io/milvus/internal/querycoordv2.(*Server).Init.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/server.go:186\ngithub.com/milvus-io/milvus/internal/util/sessionutil.(*Session).ProcessActiveStandBy\n\t/go/src/github.com/milvus-io/milvus/internal/util/sessionutil/session_util.go:1031\ngithub.com/milvus-io/milvus/internal/querycoordv2.(*Server).Register\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/server.go:139\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).start\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:262\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:100\ngithub.com/milvus-io/milvus/cmd/components.(*QueryCoord).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/query_coord.go:53\ngithub.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:112"]

We have investigated Milvus resource utilization and found it is barely used (~5%), and every Pod / Deployment is in a healthy state. Although upgrading Milvus to one of the latest releases is on our roadmap, we would like to understand the root cause and how to ensure we don't encounter the same issue after the upgrade.

Expected Behavior

Loading a partition and performing queries should complete with minimal latency and without errors.

Steps To Reproduce

We are dealing with a Milvus database which comprises:
- Collections: 2144
- Partitions: 14643
- Vectors: 43713

Please note that the distribution of vectors and partitions across collections is not uniform, and it is our understanding that `partition.load(timeout=60)` is the call that leads to this error message (a minimal sketch of the call is shown below).
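For reference, the client-side call looks roughly like this — a minimal sketch assuming a standard PyMilvus setup; the endpoint, collection, and partition names are placeholders rather than our actual application values:

```python
import grpc
from pymilvus import connections, Collection, MilvusException

connections.connect(alias="default", host="localhost", port="19530")  # assumed endpoint

collection = Collection("my_collection")       # hypothetical collection
partition = collection.partition("tenant_42")  # hypothetical partition

try:
    # The 60s client-side timeout is what surfaces as StatusCode.DEADLINE_EXCEEDED
    # when the load is not acknowledged in time.
    partition.load(timeout=60)
except (MilvusException, grpc.RpcError) as e:
    # Depending on the pymilvus version, the deadline error may surface as a
    # MilvusException or as a raw grpc.RpcError.
    print(f"load_partitions failed: {e}")
```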

Milvus Log

The issue appears to be sporadic in nature, and at times the Milvus logs do not show any error while the client SDK still receives error messages. Attached is milvus-prod-logs.tar.gz, taken on May 05 and covering the past 2 days.

Anything else?

No response

@ganderaj added the kind/bug and needs-triage labels May 5, 2024
@yanliang567 (Contributor) commented:

/assign @congqixia
/unassign

@sre-ci-robot assigned congqixia and unassigned yanliang567 May 6, 2024
@yanliang567 added the triage/accepted label and removed the needs-triage label May 6, 2024
@yanliang567 added this to the 2.3.15 milestone May 6, 2024
@xiaofan-luan (Contributor) commented:

> We are dealing with a Milvus database which comprises:
>   • Collections: 2144
>   • Partitions: 14643
>   • Vectors: 43713

So you are saying the cluster has 2144 collections and 14643 partitions, but only ~43,000 vectors? Is there a reason you need to create so many collections instead of using the partition key feature to split the data?

@xiaofan-luan (Contributor) commented:

From the logs you provided, I don't see any DEADLINE_EXCEEDED entries; the only problem I see is too many collections and partitions.

@ganderaj (Author) commented May 6, 2024

> From the logs you provided, I don't see any DEADLINE_EXCEEDED entries; the only problem I see is too many collections and partitions.

Yes, as mentioned earlier, the issue is sporadic in nature, which has made it difficult to triage. Please find the logs from May 02, 0502-logs.zip, which include several DeadlineExceeded errors.

@ganderaj (Author) commented May 6, 2024

> We are dealing with a Milvus database which comprises:
>   • Collections: 2144
>   • Partitions: 14643
>   • Vectors: 43713
>
> So you are saying the cluster has 2144 collections and 14643 partitions, but only ~43,000 vectors? Is there a reason you need to create so many collections instead of using the partition key feature to split the data?

We understand this might not be a typical Milvus use case; however, per our application design we have to go with multiple collections and partitions for relatively few vectors. Please note that this is a production (PRD) application which had been working with no issues for at least 5 months.

@xiaofan-luan (Contributor) commented:

> We are dealing with a Milvus database which comprises:
>   • Collections: 2144
>   • Partitions: 14643
>   • Vectors: 43713
>
> So you are saying the cluster has 2144 collections and 14643 partitions, but only ~43,000 vectors? Is there a reason you need to create so many collections instead of using the partition key feature to split the data?
>
> We understand this might not be a typical Milvus use case; however, per our application design we have to go with multiple collections and partitions for relatively few vectors. Please note that this is a production (PRD) application which had been working with no issues for at least 5 months.

You can try putting the data in a single collection, using the partition key feature, and filtering on the data; see https://milvus.io/docs/multi_tenancy.md (a sketch of this approach follows below).

Filtering helps reduce the amount of meta information; the collection and partition counts cannot grow without limit.

We recommend having fewer than 10000 collections and fewer than 65536 partitions × collections.
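Roughly, the partition-key layout would look like the sketch below — a minimal example assuming PyMilvus; the field names, dimension, and partition count are illustrative placeholders, not a prescribed schema:

```python
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")  # assumed endpoint

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # Tenant identifier used as the partition key instead of per-tenant collections.
    FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64,
                is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields)

# One collection for all tenants; Milvus hashes tenant_id into num_partitions buckets.
collection = Collection("tenants", schema=schema, num_partitions=64)

# Queries then filter on the key instead of loading per-tenant partitions, e.g.:
#   collection.search(..., expr='tenant_id == "acme"')
```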

@ganderaj (Author) commented May 6, 2024

We appreciate you sharing your recommendations. We considered the Milvus limits while designing our application to ensure we do not hit any hard limits.

Can you please comment on what could have led our platform to encounter these DeadlineExceeded errors while our cluster reports indicate every K8s asset is healthy? Can you also explain what this error message actually means and how we can avoid it in the future?

@ganderaj (Author) commented May 6, 2024

> We recommend having fewer than 10000 collections and fewer than 65536 partitions × collections.

By this statement, you are recommending that we keep the product of partitions and collections under 65536, and collections alone under 10K. Can you please confirm?

@xiaofan-luan (Contributor) commented:

Yes, and the best practice is to stay well below even that.

@xiaofan-luan (Contributor) commented:

We are working on this; we have a 10000-collections campaign underway to improve performance.

@ganderaj (Author) commented May 6, 2024

Any insights on the error message and what's causing it?

It appears Milvus is indeed not working as expected, and we would like your support in finding the root cause. We have received a bunch of these errors while attempting `partition.load(timeout=60)`, while a few calls actually succeeded. Included below is a snippet from our Grafana dashboard indicating that a major proportion of load requests ended up failing.

[Grafana dashboard screenshot: most load requests failing]

@xiaofan-luan (Contributor) commented:

Did the load time out?

@ganderaj (Author) commented May 6, 2024

Yes, our clients using the PyMilvus SDK received the error message below:

grpc RpcError: [load_partitions], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-05-06 11:38:10.918914', 'gRPC error': '2024-05-06 11:39:10.932805'}>

However, unlike in the past, Milvus didn't log any errors today; you can find the logs at 0506-milvus-prod-logs.tar.gz, taken for the past 12 hours.

@ganderaj (Author) commented May 6, 2024

We have been receiving similar error messages while attempting to load partitions; however, the occurrence of this error is sporadic.

```json
{
    "log": "2024-05-06T19:39:47.138200174Z stdout F [2024/05/06 19:39:47.137 +00:00] [ERROR] [retry/retry.go:42] [\"retry func failed\"] [traceID=8fa85e754e7374538c2675ddd045bf6d] [\"retry time\"=0] [error=\"rpc error: code = Canceled desc = context canceled\"] [stack=\"github.com/milvus-io/milvus/pkg/util/retry.Do\\n\\t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:42\\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call\\n\\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:405\\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\\n\\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:483\\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\\n\\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:499\\ngithub.com/milvus-io/milvus/internal/distributed/querycoord/client.wrapGrpcCall[...]\\n\\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:109\\ngithub.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).LoadPartitions\\n\\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:197\\ngithub.com/milvus-io/milvus/internal/proxy.(*loadPartitionsTask).Execute\\n\\t/go/src/github.com/milvus-io/milvus/internal/proxy/task.go:1694\\ngithub.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\\n\\t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:457\\ngithub.com/milvus-io/milvus/internal/proxy.(*taskScheduler).definitionLoop\\n\\t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:484\"]",
    "kubernetes": {
        "pod_name": "milvus-proxy-85bd7dbd76-79pmd",
        "namespace_name": "milvus",
        "pod_id": "433a2fb8-f5e0-45ff-824b-f1f93815c005",
        "host": "ip-10-246-13-233.xxx.com",
        "container_name": "proxy",
        "docker_id": "458179f521efb951943b782facd495d1ccb45148bdf57d714db8659d6aec6e37",
        "container_hash": "docker.io/milvusdb/milvus@sha256:9f37c7100c44be8e7419c55f285c2963ee4d25c71951a2325614a4565a03d7e6",
        "container_image": "docker.io/milvusdb/milvus:v2.3.1"
    }
}
```

@ganderaj (Author) commented May 7, 2024

FYI @xiaofan-luan

@xiaofan-luan (Contributor) commented:

We need the full logs of the QueryNode and QueryCoord to understand what is happening.

This is how the logs should be collected:
https://github.com/milvus-io/milvus/blob/5038036ece044fe605de9323b0c9967cb4f20c57/deployments/export-log/README.md?plain=1

Also, I would say it is very risky to maintain this many collections and partitions in one cluster; it is not recommended.

@xiaofan-luan (Contributor) commented:

If you check the QueryNode warn logs, you may get some idea of why the load fails.

@ganderaj (Author) commented May 8, 2024

Hi @xiaofan-luan

You may find the Milvus logs in 0508-milvus-logs.tar.gz, taken today with the since argument set to all. Please refer to the QueryCoord entries in pod_log and pod_log_previous to find the deadline exceeded errors, which occur when we try to load a partition.

FYI, all the logs that I have shared earlier in the thread were captured using export-milvus-log.sh

@ganderaj (Author) commented May 9, 2024

@xiaofan-luan

We are seeing a lot of error messages on milvus-pulsar-recovery. We just wanted to get your thoughts on the error below:

14:30:00.040 [BookKeeperClientWorker-OrderedExecutor-1-0] ERROR org.apache.bookkeeper.proto.BookkeeperInternalCallbacks - Error in multi callback : -1
14:30:00.041 [BookKeeperClientWorker-OrderedExecutor-1-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Read for failed on bookie milvus-pulsar-bookie-1.milvus-pulsar-bookie.milvus.svc.cluster.local:3181 code EIO
14:30:00.041 [BookKeeperClientWorker-OrderedExecutor-1-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Error while reading ledger while reading L838481 E119021 from bookie: milvus-pulsar-bookie-1.milvus-pulsar-bookie.milvus.svc.cluster.local:3181
14:30:00.041 [BookKeeperClientWorker-OrderedExecutor-1-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L838481 E119021-E119021, Sent to [milvus-pulsar-bookie-1.milvus-pulsar-bookie.milvus.svc.cluster.local:3181, milvus-pulsar-bookie-3.milvus-pulsar-bookie.milvus.svc.cluster.local:3181], Heard from [] : bitset = {}, Error = 'Error while reading ledger'. First unread entry is (-1, rc = null)
14:30:00.041 [BookKeeperClientWorker-OrderedExecutor-1-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error reading ledger entry: 119021
org.apache.bookkeeper.client.BKException$BKReadException: Error while reading ledger
	at org.apache.bookkeeper.client.BKException.create(BKException.java:62) ~[org.apache.bookkeeper-bookkeeper-server-4.14.3.jar:4.14.3]
	at org.apache.bookkeeper.client.LedgerFragmentReplicator$3.readComplete(LedgerFragmentReplicator.java:364) [org.apache.bookkeeper-bookkeeper-server-4.14.3.jar:4.14.3]
	at org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:820) [org.apache.bookkeeper-bookkeeper-server-4.14.3.jar:4.14.3]
	at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) [org.apache.bookkeeper-bookkeeper-common-4.14.3.jar:4.14.3]
	at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) [org.apache.bookkeeper-bookkeeper-common-4.14.3.jar:4.14.3]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) [?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) [?:?]
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
	at java.lang.Thread.run(Thread.java:829) [?:?]
14:30:00.041 [BookKeeperClientWorker-OrderedExecutor-1-0] ERROR

Considering that error code EIO indicates an input/output error, I investigated possible disk / file system failures, but everything seems to be fine. I used the bookkeeper admin tool from the pulsar-bookie pods to validate whether the ledgers indeed exist. They show up in listledgers, but they throw the same error message when I attempt to read them with readledger.

@xiaofan-luan (Contributor) commented:

  1. Maybe restart the bookies and check whether the service recovers?
  2. Did you get a chance to check the IO usage for Pulsar?

@xiaofan-luan (Contributor) commented:

i.e., IO util and disk util.

@ganderaj (Author) commented May 9, 2024

I do not see any errors logged on the pulsar-bookie pods, only on pulsar-recovery. I have scaled out and up-sized pulsar-recovery and restarted both the bookies and recovery. No help.

We are on AWS; the EBS Average Queue Length has been barely 0.2-0.5 over the past 2 weeks. Can you inspect the error code and confirm whether we are dealing with a disk IO issue or something else?

@ganderaj (Author) commented May 9, 2024

I believe these errors are linked to the degraded performance of our application. It takes around 30-40s, and sometimes over 2 minutes, to flush a partition (a minimal client-side timing sketch follows below).
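For context, this is roughly how the flush latency can be measured from the client side — a minimal sketch with a placeholder collection name, capturing only client-observed time:

```python
import time
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # assumed endpoint
collection = Collection("my_collection")             # hypothetical collection

start = time.monotonic()
collection.flush(timeout=300)  # generous timeout so even slow flushes complete
print(f"flush took {time.monotonic() - start:.1f}s")
```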

@ganderaj (Author) commented May 9, 2024

@xiaofan-luan

Please refer to the ERROR messages in STDOUT for the listledgers and readledger outputs from one of the bookie pods, attached below.

bookkeeper_shell.txt

@xiaofan-luan (Contributor) commented:

From the logs you provided, everything on the Milvus side seems to be working as expected.
The reason flush is slow is likely related to having too many partitions and segments.

You can verify this by profiling the coordinator node with pprof and seeing how much CPU it uses (a hedged sketch of pulling a profile follows below).
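For example, a CPU profile could be pulled like this — a sketch assuming the coordinator exposes Go's net/http/pprof handlers on its HTTP metrics port (commonly 9091); the host, port, deployment name, and output path are placeholders to adapt to your cluster:

```python
import urllib.request

# Assumed endpoint: adjust to your deployment, e.g. after
#   kubectl port-forward deploy/milvus-datacoord 9091:9091
PPROF_URL = "http://localhost:9091/debug/pprof/profile?seconds=30"

# Download a 30-second CPU profile and save it locally.
with urllib.request.urlopen(PPROF_URL, timeout=90) as resp:
    with open("datacoord_cpu.pprof", "wb") as out:
        out.write(resp.read())

# Inspect locally with the Go toolchain, e.g.:
#   go tool pprof -http=:8080 datacoord_cpu.pprof
```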

@xiaofan-luan (Contributor) commented:

We fixed many of these issues in 2.4.1, e.g. #32831.

@ganderaj (Author) commented May 9, 2024

I'm guessing you are referring to DataCoord, correct? I haven't used pprof, but I gathered metrics from CloudWatch and Prometheus, which indicate we are using a good amount of CPU and memory across most of our coordinators and nodes, yet nowhere near what's available to them. We have a beefy environment, with EKS nodes on r6i.4xlarge.

@ganderaj (Author) commented May 9, 2024

@xiaofan-luan

When I initially opened this discussion, our application was on Milvus v2.3.1. Given the performance issues, we have since moved our application to our passive Milvus deployment on v2.3.10. Doesn't 2.3.10 have this bugfix / new design?

Fundamentally, we are trying to understand what is wrong. That will help us ensure we don't hit a similar issue on v2.4.1.

@xiaofan-luan (Contributor) commented:

The query and data coordinators can consume a lot of CPU.
2.3.14 already made some progress on reducing CPU usage, and 2.4.1 has more optimizations.

What is the current performance problem you are facing?

@ganderaj (Author) commented May 9, 2024

@xiaofan-luan

Flushing a partition with fewer than 15 vectors takes around 40s, and sometimes 270s. CPU and memory utilization for DataCoord / data nodes and the bookies indicates they are nowhere near their provisioned capacity.

Perhaps you could review our configuration for the bookie pods and JVM and share an optimal config. I am focused on the bookies since they are the only components currently throwing error messages, and the flush operation involves the bookies. Let me know if you think otherwise.

Your help is greatly appreciated.

@ganderaj (Author) commented:

@xiaofan-luan

I have deployed Milvus v2.4.1 and am currently in the process of migrating our PRD workloads to this new environment. I'll keep you posted on how it goes.

Thanks for your support.

@moinsaj commented May 10, 2024

> We are dealing with a Milvus database which comprises:
>   • Collections: 2144
>   • Partitions: 14643
>   • Vectors: 43713
>
> So you are saying the cluster has 2144 collections and 14643 partitions, but only ~43,000 vectors? Is there a reason you need to create so many collections instead of using the partition key feature to split the data?
>
> We understand this might not be a typical Milvus use case; however, per our application design we have to go with multiple collections and partitions for relatively few vectors. Please note that this is a production (PRD) application which had been working with no issues for at least 5 months.
>
> You can try putting the data in a single collection, using the partition key feature, and filtering on the data; see https://milvus.io/docs/multi_tenancy.md
>
> Filtering helps reduce the amount of meta information; the collection and partition counts cannot grow without limit.
>
> We recommend having fewer than 10000 collections and fewer than 65536 partitions × collections.

@xiaofan-luan When you say fewer than 65536 partitions × collections, do you mean physical partitions, or does that apply to partition-key-based storage too?

@ganderaj (Author) commented:

From https://milvus.io/docs/multi_tenancy.md:

> Once you set the partition key for a collection, you cannot bulk-insert entities into the collection.

@xiaofan-luan / @moinsaj

Considering the requirements of our CVP stack, we are left with the "partition-key-based" approach, which supports our requirements for the maximum number of tenants.

We have learned that bulk-inserting entities is not supported for collections with a partition key enabled. We rely on milvus-backup to back up our databases for disaster recovery purposes, and I believe the backup tool uses bulk insert for the restore operation.

Could you please comment on whether the backup tool would work if we back up/restore a collection with a partition key enabled? Please advise on any alternatives if it doesn't.

@yanliang567 (Contributor) commented:

@ganderaj In the latest Milvus releases, v2.3.15 or v2.4.x, you can bulk insert into a collection with partition key enabled (a hedged sketch follows below). We will update the doc https://milvus.io/docs/multi_tenancy.md immediately. BTW, you can also do backup and restore with the backup tool with partition key enabled in the latest release.
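For example, on a recent release a bulk insert into a partition-key collection would look roughly like the sketch below, using the standard pymilvus bulk-insert utility; the endpoint, collection name, and file path are placeholders, not values from this thread:

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # assumed endpoint

# With a partition key, rows are routed by the key field's value,
# so no partition_name is passed here.
task_id = utility.do_bulk_insert(
    collection_name="cvp_documents",            # hypothetical collection
    files=["backups/cvp_documents/data.json"],  # hypothetical file in the configured object storage
)
print(utility.get_bulk_insert_state(task_id))   # poll the import task status
```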

@yanliang567 modified the milestone from 2.3.15 to 2.3.16 May 16, 2024