[Bug]: Milvus Cluster encounters rpc error: code = DeadlineExceeded desc = context deadline exceeded probably when loading partitions #32763
Comments
/assign @congqixia
We are dealing with a Milvus database which consists of:
So you are saying the cluster has 2144 collections and 14643 partitions, but only 43000 vectors?
From the log you offered, I don't see any DEADLINE_EXCEEDED entries. The only problem I can see is too many collections and partitions.
Yes, as mentioned earlier, the issue is sporadic in nature and hence difficult to triage. Please find logs from May-02, 0502-logs.zip, which includes several
We understand this might not be a typical use-case of Milvus, however, per our application design we have to go with multiple collections and partitions for relatively few vectors. Please note that this is a production (PRD) application which has been working without issues for at least 5 months.
You can try to put the data in a single collection, use the partition key feature, and filter on the data. Filtering helps to reduce meta information. The number of collections and partitions cannot be increased without limit. We recommend keeping fewer than 10000 collections and fewer than 65536 partitions * collections.
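For illustration, here is a minimal PyMilvus sketch of the single-collection, partition-key design being suggested. The collection name, field names, dimension, and index parameters are placeholders, not values from this issue:

```python
# Sketch of a single collection using a partition key for multi-tenancy.
# "multi_tenant_docs", "tenant_id", dim=128 and the index params are assumptions.
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # is_partition_key=True makes Milvus hash rows into internal partitions by
    # this field, instead of one physical partition per tenant.
    FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64,
                is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
]
schema = CollectionSchema(fields, description="single collection, partition-key tenancy")
collection = Collection("multi_tenant_docs", schema)

collection.create_index("embedding", {"index_type": "HNSW",
                                      "metric_type": "L2",
                                      "params": {"M": 16, "efConstruction": 200}})
collection.load()

# Filtering on the partition-key field scopes each tenant's queries without
# creating thousands of physical partitions.
results = collection.search(
    data=[[0.0] * 128],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"ef": 64}},
    limit=10,
    expr='tenant_id == "tenant-42"',
)
```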
We appreciate you sharing your recommendations. We considered Milvus limits while designing our application to ensure we do not hit any hard limits. Can you please comment on what could have possibly led our platform to encounter such
By this statement, you are recommending that we keep the product of total partitions and collections under 65536, while keeping collections alone under 10K. Can you please confirm?
Yes, and the best practice is to stay well below those numbers.
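For reference, a quick way to check where a cluster stands relative to those numbers. This is a sketch, assuming default connection settings and treating the figures above as the thresholds:

```python
# Rough audit of collection/partition counts against the recommended limits
# discussed above (<10000 collections, <65536 collections * partitions).
from pymilvus import connections, utility, Collection

connections.connect(host="localhost", port="19530")

collections = utility.list_collections()
total_partitions = sum(len(Collection(name).partitions) for name in collections)

print(f"collections: {len(collections)}")
print(f"total partitions across all collections: {total_partitions}")

if len(collections) >= 10000 or total_partitions >= 65536:
    print("WARNING: above the recommended limits discussed in this thread")
```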
We are working on this: there is a 10000-collections campaign underway to improve performance.
Any insights on the error message and what's causing it? It appears Milvus is indeed not working as expected, and we would like to seek your support in learning the root cause. We have received a bunch of these errors while attempting to
Didn't the load time out?
Yes, our clients using the PyMilvus SDK received the error message below:
However, unlike in the past, Milvus didn't log any errors today; you may find the logs at 0506-milvus-prod-logs.tar.gz, taken for the past 12 hours.
We have been receiving similar error messages while attempting to load partitions. However, the occurrence of this error message is sporadic in nature.
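For illustration, the failing client call is roughly of this shape. This is a sketch with placeholder collection/partition names and timeouts, not the actual production code:

```python
# Sketch of a partition load that can surface "rpc error: code = DeadlineExceeded"
# on the client side. Names and timeout values are placeholders.
from pymilvus import connections, Collection
from pymilvus.exceptions import MilvusException

connections.connect(host="localhost", port="19530")
collection = Collection("my_collection")

try:
    # timeout is in seconds; if the server cannot finish loading in time,
    # the SDK surfaces the DeadlineExceeded error quoted in the issue title.
    collection.load(partition_names=["partition_2024_05"], timeout=60)
except MilvusException as exc:
    print(f"load failed: {exc}")
    # one retry with a longer deadline, for illustration only
    collection.load(partition_names=["partition_2024_05"], timeout=300)
```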
FYI @xiaofan-luan
We need the full logs of the querynode and querycoord to understand what happened. This is how the logs should be collected. I would say it's very risky to maintain too many collections and partitions in one cluster, and this is not recommended.
If you check the querynode warn logs, you may get some idea of why the load fails.
You may find 0508-milvus-logs.tar.gz, taken today with the since argument set to
FYI, all the logs that I have shared earlier in the thread were captured using export-milvus-log.sh
We are seeing a lot of error messages on
Considering Error Code: EIO is an indicator of input/output errors, I have investigated disk / file system failures, but everything seems to be alright. I have used ioutil and diskutil.
I do not see any errors logged on the pulsar-bookie pods, only on pulsar-recovery. I have scaled up and up-sized pulsar-recovery and restarted both the bookies and recovery. No help. We are on AWS; the EBS Average Queue Length has been barely 0.2-0.5 over the past 2 weeks. Can you inspect the error code and confirm whether we are dealing with a disk IO issue or something else?
I believe these errors are linked to the degraded performance of our application. It takes around 30-40s, and sometimes over 2 minutes, to flush a partition.
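To quantify this from the client side, a minimal timing sketch, assuming a PyMilvus connection and a placeholder collection name:

```python
# Measure how long a flush takes as seen by the client.
import time
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("my_collection")  # placeholder name

start = time.monotonic()
collection.flush(timeout=300)  # flush() blocks until segments are sealed and persisted
print(f"flush took {time.monotonic() - start:.1f}s")
```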
Please refer to the ERROR messages in STDOUT for
From the log you offered, everything on Milvus seems to be working as expected. You can verify this by running pprof against the coordinator node and checking how much CPU it uses.
We fixed many of these issues in 2.4.1, e.g. #32831.
I'm guessing you are referring to DataCoord, is that right? I haven't used pprof, but I gathered metrics from CloudWatch and Prometheus which indicate we are using a good amount of CPU and memory across most of our xxCoord and Node pods, yet nowhere near what's available to them. We have a beefy environment with EKS nodes on
When I initially opened this discussion, our application was on Milvus v2.3.1. Given the performance issues, we have moved our application to our passive Milvus deployment on v2.3.10. Doesn't 2.3.10 have this bugfix / new design? Fundamentally, we are trying to understand what's wrong; that will help us ensure we don't hit a similar issue on v2.4.1.
The query and data coordinators can spend tons of CPU. What is the current performance problem you are facing?
Flushing a partition with fewer than 15 vectors takes around 40s and sometimes 270s. CPU and memory utilization for DataCoord / Nodes and the bookies indicate they are well below what is provisioned. Perhaps you could review the configuration of the bookies' Pods and JVM and share an optimal config. I am focused on the bookies since they are the only components currently throwing error messages, and the flush operation is related to them. Let me know if you think otherwise. Your help is greatly appreciated.
I have deployed Milvus v2.4.1 and am currently in the process of migrating our PRD workloads to this new environment. I'll keep you posted on how it goes. Thanks for your support.
@xiaofan-luan When you say less than 65536 partition * collections, do you mean physical partitions, or does it apply to partition-key-based storage too?
Considering the requirements of our CVP stack, we are left with the "partition-key-based" approach, which supports our requirements for the maximum number of tenants. We have learned that bulk-inserting entities is not supported for collections with a partition key enabled. We rely on Milvus-backup to back up our databases for disaster recovery purposes. I believe the backup tool makes use of bulk insert for the
Could you please comment on whether the backup tool would work should we attempt to back up/restore a collection with a partition key enabled? Please advise on any alternatives if it doesn't.
@ganderaj In the latest Milvus releases, v2.3.15 or v2.4.x, you can bulk insert into a collection with partition key enabled. We will update the doc https://milvus.io/docs/multi_tenancy.md immediately. BTW, you can also do backup and restore with the backup tool with partition key enabled in the latest release.
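For reference, a hedged sketch of what a bulk insert into a partition-key collection looks like on one of those releases; the collection name and file path are placeholders:

```python
# Sketch of a bulk insert into a partition-key collection (assumes Milvus
# v2.3.15+/v2.4.x as noted above). With a partition key enabled, no
# partition_name is passed; Milvus routes rows by the partition-key field.
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")

task_id = utility.do_bulk_insert(
    collection_name="multi_tenant_docs",   # placeholder collection
    files=["bulk/data.json"],               # object-storage path visible to Milvus
)
print(utility.get_bulk_insert_state(task_id))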
Is there an existing issue for this?
Environment
Current Behavior
Our Milvus cluster, which is built on AWS EKS, did not receive any changes, and neither did the client application. However, lately we have noticed a sporadic issue where the application trying to load a partition hangs with an error message. Milvus has been observed to log a similar error on its QueryCoord. We have investigated the resource utilization of Milvus and found it is barely used (~5%), and every Pod / Deployment is in a healthy status. Though it is on our roadmap to upgrade Milvus to one of the latest releases, we would like to understand the root cause and how we can ensure we don't encounter the same issue after the upgrade.
Expected Behavior
Loading a partition and performing queries would go through with minimal latency and, of course, no issues.
Steps To Reproduce
Milvus Log
The issue appears to be sporadic in nature; at times the Milvus logs do not indicate any error, but the client SDK receives error messages. Attached is milvus-prod-logs.tar.gz, taken on May-05 for the past 2 days.
Anything else?
No response