[Bug]: Milvus standalone random crash with 100% CPU utilisation #32382
Comments
@mihailyanchev the logs attached above look fine. Based on your description, I'd suggest allocating more resources to the Milvus container, since you run frequent delete and insert (upsert) operations against Milvus. /assign @mihailyanchev
Could you try upgrading to 2.3.13 and see whether it helps? Also, did you see an etcd session timeout error or anything similar?
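For a log too large to open in an editor, one way to check for such errors is to stream it line by line. This is a minimal sketch, not a Milvus tool: the function names `scan_log` and `scan_file` are hypothetical, and the default patterns (`[ERROR]` and "session ... timeout") are assumptions about what the log lines contain, so adjust them to the actual log format.

```python
import re
from collections import deque

# Assumed patterns; change these to match what your log actually contains.
DEFAULT_PATTERNS = (r"\[ERROR\]", r"session.*timeout")


def scan_log(lines, patterns=DEFAULT_PATTERNS, tail=5):
    """Stream over log lines, collecting matches and the last `tail` lines.

    Returns (matches, last_lines), where matches is a list of
    (line_number, line) tuples with 1-based line numbers.
    """
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    matches = []
    last_lines = deque(maxlen=tail)  # keeps only the most recent lines
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        last_lines.append(line)
        if regex.search(line):
            matches.append((lineno, line))
    return matches, list(last_lines)


def scan_file(path, **kwargs):
    """Apply scan_log to a file without loading it into memory."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return scan_log(fh, **kwargs)
```

Calling `scan_file("milvus-standalone.log", tail=50)` (the file name is a placeholder) would return every matching line together with the final 50 lines before the crash, which is usually enough to attach to an issue.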
@yanliang567 Thank you for the quick response. Sorry, it took me some time to get the birdwatcher backup. See attached. @xiaofan-luan there were no errors related to etcd session timeouts. The only errors in the whole milvus-standalone log appeared right after the container started:
However, these appeared only a couple of times and this was 2 months ago when the container was originally started. There have been no errors since. Otherwise, from your comments I have the following takeaway:
Let me know if you have any other ideas and thank you!
@mihailyanchev the meta in the etcd backup looks OK. One quick question: is the backup from Milvus 2.3.4 or from Milvus 2.3.13 (after upgrading)? Also, was this Milvus instance upgraded from 2.2.x? I'm asking because I saw two different index versions in the meta.
Please keep us posted as you upgrade to Milvus 2.3.13 and adjust the resources. Thanks.
I have not performed the upgrade to 2.3.13 yet, so the current version is still 2.3.4. However, we did upgrade from 2.2.x around 2 months ago, and I followed this guide. I will let you know how my experimentation goes. Thanks again for your help!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is there an existing issue for this?
Environment
Current Behavior
We experienced two random shutdowns of the milvus-standalone container during periods of heavier load on Milvus (mainly insert and delete operations). There are no errors in the logs immediately before the shutdown. We observe 100% CPU utilisation right before the crash, and the first time it happened the whole machine crashed too and restarted automatically. The minio and etcd containers remain alive and running.
We have around 2-4 collections with no more than 200-300K records in each.
The first time this happened, it was impossible to return milvus-standalone to normal. Every time we restarted the container, it went back to 100% CPU utilisation and crashed again. After several restarts it managed to remain stable, but it never managed to load its collections, so we had to rebuild everything from scratch.
I am wondering whether the resources are too limited for our setup and whether we have to either allocate more compute resources or move to the distributed mode.
Expected Behavior
I would expect some error log as to what is happening or some graceful way to let us know it needs more resources.
Steps To Reproduce
Milvus Log
The log for this was 1.7 GB, so I am just posting the very beginning and the very end before the crash.
The beginning:
The end:
Anything else?
No response