Pods failing their health check after upgrading to operator 1.1.2 #173
Hi @denniskorbginski, I couldn't reproduce it (i.e. the UDP port opening) with the config you shared. I think some monitoring/load-balancer setup in your environment (targeting the pods) opens the port. It may have started failing after you switched to 1.1.2 because 6379 was the last entry before (and hence the healthcheck script didn't fail). But I agree with your finding. The best approach here would be for the operator to set the env when creating the statefulset, so that you don't have to set it explicitly. Will fix, thanks!
It seems the port was hardcoded before the v1.16 release and hence didn't cause the issue - dragonflydb/dragonfly#2841 (comment). (v1.1.2 uses dragonfly v1.16 by default)
Hey, with the port in question being 8125, I agree it's very likely that this is caused by a monitoring agent running in my cluster. Maybe I was just lucky with the order of entries returned by netstat before the update. Edit: your other response just came in as I was submitting this comment 😅 great, this explains it.
Background info for context
Earlier today, I set up the dragonfly operator based on this manifest, which deployed version 1.1.1. I applied the manifest below to run dragonfly, which is based on the sample linked in the docs. I added the `proactor_threads` argument to prevent the containers from exiting with the error message `There are 4 threads, so 1.00GiB are required. Exiting...` - not sure if that's in any way related to my issue, but since it's the only change I made to the sample, I thought it was worth mentioning. With this setup, the pods were running fine, at least for an hour or two, before I discovered that you had pushed a new version of the operator manifest.

My actual issue
After updating the operator with the new manifest, it replaced the dragonfly pods, which passed their health checks for a moment but then started failing them. Checking the healthcheck script, I saw that it tries to autodetect the correct port. I ran netstat in the container: at first it shows the expected ports and the healthcheck succeeds. After a few seconds, when the healthcheck begins to fail, there is an additional entry listed.
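The failure mode described here can be sketched roughly as follows. This is a hypothetical illustration, not the operator's actual healthcheck script; the netstat output is hardcoded so the snippet is self-contained, and the `sample_netstat`/`detect_port` names are made up for this example:

```shell
#!/bin/sh
# Simulated netstat output: the expected Redis-protocol listener plus an
# unrelated socket (e.g. a StatsD agent on 8125) that appeared later.
sample_netstat() {
cat <<'EOF'
tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN
udp        0      0 0.0.0.0:8125            0.0.0.0:*
EOF
}

# Naive autodetection: take the local-address port of the first socket line.
# If an unrelated socket happens to sort first, the probe targets the wrong
# port (8125 instead of 6379) and the healthcheck fails.
detect_port() {
  sample_netstat | head -n 1 | awk '{ split($4, a, ":"); print a[length(a)] }'
}

detect_port
```

With the tcp entry first this prints 6379, but nothing guarantees that ordering in a real container, which is why pinning the port explicitly is more robust than autodetection.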
The healthcheck then tries to run against port 8125 and fails. I'm not sure if this is specific to my setup or a general problem. It was easy enough to work around the issue by setting the `HEALTHCHECK_PORT` env var to 6379.

Even if this is not just me, this hardly feels like a bug in the operator, but maybe the healthcheck script could be improved to handle this case better? Or should the sample manifest and the docs be updated to include the `HEALTHCHECK_PORT` env var?
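The workaround mentioned above could look roughly like this in the Dragonfly custom resource. A minimal sketch, assuming the CRD exposes an `env` field that is passed through to the dragonfly container (field names and resource name are illustrative; verify against your operator version's CRD):

```yaml
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: dragonfly-sample   # hypothetical resource name
spec:
  env:
    # Pin the healthcheck to the Redis port so an unrelated socket
    # (e.g. a StatsD agent on 8125) cannot win the autodetection.
    - name: HEALTHCHECK_PORT
      value: "6379"
```

Setting the port explicitly sidesteps the ordering-dependent autodetection entirely, which matches the fix the maintainer proposed (having the operator inject this env var when creating the statefulset).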