
[Bug] Log message when trying to roll pending pod is misleading #9427

Open
scholzj opened this issue Dec 4, 2023 · 1 comment
scholzj commented Dec 4, 2023

The KafkaRoller detects stuck pods while rolling the Kafka cluster and does not seem to wait for them to become ready. This results in the following messages in the log:

2023-12-04 09:09:33 INFO  ClusterOperator:142 - Triggering periodic reconciliation for namespace myproject
2023-12-04 09:09:33 INFO  AbstractOperator:265 - Reconciliation #19(timer) Kafka(myproject/my-cluster): Kafka my-cluster will be checked for creation or modification
2023-12-04 09:09:33 INFO  KafkaRoller:382 - Reconciliation #19(timer) Kafka(myproject/my-cluster): Could not verify pod my-cluster-controllers-2/2 is up-to-date, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Pod is unschedulable or is not starting
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.checkIfRestartOrReconfigureRequired(KafkaRoller.java:598) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:462) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$7(KafkaRoller.java:376) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2023-12-04 09:09:33 ERROR AbstractOperator:284 - Reconciliation #19(timer) Kafka(myproject/my-cluster): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Pod is unschedulable or is not starting
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.checkIfRestartOrReconfigureRequired(KafkaRoller.java:598) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:462) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$7(KafkaRoller.java:376) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2023-12-04 09:09:33 WARN  AbstractOperator:557 - Reconciliation #19(timer) Kafka(myproject/my-cluster): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Pod is unschedulable or is not starting
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.checkIfRestartOrReconfigureRequired(KafkaRoller.java:598) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:462) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$7(KafkaRoller.java:376) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]

I'm not sure whether skipping the wait for readiness is intentional; it might have its reasons (and in any case, the next periodic reconciliation will check again within a few minutes at the latest, so it is not a problem per se). But if you check the timestamps, it is clear that this message is misleading:

Could not verify pod my-cluster-controllers-2/2 is up-to-date, giving up after 10 attempts. Total delay between attempts 127750ms

It may well have tried something 10 times, but it certainly did not wait for 127750ms: the whole reconciliation ran from start to end within one second. So we should fix the message to avoid misleading people who are analyzing it.
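Incidentally, 127750 ms is exactly the sum of an exponential backoff schedule that starts at 250 ms and doubles across the 9 gaps between 10 attempts: 250 × (2⁹ − 1) = 127750. This suggests (an assumption for illustration, not confirmed against the KafkaRoller source) that the message reports the cumulative configured delay rather than measured wall-clock time. A minimal sketch of that arithmetic:

```java
// Hypothetical illustration, not actual Strimzi code: if the log message
// sums a configured exponential backoff schedule instead of measuring
// elapsed time, the reported number matches even when the loop aborts
// immediately on a FatalProblem without ever sleeping.
public class BackoffDelaySum {

    /** Sum of the configured delays between attempts (attempts - 1 gaps),
     *  doubling after each gap, starting from initialMs. */
    static long totalConfiguredDelayMs(long initialMs, int attempts) {
        long total = 0;
        long delay = initialMs;
        for (int i = 0; i < attempts - 1; i++) {
            total += delay;
            delay *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        // 250 + 500 + 1000 + ... + 64000 = 250 * (2^9 - 1)
        System.out.println(totalConfiguredDelayMs(250, 10)); // prints 127750
    }
}
```

If that is what happens, a fix could be as simple as logging the actually elapsed time (or the number of attempts actually made) instead of the theoretical schedule total.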

Note: This seems to be a general issue that applies to controllers, brokers, mixed nodes, and even ZooKeeper-based clusters.

scholzj commented Dec 14, 2023

Discussed on the community call on 14 Dec 2023: does not seem like a high priority, but should be fixed.
