
[Bug] Log message when trying to roll pending pod is misleading #9427

Open
scholzj opened this issue Dec 4, 2023 · 1 comment
scholzj commented Dec 4, 2023

The KafkaRoller detects stuck pods while rolling the Kafka cluster and does not seem to wait for them to become ready. This results in the following messages in the log:

2023-12-04 09:09:33 INFO  ClusterOperator:142 - Triggering periodic reconciliation for namespace myproject
2023-12-04 09:09:33 INFO  AbstractOperator:265 - Reconciliation #19(timer) Kafka(myproject/my-cluster): Kafka my-cluster will be checked for creation or modification
2023-12-04 09:09:33 INFO  KafkaRoller:382 - Reconciliation #19(timer) Kafka(myproject/my-cluster): Could not verify pod my-cluster-controllers-2/2 is up-to-date, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Pod is unschedulable or is not starting
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.checkIfRestartOrReconfigureRequired(KafkaRoller.java:598) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:462) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$7(KafkaRoller.java:376) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2023-12-04 09:09:33 ERROR AbstractOperator:284 - Reconciliation #19(timer) Kafka(myproject/my-cluster): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Pod is unschedulable or is not starting
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.checkIfRestartOrReconfigureRequired(KafkaRoller.java:598) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:462) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$7(KafkaRoller.java:376) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2023-12-04 09:09:33 WARN  AbstractOperator:557 - Reconciliation #19(timer) Kafka(myproject/my-cluster): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Pod is unschedulable or is not starting
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.checkIfRestartOrReconfigureRequired(KafkaRoller.java:598) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:462) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$7(KafkaRoller.java:376) ~[io.strimzi.cluster-operator-0.39.0-SNAPSHOT.jar:0.39.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]

I'm not sure whether skipping the wait for readiness is intentional; it might have its reasons (and in any case, the next periodic reconciliation will check again within a few minutes at the latest, so it is not a problem per se). But if you check the timestamps, it is clear that this message is misleading:

Could not verify pod my-cluster-controllers-2/2 is up-to-date, giving up after 10 attempts. Total delay between attempts 127750ms

It may well have tried something 10 times, but it certainly did not wait for 127750ms: the whole reconciliation ran from start to end within one second. So we should fix the message to avoid misleading people who are analyzing it.
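Incidentally, 127750 ms is exactly the sum of an exponential backoff schedule that starts at 250 ms and doubles across the 9 gaps between 10 attempts: 250 × (2⁹ − 1) = 127750. This suggests (an assumption for illustration, not confirmed against the KafkaRoller source) that the message reports the cumulative configured delay rather than measured wall-clock time. A minimal sketch of that arithmetic:

```java
// Hypothetical illustration, not actual Strimzi code: if the log message
// sums a configured exponential backoff schedule instead of measuring
// elapsed time, the reported number matches even when the loop aborts
// immediately on a FatalProblem without ever sleeping.
public class BackoffDelaySum {

    /** Sum of the configured delays between attempts (attempts - 1 gaps),
     *  doubling after each gap, starting from initialMs. */
    static long totalConfiguredDelayMs(long initialMs, int attempts) {
        long total = 0;
        long delay = initialMs;
        for (int i = 0; i < attempts - 1; i++) {
            total += delay;
            delay *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        // 250 + 500 + 1000 + ... + 64000 = 250 * (2^9 - 1)
        System.out.println(totalConfiguredDelayMs(250, 10)); // prints 127750
    }
}
```

If that is what happens, a fix could be as simple as logging the actually elapsed time (or the number of attempts actually made) instead of the theoretical schedule total.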

Note: This seems to be a general issue that applies to controllers, brokers, mixed nodes, and even ZooKeeper-based clusters.

scholzj commented Dec 14, 2023

Discussed on the community call on 14 Dec 2023: does not seem like a high priority, but should be fixed.
