Chaos monkey test fails / Cluster does not survive reboot #404

sbernauer · 2023-10-12T08:45:18Z

During #400 we noticed (again) that HBase 2.4 does weird DNS roulette.
This was uncovered by adding a chaos monkey test similar to the one we already have in place for HDFS.

When running the chaos monkey test, HBase 2.4 logs random DNS failures, such as:

2023-10-11 13:27:58,532 INFO  [master/test-hbase-master-default-0:16000:becomeActiveMaster] retry.RetryInvocationHandler: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "test-hdfs-namenode-default-1.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over test-hdfs-namenode-default-1.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local:8020 after 13 failover attempts. Trying to failover after sleeping for 21829ms.

or

2023-10-11 13:29:01,311 WARN  [master/test-hbase-master-default-1:16000:becomeActiveMaster] ipc.Client: Address change detected. Old: test-hdfs-namenode-default-0.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local/10.244.0.188:8020 New: test-hdfs-namenode-default-0.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local/10.244.0.208:8020
2023-10-11 13:29:21,341 WARN  [master/test-hbase-master-default-1:16000:becomeActiveMaster] ipc.Client: Address change detected. Old: test-hdfs-namenode-default-1.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local/10.244.0.173:8020 New: test-hdfs-namenode-default-1.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local/10.244.0.210:8020
2023-10-11 13:29:42,657 INFO  [master/test-hbase-master-default-1:16000:becomeActiveMaster] retry.RetryInvocationHandler: org.apache.hadoop.net.ConnectTimeoutException: Call From test-hbase-master-default-1/10.244.0.201 to test-hdfs-namenode-default-0.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=test-hdfs-namenode-default-0.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local/10.244.0.188:8020]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout, while invoking ClientNamenodeProtocolTranslatorPB.setSafeMode over test-hdfs-namenode-default-0.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local/10.244.0.188:8020 after 2 failover attempts. Trying to failover after sleeping for 2803ms.
2023-10-11 13:29:21,342 INFO  [master/test-hbase-master-default-1:16000:becomeActiveMaster] retry.RetryInvocationHandler: org.apache.hadoop.net.ConnectTimeoutException: Call From test-hbase-master-default-1/10.244.0.201 to test-hdfs-namenode-default-1.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=test-hdfs-namenode-default-1.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local/10.244.0.173:8020]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout, while invoking ClientNamenodeProtocolTranslatorPB.setSafeMode over test-hdfs-namenode-default-1.test-hdfs-namenode-default.kuttl-test-joint-sloth.svc.cluster.local/10.244.0.173:8020 after 1 failover attempts. Trying to failover after sleeping for 1296ms.
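
To see which NameNode addresses changed during a test run, the `Address change detected` warnings can be pulled out of the HBase master log. The following is only a minimal sketch, assuming the log format shown above; `address_changes` is a hypothetical helper name and is not part of HBase or Hadoop.

```python
import re

# Matches the Hadoop ipc.Client warning as it appears in the logs above.
# Assumption: IPv4 addresses and the "host/ip:port" layout shown in this issue.
ADDRESS_CHANGE = re.compile(
    r"Address change detected\. "
    r"Old: (?P<host>\S+)/(?P<old_ip>[\d.]+):\d+ "
    r"New: \S+/(?P<new_ip>[\d.]+):\d+"
)


def address_changes(log_lines):
    """Return a (host, old_ip, new_ip) tuple for every address-change warning."""
    changes = []
    for line in log_lines:
        match = ADDRESS_CHANGE.search(line)
        if match:
            changes.append((match["host"], match["old_ip"], match["new_ip"]))
    return changes
```

Comparing the old IPs returned here with the `remote=` addresses in the `ConnectTimeoutException` lines shows whether a connection attempt still targeted a pod IP that had already been replaced.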

We also tried HBase 2.5, which makes the Phoenix test flaky: it passes about half of the time and otherwise fails with a timeout error.

sbernauer added type/bug priority/high type/internal-debt labels Oct 12, 2023

sbernauer changed the title ~~Let chaos monkey test pass~~ Chaos monkey test fails Oct 12, 2023

sbernauer changed the title ~~Chaos monkey test fails~~ Chaos monkey test fails / Cluster does not survive reboot Oct 12, 2023

lfrancke added the size/M label Dec 18, 2023
