
Cannot receive topic='nagios' #211

Open
psdhami09 opened this issue Oct 30, 2018 · 14 comments

Comments

@psdhami09

Hi Hari,

We are using the Perl Nagios plugin in our environment; we recently deployed it. However, we have observed some blips in the Nagios trends for our Kafka brokers. We thought the plugin went critical because Kafka was having an issue, but that's not the case: these blips occur frequently, and the plugin reports the error below:

State info: CRITICAL: Error: Cannot receive: topic='nagios'

Not sure if this is a known error in the plugin; could you please advise?

[screenshot]

@psdhami09
Author

Below are some more details:

define command{
    command_name    check_kafka
    command_line    /usr/local/nagios/libexec/nagios-plugins/check_kafka.pl -H $HOSTADDRESS$ -P $ARG1$ -T $ARG2$ -R $ARG3$
}

check_command check_kafka!9092!nagios!ISR
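For context, a matching service definition that wires this check_command up would look something like the following sketch (the host_name, service_description, and template name are hypothetical, not taken from the reporter's configuration):

define service{
    use                     generic-service
    host_name               kafka-broker-1
    service_description     Kafka Pub-Sub Check
    check_command           check_kafka!9092!nagios!ISR
}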

@psdhami09
Author

Zoom in on the screenshot below, which displays the plugin error:

[screenshot]

@HariSekhon
Owner

Have you tried the Python version for comparison? It may yield a different error message, as this message is generated by the underlying library. I personally prefer the Python version now.

@psdhami09
Author

Hey Hari,

I have tried the Python version. It doesn't show the same error, but there is another status message, shown below:

Status Info: Initial Service Pseudo-State

[screenshot]

Could you please confirm whether we need to worry about this message? If you can give some details on when this message appears, that would be a great help for us.

Thanks
Pritpal

@HariSekhon
Owner

HariSekhon commented Nov 5, 2018

Please run it on the command line with the -v -v -v switches to get full debug output and paste the full output here. It might be worth doing this for both the Perl and Python versions of check_kafka, as they both support this level of debug logging.

You can use anonymize.py from the DevOps Python Tools repo if you want to redact your hostnames/IP addresses from the text before pasting it here.

@psdhami09
Author

Hi Hari,

Thanks for the response

Here is the output for Python version:

user@788252a7:/usr/local/nagios/libexec/nagios-plugins# ./check_kafka.py -v -v -v -H IBUS-ibus-1 -P 9092 -T nagios
2018-11-08 17:49:07,709 - cli.pyparse_timeout:387 - DEBUG - getting $TIMEOUT value None
2018-11-08 17:49:07,709 - cli.pyparse_timeout:397 - DEBUG - timeout not set, using default timeout 10
2018-11-08 17:49:07,710 - utils.pylog_option:2213 - INFO - timeout: 10
2018-11-08 17:49:07,710 - cli.pytimeout:254 - DEBUG - setting timeout to 10 secs
2018-11-08 17:49:07,710 - cli.pymain:159 - INFO - Hari Sekhon check_kafka.py version 0.5.2 => CLI version 0.3 => Utils version 0.11.5
2018-11-08 17:49:07,710 - cli.pymain:160 - INFO - https://github.com/harisekhon/nagios-plugins
2018-11-08 17:49:07,710 - cli.pymain:161 - INFO - verbose level: 3 (DEBUG)
2018-11-08 17:49:07,710 - utils.pylog_option:2213 - INFO - timeout: 10
2018-11-08 17:49:07,710 - cli.pymain:164 - DEBUG - setting timeout alarm (10)
2018-11-08 17:49:07,735 - utils.pylog_option:2213 - INFO - host:port: IBUS-ibus-1:9092
2018-11-08 17:49:07,735 - utils.pylog_option:2213 - INFO - brokers: IBUS-ibus-1:9092
2018-11-08 17:49:07,736 - utils.pylog_option:2213 - INFO - topic: nagios
2018-11-08 17:49:07,736 - check_kafka.pyprocess_partitions:207 - INFO - partition not specified, getting random partition
2018-11-08 17:49:08,843 - check_kafka.pyprocess_partitions:209 - INFO - selected partition 0
2018-11-08 17:49:08,843 - utils.pylog_option:2213 - INFO - partition: 0
2018-11-08 17:49:08,844 - utils.pylog_option:2213 - INFO - acks: 1
2018-11-08 17:49:08,844 - threshold.pyinit:50 - DEBUG - warning threshold simple = upper
2018-11-08 17:49:08,844 - threshold.pyinit:51 - DEBUG - warning threshold positive = True
2018-11-08 17:49:08,844 - threshold.pyinit:52 - DEBUG - warning threshold integer = True
2018-11-08 17:49:08,844 - threshold.pyinit:53 - DEBUG - warning threshold min = None
2018-11-08 17:49:08,844 - threshold.pyinit:54 - DEBUG - warning threshold max = None
2018-11-08 17:49:08,844 - threshold.pyparse_threshold:72 - DEBUG - warning threshold given = '1'
2018-11-08 17:49:08,844 - threshold.pyparse_threshold:106 - DEBUG - warning threshold upper boundary = 1.0
2018-11-08 17:49:08,845 - threshold.pyparse_threshold:107 - DEBUG - warning threshold lower boundary = None
2018-11-08 17:49:08,845 - utils.pylog_option:2213 - INFO - warning: 1
2018-11-08 17:49:08,845 - threshold.pyinit:50 - DEBUG - critical threshold simple = upper
2018-11-08 17:49:08,845 - threshold.pyinit:51 - DEBUG - critical threshold positive = True
2018-11-08 17:49:08,845 - threshold.pyinit:52 - DEBUG - critical threshold integer = True
2018-11-08 17:49:08,845 - threshold.pyinit:53 - DEBUG - critical threshold min = None
2018-11-08 17:49:08,845 - threshold.pyinit:54 - DEBUG - critical threshold max = None
2018-11-08 17:49:08,845 - threshold.pyparse_threshold:72 - DEBUG - critical threshold given = '2'
2018-11-08 17:49:08,845 - threshold.pyparse_threshold:106 - DEBUG - critical threshold upper boundary = 2.0
2018-11-08 17:49:08,845 - threshold.pyparse_threshold:107 - DEBUG - critical threshold lower boundary = None
2018-11-08 17:49:08,845 - utils.pylog_option:2213 - INFO - critical: 2
2018-11-08 17:49:08,845 - pubsub_nagiosplugin.pyrun:117 - INFO - subscribing
2018-11-08 17:49:09,263 - check_kafka.pysubscribe:273 - DEBUG - partition assignments: set([])
2018-11-08 17:49:09,263 - check_kafka.pysubscribe:279 - DEBUG - assigning partition 0 to consumer
2018-11-08 17:49:09,264 - check_kafka.pysubscribe:282 - DEBUG - partition assignments: set([TopicPartition(topic='nagios', partition=0)])
2018-11-08 17:49:09,264 - check_kafka.pysubscribe:284 - DEBUG - getting current offset
2018-11-08 17:49:09,320 - check_kafka.pysubscribe:292 - DEBUG - recorded starting offset '4576'
2018-11-08 17:49:09,320 - pubsub_nagiosplugin.pyrun:119 - INFO - publishing message "Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch 1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'"
2018-11-08 17:49:09,320 - check_kafka.pypublish:296 - DEBUG - creating producer
2018-11-08 17:49:09,722 - check_kafka.pypublish:308 - DEBUG - producer.send()
2018-11-08 17:49:09,722 - check_kafka.pypublish:315 - DEBUG - producer.flush()
2018-11-08 17:49:09,738 - pubsub_nagiosplugin.pyrun:124 - INFO - published in 0.418 secs
2018-11-08 17:49:09,738 - pubsub_nagiosplugin.pyrun:129 - INFO - consuming message
2018-11-08 17:49:09,738 - check_kafka.pyconsume:320 - DEBUG - consumer.seek(4576)
2018-11-08 17:49:09,738 - check_kafka.pyconsume:323 - DEBUG - consumer.poll(timeout_ms=4500.0)
2018-11-08 17:49:09,796 - check_kafka.pyconsume:325 - DEBUG - msg object returned: {TopicPartition(topic=u'nagios', partition=0): [ConsumerRecord(topic=u'nagios', partition=0, offset=4576, timestamp=1541699349722, timestamp_type=0, key='check_kafka.py-zUZ7gE8Gv6Sy2HrpXqnN', value="Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch 1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'", checksum=None, serialized_key_size=35, serialized_value_size=156)]}
2018-11-08 17:49:09,796 - pubsub_nagiosplugin.pyrun:133 - INFO - consumed in 0.058 secs
2018-11-08 17:49:09,796 - pubsub_nagiosplugin.pyrun:134 - INFO - consumed message = "Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch 1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'"
2018-11-08 17:49:09,796 - pubsub_nagiosplugin.pyend:156 - INFO - checking consumed message "Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'" == published message "Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch 1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'"
OK: Kafka message published and consumed back successfully, published in 0.418 secs, consumed in 0.058 secs, total time = 0.951 secs | publish_time=0.418s;1;2 consume_time=0.058s;1;2 total_time=0.951s
user@78152a7:/usr/local/nagios/libexec/nagios-plugins#

And here is the output for the Perl version:

user@8122a7:/usr/local/nagios/libexec/nagios-plugins# ./check_kafka.pl -v -v -v -H IBUS-ibus-1 -P 9092 -T nagios
verbose mode on

check_kafka.pl version 0.3 => Hari Sekhon Utils version 1.19.2

host: IBUS-ibus-1
port: 9092
topic: nagios
required acks: 1
send-max-attempts: 1
receive-max-attempts: 1
retry-backoff: 200
sleep: 0.5

setting timeout to 10 secs

connecting to Kafka broker at IBUS-ibus-1:9092

Metadata:

Kafka topic 'AC_ADAPTER_COMMAND_RESPONSE_SMS_FINCH' partitions:
Partition: 0 Replicas: 1,2,3 ISR: 1,3,2 Leader: 1
Partition: 1 Replicas: 2,3,1 ISR: 3,2,1 Leader: 2
Partition: 2 Replicas: 3,1,2 ISR: 3,2,1 Leader: 3
Partition: 3 Replicas: 1,3,2 ISR: 1,3,2 Leader: 1
Partition: 4 Replicas: 2,1,3 ISR: 1,3,2 Leader: 2
Partition: 5 Replicas: 3,2,1 ISR: 3,2,1 Leader: 3
Partition: 6 Replicas: 1,2,3 ISR: 1,3,2 Leader: 1
Partition: 7 Replicas: 2,3,1 ISR: 3,2,1 Leader: 2

Kafka topic 'SP.CVPECUNOREQ' partitions:
UNKNOWN: 'SP.CVPECUNOREQ' 'SP' field not found. API may have changed. Please try latest version from https://github.com/harisekhon/nagios-plugins, re-run on command line with -vvv and if problem persists paste full output from -vvv mode in to a ticket requesting a fix/update at https://github.com/harisekhon/nagios-plugins/issues/new
user@78152a:/usr/local/nagios/libexec/nagios-plugins#
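The Python debug output above traces the plugin's round-trip logic: record the current consumer offset, publish a message carrying a unique random token, seek back to the recorded offset, consume, and compare the consumed message against the published one. A minimal stand-alone sketch of that comparison logic, using a plain in-memory list as a stand-in for a Kafka partition (names hypothetical, not the plugin's real API), might look like:

```python
import uuid

def round_trip_check(partition):
    """Sketch of a pub-sub round-trip check against an in-memory 'partition'."""
    # Record the current "offset" before publishing, as the plugin does.
    start_offset = len(partition)
    # Publish a message carrying a unique random token so this run's
    # message can't be confused with an earlier one.
    published = "Test message with random token '%s'" % uuid.uuid4().hex
    partition.append(published)
    # "Seek" back to the recorded offset and consume from there.
    consumed = partition[start_offset]
    # The check passes only if the consumed message matches the published one;
    # a mismatch corresponds to "failed to find matching consumer record".
    return consumed == published

partition = ["older message at offset 0"]
print(round_trip_check(partition))  # → True
```

The unique token is what lets the plugin distinguish its own freshly published message from stale records already sitting in the topic.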

@ethan-riskiq

Having the same issue with our production check, although it only appears to be happening on one host. I will paste debugging output here when I can reproduce it from the CLI.

@ethan-riskiq

ethan-riskiq commented Nov 13, 2018

verbose mode on

check_kafka.pl version 0.2.6  =>  Hari Sekhon Utils version 1.18.6

host:                     br01
port:                     6667
topic:                    nagios
partition:                0
required acks:            1
send-max-attempts:        1
receive-max-attempts:     1
retry-backoff:            200
sleep:                    0.5

setting timeout to 60 secs

connecting to Kafka broker at br01:6667
CRITICAL: failed to get metadata, broker offline or wrong port? (some deployments use 9092, some such as Hortonworks use 6667)

real	0m8.236s
user	0m0.361s
sys	0m0.049s

[root@mon1 ~]# time /usr/lib64/nagios/nagios-plugins/check_kafka.pl -v -v -H br01 -P 6667 --topic nagios
verbose mode on

host:                     br01
port:                     6667
topic:                    nagios
partition:                0
required acks:            1
send-max-attempts:        1
receive-max-attempts:     1
retry-backoff:            200
sleep:                    0.5

setting timeout to 10 secs

connecting to Kafka broker at br01:6667
connecting producer
connecting consumer
CRITICAL: Error: Can't get metadata: topic = 'nagios'

real	0m7.478s
user	0m0.374s
sys	0m0.041s

@ethan-riskiq

The port is definitely online; the service is responsive.
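An open TCP port only proves that a listener is accepting connections; it does not prove the broker is answering metadata requests, which is what the plugin's error is about. A quick library-free probe can separate those two failure modes. This is a sketch only (the broker host/port in the thread, br01:6667, is stood in for by a throwaway local listener so the demo is self-contained):

```python
import socket

def tcp_reachable(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout.
    This checks only the listener; Kafka metadata errors require speaking
    the broker protocol and are not detected here."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a throwaway local listener (a stand-in for broker 'br01:6667').
server = socket.socket()
server.bind(("127.0.0.1", 0))   # bind to a free ephemeral port
server.listen(1)                # connects complete via the listen backlog
port = server.getsockname()[1]
print(tcp_reachable("127.0.0.1", port))  # → True
server.close()
```

If this probe succeeds but the plugin still reports "Can't get metadata", the problem is likely at the Kafka protocol or client-library level rather than basic connectivity.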

@HariSekhon
Owner

@ethan-riskiq Did you try the python check_kafka.py plugin to see if it gives a more informative error than the Perl API is returning?

@HariSekhon
Owner

HariSekhon commented Nov 13, 2018

Also, I think you should use one more -v switch; three levels of verbosity will produce debug output, including more output from the API.

@ethan-riskiq

I have not been able to reproduce the issue via the Python version of the script. I've added a "-new" check that uses the Python version of the script and will see if similar issues occur when the Perl script returns this error.

@ethan-riskiq

Here is a gist of the Python debug output: https://gist.github.com/ethan-riskiq/25a2168b8143c8a59c807c41344154dc. It got the error: UnknownError: failed to find matching consumer record with key

@HariSekhon
Owner

HariSekhon commented Nov 26, 2018

That's an old version of check_kafka.py (0.3.9); the current version is 0.5.3.

Can you please update and then try again with the latest version, so that the traceback matches the current code for debugging?
