
Telegraf not restarting when OPC UA server restarts #15074

Open
john-heywood opened this issue Mar 27, 2024 · 7 comments
Labels
bug (unexpected problem or unintended behavior), waiting for response (waiting for response from contributor)

Comments

@john-heywood

Relevant telegraf.conf

n/a

Logs from Telegraf

not available

System info

Telegraf 1.29.0, Telegraf 1.27.1, Telegraf 1.27.2

Docker

No response

Steps to reproduce

  1. Restart the OPC UA server while Telegraf v1.27.2+ is running.

Expected behavior

When downloading to a PLC running an OPC UA server, I expected the server to restart, Telegraf to recognize the change and disconnect, then reconnect and return to reading data from the PLC's OPC UA server.

Actual behavior

When the OPC UA server restarts, Telegraf loses its connection to all of the nodes and logs "StatusBadNodeIDUnknown" error messages. This continues indefinitely until Telegraf is manually restarted, after which it runs as expected.
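
One plausible reading of that symptom, shown as a minimal sketch assuming the gopcua v0.5-style client API (the endpoint and node id below are placeholders, not the actual setup): BadNodeIDUnknown is an operation-level result, so Read can return err == nil while the bad status sits per node in resp.Results, which an error-only reconnect check would never see.

```go
// Hedged sketch, assuming the gopcua v0.5-style API; the endpoint and
// node id are placeholders. BadNodeIDUnknown is an operation-level
// status: Read may return err == nil while the bad status sits in
// resp.Results[i].Status, so error-only reconnect checks never fire.
package main

import (
	"context"
	"log"

	"github.com/gopcua/opcua"
	"github.com/gopcua/opcua/ua"
)

func main() {
	ctx := context.Background()

	client, err := opcua.NewClient("opc.tcp://plc.example:4840") // placeholder
	if err != nil {
		log.Fatal(err)
	}
	if err := client.Connect(ctx); err != nil {
		log.Fatal(err)
	}
	defer client.Close(ctx)

	node, err := ua.ParseNodeID("ns=3;s=PlcTag") // placeholder
	if err != nil {
		log.Fatal(err)
	}
	resp, err := client.Read(ctx, &ua.ReadRequest{
		NodesToRead: []*ua.ReadValueID{{NodeID: node, AttributeID: ua.AttributeIDValue}},
	})
	if err != nil {
		// Service-level failure: NOT the path the reported logs match.
		log.Fatalf("service-level failure: %v", err)
	}
	// This is the path the reported logs match: err is nil but the
	// per-node status is StatusBadNodeIDUnknown on every interval.
	log.Printf("per-node status: %v", resp.Results[0].Status)
}
```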

Additional info

This issue popped up only after upgrading to Telegraf v1.27.2. The version description says that some dependencies related to OPC UA were updated. When I downgrade to Telegraf v1.27.1 the problem does not occur. I have also seen it on an instance running Telegraf v1.29.0.

@john-heywood john-heywood added the bug unexpected problem or unintended behavior label Mar 27, 2024
@powersj
Contributor

powersj commented Mar 27, 2024

Hi,

Without logs or a config it is hard to say for sure, but this could also be a duplicate of #13296. You can check out this comment for some of the work that would be required to enable listening for nodes.

When I downgrade to Telegraf v1.27.1 this problem does not occur.

The big change was #13514, which was intended to ensure we reconnect. Again, without seeing logs to understand what is going on, I'm not sure what we can do.
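
A guard of that kind typically has the shape below. This is an illustrative sketch using gopcua's status constants, not the actual diff from #13514:

```go
// Illustrative only, not the code from #13514: the usual shape of a
// guard that treats connection-level OPC UA statuses as reconnect
// triggers. ua.StatusCode values implement the error interface, so a
// plain errors.Is comparison matches them directly.
package opcuautil

import (
	"errors"

	"github.com/gopcua/opcua/ua"
)

func needsReconnect(err error) bool {
	switch {
	case errors.Is(err, ua.StatusBadServerNotConnected),
		errors.Is(err, ua.StatusBadSessionIDInvalid),
		errors.Is(err, ua.StatusBadSecureChannelClosed):
		return true
	}
	return false
}
```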

@powersj powersj added the waiting for response waiting for response from contributor label Mar 27, 2024
@john-heywood
Author

I only have screenshots, sorry, but here they are. We are not using a listener, just the regular OPC UA input.
log_issue (screenshot)
^ This shows the error messages we have been seeing in Telegraf v1.27.2+ when it fails to reconnect to the OPC UA server.
log_v1_27_1_correct_restart (screenshot)
^ This shows the correct reconnection that occurs with Telegraf v1.27.1:

  1. The OPC UA server restarts
  2. Telegraf disconnects once it realizes something has changed
  3. Telegraf reconnects
  4. Data is read successfully (this sequence is sketched in code below)
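
That sequence maps onto a loop roughly like the following, a minimal sketch assuming the gopcua v0.5-style API rather than the plugin's actual code; `dial` is a hypothetical helper standing in for the plugin's connect path:

```go
package opcuautil

import (
	"context"
	"log"
	"time"

	"github.com/gopcua/opcua"
	"github.com/gopcua/opcua/ua"
)

// dial is a hypothetical helper wrapping the connect path; the real
// plugin also applies security policy and authentication options.
func dial(ctx context.Context, endpoint string) *opcua.Client {
	for {
		c, err := opcua.NewClient(endpoint)
		if err == nil {
			if err = c.Connect(ctx); err == nil {
				return c
			}
		}
		log.Printf("connect failed, retrying: %v", err)
		time.Sleep(5 * time.Second)
	}
}

// runWithReconnect sketches the four-step sequence above: read, detect
// the failure caused by the server restart, rebuild the client, resume.
func runWithReconnect(ctx context.Context, endpoint string, req *ua.ReadRequest) {
	client := dial(ctx, endpoint)
	for {
		if _, err := client.Read(ctx, req); err != nil {
			log.Printf("read failed, reconnecting: %v", err) // steps 1-2
			client.Close(ctx)            // drop the dead session
			client = dial(ctx, endpoint) // step 3: dial and activate again
			continue                     // step 4: resume reading
		}
		time.Sleep(10 * time.Second)
	}
}
```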

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 27, 2024
@powersj
Contributor

powersj commented Mar 27, 2024

@LarsStegman,

This looks like a duplicate of #13296, but what I don't understand is why the error message from OPC UA changed from "bad server not connected" to "bad node id". Is that due to an update in how the library handles errors?

If we are getting a bad node id, should we assume we are not connected and reconnect?
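
A minimal sketch of that heuristic, assuming gopcua's ua status constants (hypothetical, not existing Telegraf code): treat a result set where every node reports BadNodeIDUnknown as a stale session, so that a single genuinely misconfigured node id does not trigger reconnect storms.

```go
package opcuautil

import "github.com/gopcua/opcua/ua"

// allNodesUnknown sketches the workaround floated above; it is not
// existing Telegraf code. If every previously working node suddenly
// reports StatusBadNodeIDUnknown, the session is probably stale and
// worth rebuilding; a single bad node is more likely a config error.
func allNodesUnknown(results []*ua.DataValue) bool {
	if len(results) == 0 {
		return false
	}
	for _, r := range results {
		if r.Status != ua.StatusBadNodeIDUnknown {
			return false
		}
	}
	return true
}
```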

@powersj powersj added the waiting for response waiting for response from contributor label Mar 27, 2024
@LarsStegman
Contributor

LarsStegman commented Mar 27, 2024

@powersj

Is that due to an update in the library with how it handles errors?

I think this is the only explanation, since nothing else changed in the OPC UA code. I noticed that the version we use is already quite outdated, so maybe we should consider updating soon(ish)

If we are getting a bad node id should we assume we are not connected and re-connect?

I would prefer not to. In theory, if the node id is bad, reconnecting to the server shouldn't make a difference. In practice, OPC UA is a super complex standard and reconnecting is easier than properly fixing the issue. Sometimes servers are not implemented properly, or maybe we set up the connection improperly. Unfortunately, without access to the server to test against, this will be very hard to reproduce.

@john-heywood are you able to reproduce this issue with the open62541/open62541 docker image we use for the unit tests?
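
For reference, a hedged harness for that experiment might look like the sketch below. The docker invocation and port mapping are assumptions, and it polls ns=0;i=2258 (Server_ServerStatus_CurrentTime) so no server-specific tags are needed; restart the container mid-loop and watch whether the reads recover.

```go
// Poll the open62541 test image, e.g. started with
// `docker run --rm -p 4840:4840 open62541/open62541` (flags assumed),
// and restart the container mid-loop to observe the client's behavior.
package main

import (
	"context"
	"log"
	"time"

	"github.com/gopcua/opcua"
	"github.com/gopcua/opcua/ua"
)

func main() {
	ctx := context.Background()

	client, err := opcua.NewClient("opc.tcp://localhost:4840")
	if err != nil {
		log.Fatal(err)
	}
	if err := client.Connect(ctx); err != nil {
		log.Fatal(err)
	}
	defer client.Close(ctx)

	// ns=0;i=2258 is the standard Server_ServerStatus_CurrentTime node,
	// present on every conformant server, so no custom tags are needed.
	req := &ua.ReadRequest{
		NodesToRead: []*ua.ReadValueID{{NodeID: ua.NewNumericNodeID(0, 2258)}},
	}
	for {
		if resp, err := client.Read(ctx, req); err != nil {
			log.Printf("read error: %v", err)
		} else {
			log.Printf("status=%v value=%v", resp.Results[0].Status, resp.Results[0].Value)
		}
		time.Sleep(2 * time.Second)
	}
}
```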

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 27, 2024
@powersj
Contributor

powersj commented Mar 27, 2024

I noticed that the version we use is already quite outdated, so maybe we should consider updating soon(ish)

Our gopcua in master is at 0.5.3, which looks to be the latest? That went out in v1.30.

@jyarr

jyarr commented Mar 27, 2024

I have replicated the issue on an OPC UA server hosted on the same manufacturer's hardware.

Siemens Model: S7-1515F-2PN, Article No: 6ES7515-2FM02-0AB0, Firmware: 2.9.7.

Telegraf reconnects in v1.27.1 and does not successfully reconnect in v1.27.2+, including the latest v1.30.0, which updated to gopcua 0.5.3.

Attached are my telegraf.conf, Telegraf logs, and Wireshark captures of a successful reconnect in v1.27.1 and a failed reconnect in v1.27.2.

Restarting Telegraf in v1.27.2+ was the only way I could get it to reconnect.

telegraf.conf.txt

telegrafLog_v1_27_2_UnsuccessfulReconnect_TelegrafRestartRequired.txt

telegrafLog_v1_27_1_SuccessfulReconnect.txt

telegraf_v1_27_2_opcua_wireshark_capture_UnsuccessfulReconnect_FixedAfterTelegrafRestart.pcapng.gz

telegraf_v1_27_1_opcua_wireshark_capture_SuccessfulReconnect.pcapng.gz

@powersj
Contributor

powersj commented May 10, 2024

@jyarr,

Were you able to reproduce this with the open62541/open62541 docker image that Lars mentioned above? That would help narrow down whether this is an upstream issue or a Telegraf issue.

@powersj powersj added the waiting for response waiting for response from contributor label May 10, 2024