Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Socket read race condition with URC #175

Closed
arnoutdekimo opened this issue Dec 4, 2023 · 17 comments
Closed

Socket read race condition with URC #175

arnoutdekimo opened this issue Dec 4, 2023 · 17 comments

Comments

@arnoutdekimo
Copy link

arnoutdekimo commented Dec 4, 2023

Hi,

This is somewhat related to #168, but in short, I am seeing the following race condition:

Scenario:
The scenario is that the following happens on AT:

image

Meaning that we write to a socket, and immeidiately afterwards, the modem sends an URC with new data length + socket closes (because server has sent all the data)

Issue:
This results in the following race condition:

In uCellSockRead:

image

The read code, first checks if the URC has executed, and has a length.
But as you can see, during the lock, the URC parsing came in, and only then determined the length.

What's worse:

image

The SOCL URC is parsed immeidately afterwards.

Which then executes, still while the other thread is yielding in atcallbacks

image

The result of this though, is that

image

The socket handle is set to -1, causing the library to poll an invalid socket for data (AT+USORD=-1,0), and the socket API fails to deliver the data.

(FYI, I noticed this while testing on the LENA-R8, while the SARA-R4 did not show this behavior. Presumably because the reply timing is sutbly different)

Kind regards,
Arnout

@arnoutdekimo
Copy link
Author

So the main issue I think is that a socket can be closed by the modem, but this does not mean all data should be "freed" (in sockFree) . Because even when a socket is closed by the remote, we can still be in the position where we want to read data from the socket. Perhaps it should be marked as eligable for freeing, and when the caller calls uSockClose or uSockCleanUp, it should actually be freed?

@arnoutdekimo
Copy link
Author

Sigh, I quickly tried to put a return statement in sockFree, which then still allows the lib to poll the data, but..

image

The modem does not support reading from a socket that was already closed.

Meaning, the actual race condition is in the modem.. The MCU side must somehow read out the data faster than the modem receives the socket close from the server?

image

Yet on AT, this is even at the -same time-, meaning it is impossible to read out this data (unless the server does not close the connection immediately afterwards?)

@RobMeades
Copy link
Contributor

Hmmm, I did wonder if this might be the case. Let me talk to the relevant people internally to see how they anticipated this would work.

@RobMeades
Copy link
Contributor

While I'm checking that, I guess this is what the "shutdown" state of a TCP socket is for, but the AT interface does not export the shutdown state of a socket out as far as the MCU. So, if you do NOT have control over the speed at which the remote server is closing the TCP connection, it might be that you must use PPP mode.

I can't recall which MCU platform you are using but, if it is ESP-IDF, there is a preview branch of our proposed integration with PPP mode, on ESP-IDF, which passes internal testing, here:

https://github.com/u-blox/ubxlib/tree/preview_cell_ppp_espidf_rmea

If you are using ESP-IDF and you fancy going that way, instructions on how to build it can be found here:

https://github.com/u-blox/ubxlib/tree/preview_cell_ppp_espidf_rmea/port/platform/esp-idf#ppp-level-integration-with-cellular

...and an example of using it can be found here:

U_PORT_TEST_FUNCTION("[example]", "examplePppEspIdfSockets")

Note that we've not release this yet [i.e. it remains subject to change, and further testing] as we want to get it working with Zephyr first, and Zephyr PPP support is still experimental (it seems to be fighting back).

@arnoutdekimo
Copy link
Author

it might be that you must use PPP mode.

Well, that sort of defeats the purpose of having a network stack in the module, which is a huge benefit.
Without it, we'd need to run our own TCP stack, our own SSL stack, etc. That has huge implications.
I agree that as a workaround it would work, but it is not usable in my usecase (with the MCU that I am targetting)

If the module supports sockets over AT, I would expect them to be usable :)

@RobMeades
Copy link
Contributor

If the module supports sockets over AT, I would expect them to be usable :)

Understood: I'm trying to determine how this is meant to work within the module.

@arnoutdekimo
Copy link
Author

arnoutdekimo commented Dec 4, 2023

Some other info:

  • (Normally, when using e.g. TCP sockets on posix, and polling for events (e.g. select call), during which the remote closed the connection. the OS will keep saying there is data on sockets until the buffer is depleted, before saying that the socket is closed (to avoid missing data))

  • I tried to configure SO_LINGER. This does not really have anything to do with this behavior (it has to do with keeping the outbound buffer alive after we initiate a close. ) But anyway, I thought I'd quickly give it a try.

However, on the LENA-R8, this seems to simply be refused.

image

(Again, this is less relevant for me, as it should not even fix the issue, but .. it is also unexpected, so I thought I'd mention it)

@RobMeades
Copy link
Contributor

RobMeades commented Dec 4, 2023

Are you doing this through ubxlib with uSockOptionSet(), something like:

uSockLinger_t linger = {.onNotOff = true, lingerSeconds = 10};
uSockOptionSet(descriptor, U_SOCK_OPT_LEVEL_SOCK, U_SOCK_OPT_LINGER, &linger, sizeof(linger));

...or directly with your own AT commands? If the latter than I think it should be AT+USOSO=0,65535,128,1,10; i.e. a 1 to switch linger on and then another parameter with the linger duration:

image

@arnoutdekimo
Copy link
Author

Ah, I indeed was incorrectly supplying an int rather that the uSockLinger_t option. When I use the correct struct (your snippet) the linger option does get accepted.

However, as expected, this does not change the behavior, and the modem keeps refusing the read out data after the socket is closed

image

@RobMeades
Copy link
Contributor

RobMeades commented Dec 4, 2023

Understood. FYI, LENA-R8 is, internally, utterly different from SARA-R422 (in terms of SW as well as in terms of HW); when you tried this with SARA-R422 did it appear to hold-off sending the +UUSOCL: 0 out until all the data was read? Not that this matters if you need LENA-R8 to work in this scenario.

@arnoutdekimo
Copy link
Author

Yes, I think so. I now swapped hardware so I can double check later if I swap back, but if you look at older posts I made, like #165 (comment)

You can see in the logs

0:02:00:917 -
0:02:00:918 - OK
0:02:00:918 -
0:02:00:918 - +UUSORD: 0,136
0:02:00:919 - ATclientreadbytes-result 1024 bytes
0:02:00:919 - AT+USORD=0,136
0:02:01:033 -
0:02:01:073 -
0:02:01:073 - +USORD: 0,136,"[00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][bc][02][00] [00][00][00][00][01][00][00][00][00][00][00][00]J[00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00]J[00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00][00]"
0:02:01:079 -
0:02:01:079 - OK
0:02:01:079 - ATclientreadbytes-READrequest 1 bytes
0:02:01:080 - ATclientreadbytes-result 1 bytes
0:02:01:080 - ATclientreadbytes-READrequest 136 bytes
0:02:01:080 - ATclientreadbytes-result 136 bytes
0:02:01:080 - AT+CSQ
0:02:01:182 -
0:02:01:192 - +UUSOCL: 0

It shows how the SOCL indeed comes just after the last outstanding byte was read + the connection was closed by the remote in the meantime.

@RobMeades
Copy link
Contributor

In that case I think the best thing to do is for me to report this as a LENA-R8 bug. However, that doesn't help you as the FW release cycles for our cellular modules are loooong (given certification requirements etc.). Do you have any influence over the behaviour of the server, any way to delay the socket close?

@arnoutdekimo
Copy link
Author

Yes, I think this is indeed a modem issue, and please do report it:) I already noticed something in your release notes that the LENA-R8 did not support the httplib correctly. I think now manually doing HTTP over sockets won't be that trivial either, at least when the server closes the socket upon replying.

But indeed, I also figured that waiting for the release was not going to be a fast path.
I think changing the server behavior is probably going to be path of least resistance. (But I still have to test that that works, as well as the SSL stack etc). But if I want to have something working asap, that'll probably be the way to go.

@RobMeades
Copy link
Contributor

FYI, not sure if it matters in this use-case or not, but one reason to have the server not close a connection immediately is, if this is with HTTPS, then the device has the opportunity to send further requests on the same connection (or alternatively close the connection) without having to go through all of the TLS security stuff again. In other words, putting the connection-closure responsibility on the device, with just a timeout-type connection closure on the server, gives that little but more flexibility/efficiency.

Of course, if this is purely your FOTA server and the device is not going to use the connection any further, I can see why you do what you do, just thought I'd mention it.

@RobMeades
Copy link
Contributor

Hi Arnout: I can confirm that this has been reported as a LENA-R8 bug and a fix has been made. When a release with that fix in might be public I've no idea I'm afraid, it will depend on approval cycles and the like: would you like me to leave this open until then?

@arnoutdekimo
Copy link
Author

Thanks for the feedback!
In the meantime, I changed the server behavior to not autoclose the connection as a workaround. You don't have to keep the issue open just for me.

@RobMeades
Copy link
Contributor

OK: I will close this issue but, should I find out that a release is scheduled, I will update this issue anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants