sometimes doesn't recover from network errors on Linux #215
This was pilot error, sorry, I was misled by something indicating the network was resolving DNS when in fact it wasn't. |
No worries - thanks for following up. |
I might have spoken too soon, as I can currently see this:
Is it possible that a negative lookup has somehow been cached by the process? Note that in other cases where I lose connectivity, it does recover. This might have something to do with openconnect connection loss. |
Thanks for following up. This reminds me of issues I had myself when using the proxy on a macOS laptop – it wouldn't always regain connectivity after sleep, and it seemed to be a Python issue rather than anything that the proxy itself was doing. I never managed to find the root cause, so as a workaround I added support for macOS's I haven't seen the issue again since adding this functionality. Do you know of a similar mechanism for your operating system? |
I don't know offhand of any such API on Linux. Since this only happens sometimes, I'm going to try strace to see if I can winkle out the circumstances. It's definitely related to VPN teardown, but happens (sometimes) even when DNS is pointing at the right server again. |
Coincidentally I just ran into something similar to this after using a patchy internet connection for a few weeks. The proxy got into a state where it worked fine when tokens hadn't yet expired, but those requiring renewal were unable to resolve the token URL – every request would time out, even when there was a reliable internet connection. I wondered whether this was because the This is just a theory, and it may well be incorrect (for example, it's a little strange that backed-up connections that don't themselves time out might cause future ones to time out). I don't really have a reliable way of testing it, but it did remind me of this issue, so I thought I'd mention it. |
I've been watching like a hawk for problems since filing this ticket and predictably it's behaved flawlessly. Nonetheless, I've updated to include that change. If I can reproduce any issue again I'll update. |
Just wondering whether this has happened again since the last update? |
Nope! Feel free to close. It's either your theory above, or I've just been lucky. Either way I can file a new one if needed. Thanks! |
Good to know - let's hope it doesn't happen again, but if it does, feel free to reopen this issue so we have the context. |
Caught it again
I have a corefile if there's any chance post-mortem debugging could help, or I'm also happy to add any/all debug you might suggest to catch it again next time. (Unfortunately I rely on it so can't leave it broken for long!) |
Thanks for following up – yes, you might as well share the corefile just in case (though there's probably only a slim chance it'll be useful). I don't really have any suggestions about debugging you could add, and am still really puzzled as to what could be causing this, but will reopen the issue in case anyone else happens to encounter it. One obvious solution is to hardcode the I would still like to stop this happening though, so I did a bit of searching – none of the leads were particularly useful, but there are a few suggestions:
The most useful thing would be to find a way to reliably replicate this, but perhaps one of these leads will get you there. Let's hope so! |
No unusual settings in gai.conf. I've added much more debug to create_socket() - thanks. Will report back if and when it happens again.
Why do you have a loop that will only ever iterate once here? Is this trying to account for |
BTW I have a theory here: this particular host has no IPv6 connectivity. I suspect that in some situations when taking down openconnect, it ends up re-ordering the GAI results such that the IPv6 addresses come first. When IPv4 connectivity is restored, the IPv4 addresses end up at the end of the list, but we never get to those working addresses. |
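One way to check the ordering theory above is to log the address families in the order `getaddrinfo` returns them. The following is only a minimal sketch: `family_order` and `ipv4_first` are hypothetical helper names that are not part of the proxy, and the reordering is shown purely as an illustration, not as a proposed fix.

```python
import socket


def family_order(results):
    """Return the address-family names in the order getaddrinfo listed them."""
    return [socket.AddressFamily(r[0]).name for r in results]


def ipv4_first(results):
    """Stable sort that moves AF_INET entries ahead of everything else."""
    return sorted(results, key=lambda r: r[0] != socket.AF_INET)


# Example usage (performs a real lookup, so it is left commented out):
# results = socket.getaddrinfo('localhost', 993, socket.AF_UNSPEC, socket.SOCK_STREAM)
# print(family_order(results))
# print(family_order(ipv4_first(results)))
```

Logging `family_order` before and after a VPN teardown would show whether the IPv6 entries really do move to the front of the list.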
I think you're on to something here – initially this was a feature submitted in #140 by an external contributor, but it happened in parallel to my own development, so I merged the two versions. But it looks like I didn't fully capture the intent of that loop. It might be worth trying the version below, then attempting to replicate the network issues you've outlined to see what debug output you get. If you see the
def create_socket(self, socket_family=socket.AF_UNSPEC, socket_type=socket.SOCK_STREAM):
    # look up address and create a socket for the first resolved IPv4 or IPv6 address that is successful
    for a in socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM):
        try:
            super().create_socket(a[0], socket.SOCK_STREAM)
            return
        except OSError as e:
            Log.debug(self.info_string(), 'Unable to create socket', a, ':', e, '- trying next result')
    raise socket.gaierror(8, 'getaddrinfo failed: unable to resolve host')
|
I'm running a chattier, fatal version of the above (since I want to catch it to prove the theory). It may take a long time to see it though. I think you need to still re-raise the exception if you've run out of "a" to try BTW |
It was not that. With this diff:
Happy case:
Unhappy case that I reproduced today:
That is, getaddrinfo() itself is raising the exception, not any attempt to connect to a socket. |
Another python3 invocation at the same time:
So in this case it doesn't appear to be a problem generic to Python. |
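The distinction drawn above (name resolution failing versus a connection attempt failing) can be checked from a separate Python process with a small probe. This is a rough sketch under stated assumptions: `probe` is a hypothetical helper, not part of the proxy, and the `resolve` and `connect` parameters exist only so the two stages can be swapped out or tested independently.

```python
import socket


def probe(host, port, resolve=socket.getaddrinfo, connect=socket.create_connection, timeout=5):
    """Report whether a failure happens at name resolution or at connection time."""
    try:
        results = resolve(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    except socket.gaierror as e:
        # getaddrinfo itself failed: no connection was ever attempted
        return 'resolve-failed', str(e)

    last_error = None
    for _family, _type, _proto, _canonname, sockaddr in results:
        try:
            # sockaddr[:2] is (address, port) for both IPv4 and IPv6 results
            connect(sockaddr[:2], timeout=timeout).close()
            return 'ok', sockaddr[0]
        except OSError as e:
            last_error = e  # remember the failure and try the next address
    return 'connect-failed', str(last_error)
```

If `probe` reports `ok` in a fresh interpreter while the long-running process keeps failing on the same host, the fault is more likely to lie in that process's resolver state than in the network.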
strace from the proxy process:
ELIDED and ELIDED2 are in the RFC 1918 10.0.0.0/8 range and I suspect are the correct nameservers, but from when the VPN connection was up! They are no longer reachable, but the Python process is stuck on them. |
did not fix it, though I did not expect it to. Somehow this particular process has apparently managed to cache the previous resolvectl setting (since I expect the vpnc-script to modify things that way) |
Symptoms are a match for e.g. https://sourceware.org/bugzilla/show_bug.cgi?id=25420 Nonetheless, I think I've proven to my satisfaction that this issue has nothing to do with emailproxy itself at all, so feel free to close the ticket if you like. |
Thanks for the detailed follow-ups. I agree with your conclusion that this is unrelated to the proxy, and is likely to be a problem whose root cause is somewhere else on the network stack. I do think the proposed edit is worthwhile because it makes other potential network issues a bit more transparent. Before merging that, though, I wondered whether a small tweak to that approach might let us work around these issues from the proxy's side. The current
def create_socket(self, socket_family=socket.AF_UNSPEC, socket_type=socket.SOCK_STREAM):
    # look up address and create a socket for the first resolved IPv4 or IPv6 address that is successful
    try:
        gai = socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM)
    except OSError as e:
        # see: https://github.com/simonrob/email-oauth2-proxy/issues/215 - getaddrinfo can fail
        Log.debug(self.info_string(), 'Falling back to default socket; getaddrinfo failed:', e)
        super().create_socket(socket.AF_INET, socket.SOCK_STREAM)
        return
    for address in gai:
        try:
            super().create_socket(address[0], socket.SOCK_STREAM)
            return
        except OSError as e:
            Log.debug(self.info_string(), 'Unable to create socket', address, ':', e, '- trying next result')
    raise socket.gaierror(8, 'All socket creation attempts failed: unable to resolve host')
|
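For anyone who wants to experiment with this fallback pattern outside the proxy, the same control flow can be written as a standalone function. This is only an illustrative sketch: `open_first_socket` is a hypothetical name, the `resolve` and `make_socket` parameters are assumptions added so the behaviour can be exercised without a network, and it does not use the proxy's `Log` or its `create_socket` override.

```python
import socket


def open_first_socket(host, port, resolve=socket.getaddrinfo, make_socket=socket.socket):
    """Create a socket for the first resolved address family that works."""
    try:
        results = resolve(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    except OSError:
        # getaddrinfo failed (socket.gaierror is an OSError subclass):
        # fall back to a plain IPv4 socket, mirroring the snippet above
        return make_socket(socket.AF_INET, socket.SOCK_STREAM)

    for family, *_rest in results:
        try:
            return make_socket(family, socket.SOCK_STREAM)
        except OSError:
            continue  # this family is unusable here; try the next result
    raise OSError('All socket creation attempts failed')
```

Note that the fallback only changes which socket is created. As pointed out in the reply that follows, it would not help when every lookup in the process is failing, because the later connection attempt still needs to resolve the hostname.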
Yes, the fix for the exception loop should be made regardless; it's not related to this ticket, but it does look like a bug. Re the proposed change, this would probably not help as all name resolution is broken at this point in glibc. I spent some time in its bowels and got this far: https://sourceware.org/pipermail/libc-alpha/2024-March/155234.html To my eyes this is a plain glibc bug but we'll see. |
Now filed here: https://sourceware.org/bugzilla/show_bug.cgi?id=31476 |
I'm using the proxy as a systemd service, and if I, for example, unplug my router, then I get:
That's fine, of course, but the proxy never recovers from this state when networking is up again. I have to restart it.