Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate shutdown handling from unhandled exceptions in main bot task #5780

Open
Jackenmen opened this issue Jun 21, 2022 · 10 comments · May be fixed by #6261
Open

Investigate shutdown handling from unhandled exceptions in main bot task #5780

Jackenmen opened this issue Jun 21, 2022 · 10 comments · May be fixed by #6261
Labels
Category: Core - Command-line Interfaces This is related to Red's CLIs (redbot, redbot-launcher, redbot-setup). Status: Needs Info Needs more info, probably from the author. Type: Bug Unexpected behavior, result, or exception. In case of PRs, it is a fix for the foregoing.

Comments

@Jackenmen
Copy link
Member

It would seem that Red may infinitely hang during the handling of such exceptions.

An example:

[2022-06-21 06:44:47] [CRITICAL] red.main: The main bot task didn't handle an exception and has crashed
Traceback (most recent call last):
  File "/home/ubuntu/red-venv/lib/python3.8/site-packages/redbot/__main__.py", line 466, in red_exception_handler
    red_task.result()
  File "/home/ubuntu/red-venv/lib/python3.8/site-packages/redbot/__main__.py", line 360, in run_bot
    await red.start(token, bot=True)
  File "/home/ubuntu/red-venv/lib/python3.8/site-packages/redbot/core/bot.py", line 1187, in start
    await self.login(*args)
  File "/home/ubuntu/red-venv/lib/python3.8/site-packages/discord/client.py", line 511, in login
    await self.http.static_login(token.strip(), bot=bot)
  File "/home/ubuntu/red-venv/lib/python3.8/site-packages/discord/http.py", line 300, in static_login
    data = await self.request(Route('GET', '/users/@me'))
  File "/home/ubuntu/red-venv/lib/python3.8/site-packages/discord/http.py", line 265, in request
    raise DiscordServerError(r, data)
discord.errors.DiscordServerError: 500 Internal Server Error (error code: 0): <html>
<head><title>500 Internal Server Error</title><script async src='/cdn-cgi/bm/cv/669835187/api.js'></script></head>
<body bgcolor="white">
<center><h1>500 Internal Server Error</h1></center>
<hr><center>nginx</center>
<script type="text/javascript">(function(){window['__CF$cv$params']={r:'71eac6d9c8a49b46',m:'_iRxzcZJICzOmK6RBDUCqIPQZX>
</html>
[2022-06-21 06:44:47] [WARNING] red.main: Attempting to die as gracefully as possible...
[2022-06-21 06:44:47] [INFO] red.main: Shutting down from unhandled exception
@Jackenmen Jackenmen added Type: Bug Unexpected behavior, result, or exception. In case of PRs, it is a fix for the foregoing. Category: Bot Core labels Jun 21, 2022
@github-actions github-actions bot added the Status: Needs Triage This has not been labeled or discussed for handling yet. label Jun 21, 2022
@Flame442 Flame442 added Status: Needs Info Needs more info, probably from the author. and removed Status: Needs Triage This has not been labeled or discussed for handling yet. labels Sep 15, 2022
@Jackenmen Jackenmen added Category: Core - Command-line Interfaces This is related to Red's CLIs (redbot, redbot-launcher, redbot-setup). and removed Category: Bot Core labels Feb 12, 2023
@PewnyPL
Copy link

PewnyPL commented Apr 9, 2023

It has been a while since this issue has been posted, however, it seems to be exactly what happened twice to my bot.
I have a bot instance running on Ubuntu 20.04 LTS based server. We had 2 longer power outages recently, in both cases the server got up before the modem could connect back to the ISP, resulting in the bot starting while offline. This is pretty much what happened:

[2023-04-05 11:18:47] [INFO] red: Loading packages...
[2023-04-05 11:18:48] [INFO] red: Loaded packages: permissions, downloader, roletools, admin, mod, cleanup, dice
[2023-04-05 11:18:48] [CRITICAL] red.main: The main bot task didn't handle an exception and has crashed
Traceback (most recent call last):
  File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/connector.py", line 999, in _create_direct_connection
    hosts = await asyncio.shield(host_resolved)
  File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/connector.py", line 865, in _resolve_host
    addrs = await self._resolver.resolve(host, port, family=self._family)
  File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/resolver.py", line 31, in resolve
    infos = await self._loop.getaddrinfo(
  File "uvloop/loop.pyx", line 1514, in getaddrinfo
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ppl/redenv/lib/python3.9/site-packages/redbot/__main__.py", line 477, in red_exception_handler
    red_task.result()
  File "/home/ppl/redenv/lib/python3.9/site-packages/redbot/__main__.py", line 389, in run_bot
    await red.start(token, bot=True, cli_flags=cli_flags)
  File "/home/ppl/redenv/lib/python3.9/site-packages/redbot/core/bot.py", line 1202, in start
    return await super().start(*args, **kwargs)
  File "/home/ppl/redenv/lib/python3.9/site-packages/discord/client.py", line 665, in start
    await self.login(*args, bot=bot)
  File "/home/ppl/redenv/lib/python3.9/site-packages/discord/client.py", line 511, in login
    await self.http.static_login(token.strip(), bot=bot)
  File "/home/ppl/redenv/lib/python3.9/site-packages/discord/http.py", line 300, in static_login
    data = await self.request(Route('GET', '/users/@me'))
  File "/home/ppl/redenv/lib/python3.9/site-packages/discord/http.py", line 192, in request
    async with self.__session.request(method, url, **kwargs) as r:
  File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/client.py", line 520, in _request
    conn = await self._connector.connect(
  File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/connector.py", line 535, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/connector.py", line 892, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/connector.py", line 1011, in _create_direct_connection
    raise ClientConnectorError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host discord.com:443 ssl:default [Name or service not known]
[2023-04-05 11:18:48] [WARNING] red.main: Attempting to die as gracefully as possible...
[2023-04-05 11:18:48] [INFO] red.main: Shutting down from unhandled exception
[2023-04-09 01:00:32] [INFO] red.main: Signals.SIGTERM received. Quitting...
[2023-04-09 01:00:32] [INFO] red.main: Shutting down with exit code: ExitCodes.SHUTDOWN
[2023-04-09 01:00:32] [INFO] red.main: Please wait, cleaning up a bit more

As soon as the bot loaded, it threw an exception about being unable to connect to Discord. HOWEVER, as you can see from timestamps, despite saying "Shutting down", it does not, it's still running, due to which systemd sees it as still running, not restarting it automatically. It wasn't until a few days later that one of the mods noticed the bot is down that it was manually restarted (hence the 3 messages at the end of the log).
Additionally, here is the capture from systemd showing the log, while also showing the service as active:

ppl@pcdane:~$ sudo service red@VappyBot status
[sudo] password for ppl:
● [email protected] - VappyBot redbot
     Loaded: loaded (/etc/systemd/system/[email protected]; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-04-05 11:18:44 CEST; 3 days ago
   Main PID: 1663 (python)
      Tasks: 8 (limit: 2020)
     Memory: 51.0M
     CGroup: /system.slice/system-red.slice/[email protected]
             └─1663 /home/ppl/redenv/bin/python -O -m redbot VappyBot --no-prompt

Apr 05 11:18:48 pcdane python[1663]:     conn = await self._connector.connect(
Apr 05 11:18:48 pcdane python[1663]:   File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/connector.py", line 535, in connect
Apr 05 11:18:48 pcdane python[1663]:     proto = await self._create_connection(req, traces, timeout)
Apr 05 11:18:48 pcdane python[1663]:   File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/connector.py", line 892, in _create_connection
Apr 05 11:18:48 pcdane python[1663]:     _, proto = await self._create_direct_connection(req, traces, timeout)
Apr 05 11:18:48 pcdane python[1663]:   File "/home/ppl/redenv/lib/python3.9/site-packages/aiohttp/connector.py", line 1011, in _create_direct_connection
Apr 05 11:18:48 pcdane python[1663]:     raise ClientConnectorError(req.connection_key, exc) from exc
Apr 05 11:18:48 pcdane python[1663]: aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host discord.com:443 ssl:default [Name or service not kn>
Apr 05 11:18:48 pcdane python[1663]: [2023-04-05 11:18:48] [WARNING] red.main: Attempting to die as gracefully as possible...
Apr 05 11:18:48 pcdane python[1663]: [2023-04-05 11:18:48] [INFO] red.main: Shutting down from unhandled exception

Redbot version 3.4.18

@ajwsert
Copy link

ajwsert commented Jul 16, 2023

I have had a similar situation as PewnyPL, usually when there is a power outage the device running the bot comes up faster than the router so there is no connection when the service that starts the bot runs.
Even though I have set the service to wait for network (that I just found out that might not be working) the bot starts first and as result I get a crash. Here is a log of one of those (I have shortened the file paths):

[2023-07-16 09:17:14] [INFO] discord.client: logging in using static token
[2023-07-16 09:17:14] [CRITICAL] red.main: The main bot task didn't handle an exception and has crashed
Traceback (most recent call last):
  File "/..[venv path]../lib/python3.9/site-packages/aiohttp/connector.py", line 1152, in _create_direct_connection
    hosts = await asyncio.shield(host_resolved)
  File "/..[venv path]../lib/python3.9/site-packages/aiohttp/connector.py", line 874, in _resolve_host
    addrs = await self._resolver.resolve(host, port, family=self._family)
  File "/..[venv path]../lib/python3.9/site-packages/aiohttp/resolver.py", line 33, in resolve
    infos = await self._loop.getaddrinfo(
  File "uvloop/loop.pyx", line 1528, in getaddrinfo
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/..[venv path]../lib/python3.9/site-packages/redbot/__main__.py", line 469, in red_exception_handler
    red_task.result()
  File "/..[venv path]../lib/python3.9/site-packages/redbot/__main__.py", line 369, in run_bot
    await red.start(token)
  File "/..[venv path]../lib/python3.9/site-packages/redbot/core/bot.py", line 1270, in start
    await self.login(token)
  File "/..[venv path]../lib/python3.9/site-packages/discord/client.py", line 612, in login
    data = await self.http.static_login(token)
  File "/..[venv path]../lib/python3.9/site-packages/discord/http.py", line 801, in static_login
    data = await self.request(Route('GET', '/users/@me'))
  File "/..[venv path]../lib/python3.9/site-packages/discord/http.py", line 624, in request
    async with self.__session.request(method, url, **kwargs) as response:
  File "/..[venv path]../lib/python3.9/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
  File "/..[venv path]../lib/python3.9/site-packages/aiohttp/client.py", line 536, in _request
    conn = await self._connector.connect(
  File "/..[venv path]../lib/python3.9/site-packages/aiohttp/connector.py", line 540, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/..[venv path]../lib/python3.9/site-packages/aiohttp/connector.py", line 901, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "/..[venv path]../lib/python3.9/site-packages/aiohttp/connector.py", line 1166, in _create_direct_connection
    raise ClientConnectorError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host discord.com:443 ssl:default [Name or service not known]
[2023-07-16 09:17:14] [WARNING] red.main: Attempting to die as gracefully as possible...
[2023-07-16 09:17:14] [INFO] red.main: Shutting down from unhandled exception
[2023-07-16 10:44:42] [INFO] red.main: SIGTERM received. Quitting...
[2023-07-16 10:44:42] [INFO] red.main: Shutting down with exit code: 0 (SHUTDOWN)
[2023-07-16 10:44:42] [INFO] red.main: Please wait, cleaning up a bit more

Also note that the process actually terminate around 1h 30m later, despite saying "Shutting down...".

By the way, this is now a request depending on how this will be dealt with: I have set the service to not restart if the Exit code is 0 (just then it doesn't restart if I manually shut it down through [p]shutdown), however, as seen above the Exit code is 0 for this issue then preventing the service from restarting despite the crash. It could be helpful if Exit code wasn't 0 for these cases.

Red v3.5.2 by the way.

@KohGeek
Copy link

KohGeek commented Sep 30, 2023

Hi, I'd like to report that this is plaguing me too. I can also verify by manually blocking discord.com in hosts that, this issue happens even without using systemd, so systemd is not the core issue here.

I'd like to add I am using the postgres version of RedBot. Perhaps all the users above are also using Postgres? This should be a common issue judging by its presentation yet only 3 reporting has been done so far.

Tried but no effect:

  • Using Py 3.11 instead of 3.9
  • Using JSON instead of Postgres

Edit 1: I tried running it with a --no-cogs option, and it seems doesn't seem to have any effect,

Edit 2: Running the code with profiler, it seems that the redbot is stuck in the run_forever loop and any attempts to stop the code must have failed.

image

@CyberKiller
Copy link

CyberKiller commented Oct 9, 2023

I found a workaround for this by adding ExecStartPre=/bin/sh -c 'until ping -c1 discord.com; do sleep 1; done;' under [Service] in the systemd service file for the bot.
This ExecStartPre command causes the ping command to loop until it succeeds, which delays the starting of the bot until discord.com is reachable.

@yamikaitou
Copy link
Contributor

I found a workaround for this by adding ExecStartPre=/bin/sh -c 'until ping -c1 discord.com; do sleep 1; done;' under [Service] in the systemd service file for the bot. This ExecStartPre command causes the ping command to loop until it succeeds, which delays the starting of the bot until discord.com is reachable.

This message is unrelated to this bug. This bug is about the bot shutdown process, not the bot starting up before discord.com is reachable (which is a different issue that is not a bot problem)

@CyberKiller
Copy link

CyberKiller commented Oct 14, 2023

I found a workaround for this by adding ExecStartPre=/bin/sh -c 'until ping -c1 discord.com; do sleep 1; done;' under [Service] in the systemd service file for the bot. This ExecStartPre command causes the ping command to loop until it succeeds, which delays the starting of the bot until discord.com is reachable.

This message is unrelated to this bug. This bug is about the bot shutdown process, not the bot starting up before discord.com is reachable (which is a different issue that is not a bot problem)

The workaround I posted is 100% related to this bug. The bug is that the bot hangs instead of shutting down if discord.com is unreachable.
The workaround mearly prevents the bot from trying to start while discord.com isn't reachable due to network not being ready yet.

Look at the last few lines of the traceback ajwset posted

aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host discord.com:443 ssl:default [Name or service not known]
[2023-07-16 09:17:14] [WARNING] red.main: Attempting to die as gracefully as possible...
[2023-07-16 09:17:14] [INFO] red.main: Shutting down from unhandled exception
[2023-07-16 10:44:42] [INFO] red.main: SIGTERM received. Quitting...
[2023-07-16 10:44:42] [INFO] red.main: Shutting down with exit code: 0 (SHUTDOWN)
[2023-07-16 10:44:42] [INFO] red.main: Please wait, cleaning up a bit more

@Jackenmen
Copy link
Member Author

I agree that the shown scenario is one of the cases that can produce this bug. It is possible that there will need to be some additional care taken since this specific case occurs for shutdowns during startup but I would definitely to properly handle shutdowns happening during startup.

Note that a fix for this issue is unlikely to involve a reconnection strategy which I'm guessing is why Yami said it doesn't relate to this bug - the bot would just actually shutdown (with a proper exit code) rather than hang indefinitely.

As far as workarounds go, this one does help with some of the cases listed in the comments on this issue, it just won't help you resolve the underlying issue of the bot hanging on shutdown if it happens for any other reason than discord.com being unreachable while the bot starts up. Based on this issue, there are some people in the community struggling with it so I'm sure some will find this useful :)

@CyberKiller
Copy link

CyberKiller commented Oct 15, 2023

I had a little poke around debugging with visual studio code I found this is the last line (of the bots code) that gets executed before the hang: https://github.com/Cog-Creators/Red-DiscordBot/blob/7dfe24397ee40114f164c919e77a19c1be45cb5b/redbot/__main__.py#L445C9-L445C9
After Stepping over this line the bot hangs and the debugger hangs too (won't let me pause to debug it anymore). I think some asyncio thing is getting stuck with the shutdown handler.

@ajwsert
Copy link

ajwsert commented Oct 15, 2023

I found a workaround for this by adding ExecStartPre=/bin/sh -c 'until ping -c1 discord.com; do sleep 1; done;' under [Service] in the systemd service file for the bot. This ExecStartPre command causes the ping command to loop until it succeeds, which delays the starting of the bot until discord.com is reachable.

This message is unrelated to this bug. This bug is about the bot shutdown process, not the bot starting up before discord.com is reachable (which is a different issue that is not a bot problem)

Yes, while indeed it doesn't fix the bug it is related and worked as a good workaround for the issue (at least for what I was facing). Delaying the start up while domain name can't be solved helped to prevent the crash and hang at least.

@CyberKiller
Copy link

CyberKiller commented Oct 16, 2023

After some more poking around I found that if an exception isn't caught inside run_bot() and sys.exit() isn't called, the bot will hang inside the exception handler and shutdown handler.
So I created a pull request that adds a catch all exception to catch any aiohttp exceptions (or any other kind of exception) to the relevant place. I tested this as working on linux and windows.

I don't know if having a catch all is acceptable so feel free to make the catch more specific
(I already tried catching discord.HTTPException but discord.py doesn't seem to be catching aiohttp exceptions internally in this case)

Edit: I stopped the exception handler from hanging in the first place instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category: Core - Command-line Interfaces This is related to Red's CLIs (redbot, redbot-launcher, redbot-setup). Status: Needs Info Needs more info, probably from the author. Type: Bug Unexpected behavior, result, or exception. In case of PRs, it is a fix for the foregoing.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants