Describe the bug
While using the SaaS Boilerplate in production I noticed a significant inflation in the number of connections pooled by the RDS Proxy, roughly proportional to the activity of the backend. Upon further examination it turned out that the majority of those connections were stale and had been used only once, upon creation.
Unfortunately, in my case the count grows so rapidly that the connection pool saturates and DB requests randomly fail with OperationalError while waiting on a DB connection. ECS tasks tend to become unresponsive as well. This in turn causes an outage of the whole system, as the tasks become unhealthy and keep getting killed off by the load balancer.
A few key factors are at play here:
Django doesn't pool/retain connections by default
There's no CONN_MAX_AGE setting in the config. It defaults to 0, so connections aren't reused across requests, which means we aren't taking advantage of the RDS Proxy at all.
We're using the gevent worker. Unfortunately it seems that running under any kind of greenlet causes another issue: connections aren't closed even with CONN_MAX_AGE=0. This was claimed here and here, but I'm not entirely sure, since it could also just be the RDS Proxy keeping those connections around (a possible workaround is sketched right after this list).
The idle connection timeout on the proxy is 30 minutes, so it takes up to 30 minutes for the problems to subside.
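If the greenlet theory holds and Django's normal end-of-request cleanup really is being skipped, one hypothetical stop-gap (not something the boilerplate ships) would be closing stale connections from a gunicorn server hook:

# gunicorn.conf.py -- hypothetical workaround sketch, assuming the cleanup
# normally triggered by Django's request_finished signal is skipped under
# greenlets; post_request is a standard gunicorn server hook.

def post_request(worker, req, environ, resp):
    from django.db import close_old_connections

    # Closes connections that are unusable or past CONN_MAX_AGE; with
    # CONN_MAX_AGE=0 that means the connection this request just used.
    close_old_connections()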
An easy fix seems to be turning CONN_MAX_AGE on by default with a value proportionate to that 30-minute timeout, setting CONN_HEALTH_CHECKS to True, and, to be on the safe side, changing the gunicorn worker class to sync.
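For illustration, a minimal sketch of what that could look like in settings.py (the keys are standard Django settings; the environment variable names and the 600-second lifetime are assumptions, not the boilerplate's actual values):

import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", "app"),
        "USER": os.environ.get("DB_USER", "app"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", ""),  # the RDS Proxy endpoint
        "PORT": os.environ.get("DB_PORT", "5432"),
        # Reuse connections across requests; keep the lifetime well under
        # the proxy's 30-minute idle timeout so Django recycles them first.
        "CONN_MAX_AGE": 600,
        # Django 4.1+: verify a pooled connection is still usable before
        # handing it to a request instead of failing mid-request.
        "CONN_HEALTH_CHECKS": True,
    }
}

The gunicorn side is then just worker_class = "sync" in gunicorn.conf.py (or --worker-class sync on the command line).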
Steps to reproduce
This can easily be observed by inducing some load and examining the statistics a bit later:
SELECT
datname,
usename,
pid,
state,
state_change,
to_char(NOW() - query_start, 'HH24:MI:SS') AS time_idle,
to_char(NOW() - backend_start, 'HH24:MI:SS') AS time_alive
FROM pg_stat_activity
WHERE state = 'idle' AND backend_type = 'client backend'
;
It's immediately obvious that time_idle is about equal to time_alive, which is far from optimal. Moreover, by running
SELECT ROUND(EXTRACT(EPOCH FROM NOW() - query_start) / 60 / 10) * 10 AS time_idle_mins, COUNT(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY ROUND(EXTRACT(EPOCH FROM NOW() - query_start) / 60 / 10) * 10
ORDER BY time_idle_mins
;
on the live system, we can see the distribution of connections by their idle time, which seems to further back the theory.
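As for inducing the load itself, any concurrent traffic against a DB-backed endpoint will do; a throwaway sketch, where the URL is a placeholder rather than an actual boilerplate endpoint:

import concurrent.futures
import urllib.request

URL = "https://example.com/api/health"  # placeholder DB-backed endpoint

def hit(_):
    # Each request forces the backend to grab a DB connection.
    with urllib.request.urlopen(URL) as resp:
        return resp.status

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    statuses = list(pool.map(hit, range(500)))

print(f"{statuses.count(200)}/{len(statuses)} requests returned 200")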
System Info
Pre < 2.0.0.
Logs
No response