Connection pooling doesn't work as intended on actual infra #392

Open
kstronka opened this issue Sep 20, 2023 · 0 comments
Labels
bug Something isn't working
Milestone
2.7.0
Comments

@kstronka
Contributor

Describe the bug

While using the SaaS Boilerplate in production, I noticed a significant inflation in the number of connections pooled by the RDS Proxy, roughly proportional to the activity of the backend. Upon further examination, it turned out that the majority of those connections were stale and had been used only once, upon creation.

Unfortunately, in my case the count goes up so rapidly that it saturates the connection pool, and DB requests randomly fail with OperationalError while waiting for a DB connection. ECS tasks tend to become unresponsive as well. This in turn causes an outage of the whole system, as the tasks become unhealthy and keep getting killed off by the load balancer.

A few key factors are at play here:

  • Django doesn't pool/retain connections by default
  • There's no CONN_MAX_AGE setting in the config. It defaults to 0, so connections aren't persisted between requests, which means we aren't taking advantage of the RDS Proxy at all.
  • We're using the gevent worker. Unfortunately, it seems that any type of greenlet causes another issue: connections aren't closed even if CONN_MAX_AGE=0. This was claimed here, and here, but I'm not entirely sure, as it could also just be that the RDS Proxy maintains those connections.
  • The idle connection timeout is 30 minutes, so it can take as long as 30 minutes for the problems to subside.

An easy fix seems to be enabling CONN_MAX_AGE by default with a value proportionate to that 30-minute timeout, setting CONN_HEALTH_CHECKS to True, and, to be on the safe side, changing the gunicorn worker class to sync.
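
For reference, a minimal sketch of what that change could look like in the Django settings. The exact shape of the DATABASES dict and the environment variable names here are illustrative assumptions, not copied from the boilerplate:

# settings.py (sketch)
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", ""),
        "USER": os.environ.get("DB_USER", ""),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", ""),  # RDS Proxy endpoint
        "PORT": os.environ.get("DB_PORT", "5432"),
        # Reuse connections between requests instead of closing them after
        # every request (the CONN_MAX_AGE=0 default); keep the value well
        # below the proxy's 30-minute idle timeout.
        "CONN_MAX_AGE": 600,
        # Django 4.1+ only: check that a persistent connection is still
        # usable before reusing it, and discard it otherwise.
        "CONN_HEALTH_CHECKS": True,
    }
}

The gunicorn side is a one-line change as well: the worker class is controlled by the worker_class setting in the gunicorn config (or -k on the command line), so it comes down to worker_class = "sync".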

Steps to reproduce

This can easily be observed by inducing some load and then examining the statistics a bit later:

SELECT
    datname,
    usename,
    pid,
    state,
    state_change,
    to_char(NOW() - query_start, 'HH24:MI:SS') AS time_idle,
    to_char(NOW() - backend_start, 'HH24:MI:SS') AS time_alive
FROM pg_stat_activity
WHERE state = 'idle' AND backend_type = 'client backend'
;

It's immediately obvious that time_idle is roughly equal to time_alive, which is far from optimal. Moreover, by running

SELECT
    ROUND(EXTRACT(EPOCH FROM NOW() - query_start) / 60 / 10) * 10 AS time_idle_mins,
    COUNT(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY ROUND(EXTRACT(EPOCH FROM NOW() - query_start) / 60 / 10) * 10
ORDER BY time_idle_mins
;

on the live system, we can see the distribution of connections by their idle time, which seems to further support the theory.

System Info

Versions prior to 2.0.0.

Logs

No response

Validations

@mkleszcz mkleszcz added this to the 2.1.2 milestone Sep 22, 2023
@pziemkowski pziemkowski added the bug Something isn't working label Sep 29, 2023
@mkleszcz mkleszcz modified the milestones: 2.6.0, 2.6.1 Mar 5, 2024
@mkleszcz mkleszcz modified the milestones: 2.6.1, 2.7.0 May 10, 2024