Connection pooling doesn't work as intended on actual infra #392

Open
kstronka opened this issue Sep 20, 2023 · 0 comments
Labels
bug Something isn't working
Milestone
2.7.0
Comments

@kstronka
Contributor

Describe the bug

While using the SaaS Boilerplate in production, I noticed a significant inflation in the number of connections pooled by the RDS Proxy, roughly proportional to the activity of the backend. Upon further examination, it turned out that the majority of those connections were stale and had been used only once, upon creation.

Unfortunately, in my case the count goes up so rapidly that it saturates the connection pool, and DB requests randomly fail with OperationalError while waiting for a DB connection. ECS tasks tend to become unresponsive as well. This in turn causes an outage of the whole system, as the tasks become unhealthy and keep getting killed off by the load balancer.

A few key factors are at play here:

  • Django doesn't pool/retain connections by default
  • There's no CONN_MAX_AGE setting in the config. It defaults to 0, so connections aren't persisted between requests, which means we aren't taking advantage of the RDS Proxy at all.
  • We're using the gevent worker. Unfortunately, it seems that any type of greenlet causes another issue: connections aren't closed even if CONN_MAX_AGE=0. This was claimed here, and here, but I'm not entirely sure, as it could also just be that the RDS Proxy maintains those connections.
  • The idle connection timeout is 30 minutes, so it can take as long as 30 minutes for the problems to subside.

An easy fix seems to be enabling CONN_MAX_AGE by default with a value proportionate to that 30-minute timeout, setting CONN_HEALTH_CHECKS to True, and, to be on the safe side, changing the gunicorn worker class to sync.
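
For reference, a minimal sketch of what that change could look like in the Django settings. The exact shape of the DATABASES dict and the environment variable names here are illustrative assumptions, not copied from the boilerplate:

# settings.py (sketch)
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", ""),
        "USER": os.environ.get("DB_USER", ""),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", ""),  # RDS Proxy endpoint
        "PORT": os.environ.get("DB_PORT", "5432"),
        # Reuse connections between requests instead of closing them after
        # every request (the CONN_MAX_AGE=0 default); keep the value well
        # below the proxy's 30-minute idle timeout.
        "CONN_MAX_AGE": 600,
        # Django 4.1+ only: check that a persistent connection is still
        # usable before reusing it, and discard it otherwise.
        "CONN_HEALTH_CHECKS": True,
    }
}

The gunicorn side is a one-line change as well: the worker class is controlled by the worker_class setting in the gunicorn config (or -k on the command line), so it comes down to worker_class = "sync".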

Steps to reproduce

This can easily be observed by inducing some load and then examining the statistics a bit later:

SELECT
    datname,
    usename,
    pid,
    state,
    state_change,
    to_char(NOW() - query_start, 'HH24:MI:SS') AS time_idle,
    to_char(NOW() - backend_start, 'HH24:MI:SS') AS time_alive
FROM pg_stat_activity
WHERE state = 'idle' AND backend_type = 'client backend'
;

It's immediately obvious that time_idle is roughly equal to time_alive, which is far from optimal. Moreover, by running

SELECT
    ROUND(EXTRACT(EPOCH FROM NOW() - query_start) / 60 / 10) * 10 AS time_idle_mins,
    COUNT(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY ROUND(EXTRACT(EPOCH FROM NOW() - query_start) / 60 / 10) * 10
ORDER BY time_idle_mins
;

on the live system, we can see the distribution of connections by their idle time, which seems to further support the theory.

System Info

Versions prior to 2.0.0.

Logs

No response

Validations

@mkleszcz mkleszcz added this to the 2.1.2 milestone Sep 22, 2023
@pziemkowski pziemkowski added the bug Something isn't working label Sep 29, 2023
@mkleszcz mkleszcz modified the milestones: 2.6.0, 2.6.1 Mar 5, 2024
@mkleszcz mkleszcz modified the milestones: 2.6.1, 2.7.0 May 10, 2024