Add patroni_maximum_lag_on_replica variable #569
Test (dcs_type: "etcd", with_haproxy_load_balancing: true)
HAProxy config:
Test commands:
# Set recovery_min_apply_delay on replica
psql -h 10.172.0.21 -p 5432 -c "alter system set recovery_min_apply_delay='2min'"
psql -h 10.172.0.21 -p 5432 -c "select pg_reload_conf()"
# Observe the replication lag
for i in {1..600}; do psql -h 10.172.0.20 -U postgres -p 5432 -c " select now(), client_addr,pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn)) as total_lag from pg_stat_replication"; sleep 2; done
# Run check Patroni REST API for replica
for i in {1..600}; do echo $(date); curl -I "http://10.172.0.21:8008/replica?lag=100MB"; sleep 2; done
# Connect to replicas (port 5001) and check listen_addresses
for i in {1..600}; do echo $(date); psql -h 10.172.0.20 -p 5001 -U postgres -c "show listen_addresses"; sleep 2; done
# Generate data to create a lag
pgbench -h 10.172.0.20 -p 5432 -U postgres -i -s 10 postgres
Result:
We observe that while the lag exceeds 100MB (the threshold defined in patroni_maximum_lag_on_replica), replica 10.172.0.21 is removed from read traffic balancing (port 5001) and connections are routed only to the replica without high replication lag (10.172.0.22). Once the lag drops below the threshold, replica 10.172.0.21 becomes available again for balancing read-only traffic.
Test passed.
Test (dcs_type: "consul"), Consul DNS
Consul service config:
Test commands:
# Set recovery_min_apply_delay on replica
psql -h 10.172.0.21 -p 5432 -U postgres -c "alter system set recovery_min_apply_delay='2min'"
psql -h 10.172.0.21 -p 5432 -U postgres -c "select pg_reload_conf()"
# Observe the replication lag
for i in {1..600}; do psql -h 10.172.0.20 -U postgres -p 5432 -c " select now(),client_addr,pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn)) as total_lag from pg_stat_replication"; sleep 2; done
# Run check Patroni REST API for replica
for i in {1..600}; do echo $(date); curl -I "http://10.172.0.21:8008/replica?lag=100MB"; sleep 2; done
# Connect to replicas and check listen_addresses
for i in {1..600}; do echo $(date); psql -h replica.postgres-cluster.service.consul -p 6432 -U postgres -c "show listen_addresses"; sleep 2; done
# Generate data to create a lag
pgbench -h 10.172.0.20 -p 5432 -U postgres -i -s 10 postgres
Result:
We observe that while the lag exceeds 100MB (the threshold defined in patroni_maximum_lag_on_replica), replica 10.172.0.21 is removed from read traffic balancing (replica.postgres-cluster.service.consul) and connections are routed only to the replica without high replication lag (10.172.0.22). Once the lag drops below the threshold, replica 10.172.0.21 becomes available again for balancing read-only traffic.
Test passed.
Issue: zalando/patroni#1249
Introduce a new configuration variable, patroni_maximum_lag_on_replica, with a default value of "100MB". This parameter defines the maximum acceptable lag for replicas. When a replica's lag exceeds this limit, it is no longer considered for read-only traffic.
The implementation appends an optional ?lag=<max-lag> parameter to the health check for the replica and async endpoints. This excludes from load balancing any replica whose lag exceeds the maximum set by patroni_maximum_lag_on_replica.
Documentation: Patroni Health Check Endpoints
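As a rough illustration of how a load balancer interprets this health check, the sketch below maps the HTTP status returned by the replica endpoint to a routing decision. It relies on Patroni answering 200 when the node is a running replica within the requested lag and a non-200 status (typically 503) otherwise. The function names classify_status and check_replica are hypothetical helpers for this example, not part of Patroni or of this change.

```shell
#!/usr/bin/env bash

# Map the HTTP status code of GET /replica?lag=<max-lag> to a routing decision.
# 200: running replica within the lag limit. Anything else: skip this node.
classify_status() {
  case "$1" in
    200) echo "in rotation" ;;
    *)   echo "out of rotation" ;;
  esac
}

# Probe one node. Host, port and max lag are illustrative arguments;
# port 8008 and 100MB mirror the values used in the tests above.
check_replica() {
  local host="$1" port="${2:-8008}" max_lag="${3:-100MB}"
  local code
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 \
    "http://${host}:${port}/replica?lag=${max_lag}") || code="000"
  echo "${host}: $(classify_status "$code")"
}

# Usage (requires a reachable Patroni REST API):
#   check_replica 10.172.0.21 8008 100MB
```

The URL is quoted so the shell never treats the ? as a glob character, and an unreachable node is reported as out of rotation rather than aborting the check.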
With this change, read-only traffic is balanced only across replicas whose lag is within the configured threshold, which reduces the chance that clients reading through the load balancer see stale data in clusters managed by Patroni.
Note: If you have stricter requirements for replica lag, reduce the value of the 'patroni_maximum_lag_on_replica' variable or consider using synchronous replication. Conversely, if lag does not matter much to your application, increase the value of the variable.
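When choosing a value, it can help to compare a candidate threshold against the lag actually observed in bytes (for example, the output of pg_wal_lsn_diff without pg_size_pretty). The sketch below is one way to do that. It assumes 1024-based units (kB, MB, GB), which is how such sizes are usually interpreted but is not confirmed by this change, and size_to_bytes and lag_exceeds are hypothetical helper names.

```shell
#!/usr/bin/env bash

# Convert a size such as "100MB" to bytes.
# Assumption: units are 1024-based (kB = 1024 bytes, MB = 1024 kB, ...).
size_to_bytes() {
  local value="$1" number unit
  number="${value%%[!0-9]*}"   # leading digits
  unit="${value#"$number"}"    # remaining unit suffix
  case "$unit" in
    ""|B) echo "$number" ;;
    kB)   echo $(( number * 1024 )) ;;
    MB)   echo $(( number * 1024 * 1024 )) ;;
    GB)   echo $(( number * 1024 * 1024 * 1024 )) ;;
    *)    echo "unsupported unit: $unit" >&2; return 1 ;;
  esac
}

# Exit status 0 when the observed lag (in bytes) is above the threshold.
lag_exceeds() {
  local lag_bytes="$1" threshold
  threshold=$(size_to_bytes "$2") || return 2
  [ "$lag_bytes" -gt "$threshold" ]
}

# Usage:
#   lag_exceeds 209715200 100MB && echo "replica would be excluded"
```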