feat: prevent failovers when disk space is exhausted #4404
❗ By default, the pull request is configured to backport to all release branches.
I tested this using Longhorn in a Fedora VM, but any storage that enforces the PV capacity will do the trick. To test the patch, you need to exhaust your WAL storage. To keep things easy, I used:

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: cluster-example
    spec:
      instances: 1
      storage:
        size: 256Mi

And then:

    CREATE TABLE storage_area (t text);
    -- repeat the following query 20-30 times (you need to be fast!)
    INSERT INTO storage_area (t) (select repeat('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do', 5*1024*1024));

With the predefined WAL settings, you'll run out of WAL disk space before you run out of space for PGDATA.
/test limit=local
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9110497781
About to start adding documentation and going over the E2E, but left a few comments on the implementation bits.
IMO the "WALDisk" nomenclature could get confusing as it seems to imply there is a separate WAL volume, which may or may not be the case.
I still think it's worth renaming the ensureSufficientDiskSpace method, but otherwise I give this an enthusiastic 👍
Add an e2e to test the recovery in case a primary runs out of disk space.
Signed-off-by: Francesco Canovai <[email protected]>
Signed-off-by: Leonardo Cecchi <[email protected]>
Signed-off-by: Armando Ruocco <[email protected]>
Signed-off-by: Jaime Silvela <[email protected]>
Co-authored-by: Jaime Silvela <[email protected]>
Signed-off-by: Gabriele Bartolini <[email protected]>
We removed the […]. The operator sets this value from the exit code of the instance Pods. The instance Pods check whether they have enough WAL disk space before starting PostgreSQL and after PostgreSQL exits with an error condition.
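For readers following along, here is a minimal sketch of what such a check could look like on the instance side. It is not the code in this patch: the probe strategy (writing a temporary file the size of one WAL segment), the `hasRoomFor` helper, and the exit code value `4` are all assumptions made for illustration. The only fact relied on is that PostgreSQL's default WAL segment size is 16 MiB.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// Hypothetical values for this sketch; they are not taken from the patch.
const (
	walSegmentSize           = 16 * 1024 * 1024 // default PostgreSQL WAL segment size
	exitCodeMissingDiskSpace = 4                // assumed exit code reported to the operator
)

// outOfSpace reports whether err means the device has no space left.
func outOfSpace(err error) bool {
	return errors.Is(err, syscall.ENOSPC)
}

// hasRoomFor reports whether a file of the given size can be written in dir.
// It probes by writing a temporary file, which is removed before returning.
func hasRoomFor(dir string, size int64) (bool, error) {
	f, err := os.CreateTemp(dir, "disk-probe-*")
	if err != nil {
		if outOfSpace(err) {
			return false, nil
		}
		return false, err
	}
	// Deferred calls run in reverse order: close first, then remove.
	defer os.Remove(f.Name())
	defer f.Close()

	chunk := make([]byte, 1024*1024)
	for written := int64(0); written < size; {
		n := int64(len(chunk))
		if remaining := size - written; remaining < n {
			n = remaining
		}
		if _, err := f.Write(chunk[:n]); err != nil {
			if outOfSpace(err) {
				return false, nil
			}
			return false, err
		}
		written += n
	}
	// Flush to disk so that a delayed "no space left" error is not missed.
	if err := f.Sync(); err != nil {
		if outOfSpace(err) {
			return false, nil
		}
		return false, err
	}
	return true, nil
}

func main() {
	walDir := os.TempDir() // stand-in for the real WAL directory
	ok, err := hasRoomFor(walDir, walSegmentSize)
	if err != nil {
		fmt.Fprintln(os.Stderr, "disk probe failed:", err)
		os.Exit(1)
	}
	if !ok {
		// A dedicated exit code lets the operator tell this case apart
		// from a generic crash.
		os.Exit(exitCodeMissingDiskSpace)
	}
	fmt.Println("enough space for one WAL segment")
}
```

Running the same probe both before starting PostgreSQL and after it exits with an error would cover the two moments mentioned above.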
PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files.
The operator did not recognize this condition and, since the primary had failed, performed a failover to the most advanced replica. That failover does not fix the underlying issue.
Only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster.
This patch makes the instance manager recognize this condition and report it to the operator. Upon detecting it, the operator will not trigger a switchover and will instead set a phase describing the situation.
After the PVCs are resized, the cluster will resume working correctly.
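The decision described above can be sketched roughly as follows. This is an illustration only, not the operator's reconciliation code: the `instanceStatus` type, the `decide` function, the action names, and the exit code value are all hypothetical.

```go
package main

import "fmt"

// Hypothetical exit code the instance Pod is assumed to report when it
// cannot find enough WAL disk space.
const exitCodeMissingDiskSpace = 4

// action is what the operator would do about the primary in this sketch.
type action string

const (
	actionNone          action = "none"
	actionFailover      action = "failover"
	actionWaitForResize action = "wait-for-disk-resize"
)

// instanceStatus is a simplified, hypothetical view of the primary Pod.
type instanceStatus struct {
	Name         string
	Healthy      bool
	LastExitCode int
}

// decide picks the action for a primary. A primary that stopped because
// its WAL disk is full is not failed over, since promoting a replica would
// not free any space; the cluster waits for the user to resize the PVCs.
func decide(primary instanceStatus) action {
	if primary.Healthy {
		return actionNone
	}
	if primary.LastExitCode == exitCodeMissingDiskSpace {
		return actionWaitForResize
	}
	return actionFailover
}

func main() {
	cases := []instanceStatus{
		{Name: "cluster-example-1", Healthy: true},
		{Name: "cluster-example-1", Healthy: false, LastExitCode: 1},
		{Name: "cluster-example-1", Healthy: false, LastExitCode: exitCodeMissingDiskSpace},
	}
	for _, c := range cases {
		fmt.Printf("healthy=%v exit=%d -> %s\n", c.Healthy, c.LastExitCode, decide(c))
	}
}
```

Keeping the disk-space case separate from the generic failure path is what lets the cluster report a dedicated phase instead of promoting a replica.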
Closes: #4521