feat: prevent failovers when disk space is exhausted #4404

leonardoce · 2024-04-29T15:46:43Z

PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files.

The operator did not recognize this condition and, since the primary had failed, performed a failover to the most advanced replica. That failover does not fix the underlying issue.

Only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster.

This patch makes the instance manager recognize this condition and report it to the operator. Upon detecting it, the operator will not trigger a switchover and will instead set a phase describing the situation.

After the PVCs are resized, the cluster will resume working correctly.
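
As a rough illustration of the decision described above, here is a minimal sketch in Go. It is not the code in this patch: the `instance` and `decision` types, the `OutOfWALSpace` flag, the `decide` function, and the phase strings are all hypothetical names chosen for the example, and it assumes only the primary's state drives the decision.

```go
package main

import "fmt"

// Illustrative phase names; the actual strings used by the operator are
// not quoted in this PR description.
const (
	phaseHealthy     = "healthy"
	phaseFailingOver = "failing over"
	phaseNoDiskSpace = "not enough disk space"
)

// instance is a hypothetical, simplified view of what an instance
// manager reports to the operator.
type instance struct {
	Name          string
	Primary       bool
	Running       bool
	OutOfWALSpace bool // assumed flag: PostgreSQL stopped because WAL storage is full
}

// decision is what the (hypothetical) reconciler would do next.
type decision struct {
	Failover bool
	Phase    string
}

// decide skips the failover when the primary is down because its WAL
// storage is full, since a failover would not fix the underlying issue.
func decide(instances []instance) decision {
	for _, i := range instances {
		if !i.Primary || i.Running {
			continue
		}
		if i.OutOfWALSpace {
			return decision{Failover: false, Phase: phaseNoDiskSpace}
		}
		return decision{Failover: true, Phase: phaseFailingOver}
	}
	return decision{Failover: false, Phase: phaseHealthy}
}

func main() {
	cluster := []instance{
		{Name: "cluster-example-1", Primary: true, Running: false, OutOfWALSpace: true},
		{Name: "cluster-example-2", Running: true},
	}
	fmt.Printf("%+v\n", decide(cluster))
}
```

The point of the sketch is only the ordering of the checks: the disk-space condition is evaluated before the generic "primary is down" path, so the failover branch is never reached while the disk is full.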

Closes: #4521

github-actions · 2024-04-29T15:46:58Z

❗ By default, the pull request is configured to backport to all release branches.

To stop backporting this PR, remove the label: backport-requested ◀️ or add the label 'do not backport'
To stop backporting this PR to a specific release branch, remove that branch's label: release-x.y

leonardoce · 2024-05-13T10:10:05Z

I tested this using Longhorn in a Fedora VM, but any storage enforcing the PV capacity will do the trick.

To test the patch, you need to exhaust your WAL storage. To keep things simple, I used:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 1

  storage:
    size: 256Mi

And then:

CREATE TABLE storage_area (t text);

-- repeat the following query 20-30 times (you need to be fast!)
INSERT INTO storage_area (t) (select repeat('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do', 5*1024*1024));

With the predefined WAL settings, you'll run out of WAL disk space before you run out of space for PGDATA.

armru · 2024-05-16T10:13:57Z

/test limit=local

github-actions · 2024-05-16T10:14:12Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9110497781

jsilvela

About to start adding documentation and going over the E2E, but left a few comments on the implementation bits.
IMO the "WALDisk" nomenclature could get confusing as it seems to imply there is a separate WAL volume, which may or may not be the case.

Files commented on: internal/cmd/manager/instance/run/lifecycle/run.go, pkg/fileutils/directory.go, pkg/management/postgres/instance.go, pkg/utils/fencing.go, controllers/cluster_controller.go

jsilvela

I still think it's worth renaming the ensureSufficientDiskSpace method, but otherwise give this an enthusiastic 👍

Files commented on: docs/src/instance_manager.md, docs/src/troubleshooting.md

leonardoce · 2024-06-04T08:26:51Z

We removed the WALSpaceAvailable field because we're worried that it could be misunderstood and used as a metric in place of proper monitoring infrastructure.

The operator sets this value from the exit code of the instance Pods. The instance Pods check whether they have enough WAL disk space before starting PostgreSQL and after PostgreSQL exits with an error condition.
This is not a way to monitor disk space.
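
To make the exit-code mechanism concrete, here is a minimal sketch in Go of what such a check could look like. It is an illustration under assumptions, not the patch itself: `hasSpaceFor`, `exitCodeFor`, and the value of `exitCodeNoWALSpace` are hypothetical, the probe uses `syscall.Statfs` (Unix-like systems only) rather than whatever the real disk probe does, and the threshold of one 16 MiB segment follows the "reduce required space to a single wal" commit combined with PostgreSQL's default segment size.

```go
package main

import (
	"fmt"
	"syscall"
)

const (
	// Default size of a PostgreSQL WAL segment (16 MiB).
	walSegmentSize uint64 = 16 * 1024 * 1024
	// Hypothetical exit code reserved for "no room left for WAL".
	exitCodeNoWALSpace = 4
)

// hasSpaceFor reports whether the filesystem holding dir has at least
// `required` bytes available to unprivileged users.
func hasSpaceFor(dir string, required uint64) (bool, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(dir, &st); err != nil {
		return false, err
	}
	free := uint64(st.Bavail) * uint64(st.Bsize)
	return free >= required, nil
}

// exitCodeFor maps the outcome of a PostgreSQL run to the exit code of
// the instance process: the dedicated code is used only when PostgreSQL
// failed and the WAL directory has no room for another segment.
func exitCodeFor(postgresFailed, walHasSpace bool) int {
	if !postgresFailed {
		return 0
	}
	if !walHasSpace {
		return exitCodeNoWALSpace
	}
	return 1
}

func main() {
	walDir := "." // stand-in for the real WAL directory

	ok, err := hasSpaceFor(walDir, walSegmentSize)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	if !ok {
		fmt.Println("would refuse to start PostgreSQL, exit code", exitCodeNoWALSpace)
		return
	}
	fmt.Println("room for one WAL segment; PostgreSQL would be started")
}
```

Because the check runs only at startup and after a failed exit, it says nothing about how full the disk is while PostgreSQL is running, which is why it cannot replace monitoring.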

leonardoce · 2024-06-04T08:29:56Z

Full E2E run: https://github.com/EnterpriseDB/cloudnative-pg/actions/runs/9363810554

github-actions bot added the backport-requested ◀️, release-1.21, release-1.22, and release-1.23 labels Apr 29, 2024

leonardoce force-pushed the dev/space branch 3 times, most recently from 4012706 to 126566d May 13, 2024 09:54

leonardoce marked this pull request as ready for review May 13, 2024 10:02

leonardoce requested a review from a team as a code owner May 13, 2024 10:02

gbartolini reviewed May 13, 2024: controllers/cluster_controller.go

gbartolini reviewed May 13, 2024: pkg/management/postgres/instance.go

armru reviewed May 15, 2024: controllers/cluster_controller.go

armru force-pushed the dev/space branch 2 times, most recently from 9b2cea6 to e625d8f May 15, 2024 15:25

armru requested review from jsilvela, NiccoloFei and litaocdl as code owners May 16, 2024 10:04

github-actions bot added the ok to merge 👌 label May 16, 2024

armru approved these changes May 20, 2024

jsilvela reviewed May 20, 2024: controllers/cluster_controller.go

leonardoce force-pushed the dev/space branch 2 times, most recently from 172cfe7 to 3cdc43a May 21, 2024 09:47

jsilvela approved these changes May 21, 2024

jsilvela reviewed May 21, 2024: docs/src/instance_manager.md

jsilvela reviewed May 21, 2024: docs/src/troubleshooting.md

fcanovai and others added 22 commits June 4, 2024 09:31

test: out of disk space recovery scenario

5a5e25d

Add an e2e to test the recovery in case a primary runs out of disk space. Signed-off-by: Francesco Canovai <[email protected]> Signed-off-by: Leonardo Cecchi <[email protected]>

review: bulk fencing and noWalDiskSpace status

27485eb

Signed-off-by: Armando Ruocco <[email protected]>

chore: more structured approach to size probing

e60b9f2

Signed-off-by: Armando Ruocco <[email protected]>

chore: rename size_probe -> directory

8f2141e

Signed-off-by: Armando Ruocco <[email protected]>

docs: add top-level documentation

e8d5b26

Signed-off-by: Jaime Silvela <[email protected]>

docs: commas

d302b09

Signed-off-by: Jaime Silvela <[email protected]>

chore: fix grammar in pkg/fileutils/directory.go

2a5e67d

Co-authored-by: Jaime Silvela <[email protected]> Signed-off-by: Leonardo Cecchi <[email protected]>

chore: fix grammar in pkg/fileutils/directory.go

99eedca

Co-authored-by: Jaime Silvela <[email protected]> Signed-off-by: Leonardo Cecchi <[email protected]>

chore: fix pkg/utils/fencing.go

4800cf9

Co-authored-by: Jaime Silvela <[email protected]> Signed-off-by: Leonardo Cecchi <[email protected]>

chore: address Gabriele's comments

c4fce0c

Signed-off-by: Leonardo Cecchi <[email protected]>

chore: address Jaime's comments

109e02d

Signed-off-by: Leonardo Cecchi <[email protected]>

chore: improve naming

4b03c03

Signed-off-by: Leonardo Cecchi <[email protected]>

review: clarify documentation

b022712

Signed-off-by: Jaime Silvela <[email protected]>

Update docs/src/instance_manager.md

61d0b76

Signed-off-by: Jaime Silvela <[email protected]>

Update docs/src/troubleshooting.md

1593a4b

Signed-off-by: Jaime Silvela <[email protected]>

chore: directory vs diskprobe

5040b32

Signed-off-by: Leonardo Cecchi <[email protected]>

chore: rename ensureSufficientDiskSpace to ensureNoFailoverOnFullDisk to

8a859fc

Signed-off-by: Leonardo Cecchi <[email protected]>

docs: cosmetic changes

3cdbb32

Signed-off-by: Gabriele Bartolini <[email protected]>

feat: implementation using exit codes and no fencing

71fdc3c

Signed-off-by: Leonardo Cecchi <[email protected]>

fix: reduce required space to a single wal

941fab7

docs: improve documentation

20a7a33

chore: remove WALSpaceAvailable field

957d3eb

Signed-off-by: Leonardo Cecchi <[email protected]>

leonardoce force-pushed the dev/space branch from 20868d7 to 957d3eb June 4, 2024 08:22

leonardoce added the do not backport label and removed the backport-requested ◀️, release-1.21, release-1.22, and release-1.23 labels Jun 4, 2024
