ETCD backup script will delete other files when there is no space left on device #1625

Open
24sama opened this issue Nov 22, 2022 · 1 comment
Labels
bug Something isn't working

Comments

Collaborator

24sama commented Nov 22, 2022

Which version of KubeKey has the issue?

v3.0.1, v3.0.0, v2.3.0, v2.2.2, v2.2.1, v2.2.0, v2.1.1, v2.1.0, v2.0.0, v1.2.1, v1.2.0, v1.1.1, v1.1.0, v1.0.1

What is your os environment?

none

KubeKey config file

No response

A clear and concise description of what happened.

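In an extreme case, the kk etcd backup script may erroneously delete files under the / directory. This happens when the node has no space left to create the backup directory (i.e., not even the 4096 bytes a new directory needs): mkdir fails, the following cd fails as well, and because the cleanup command is chained to cd with ; it still runs, deleting files in whatever directory the script was started from.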
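
The difference between ; and && after a failed cd can be reproduced safely in a scratch directory. The following is a minimal sketch for illustration only, not part of the KubeKey script: the file names victim-a and victim-b and the missing path are made up, and everything happens inside a temporary directory.

```shell
#!/bin/bash
# Demonstration only: all files live in a throwaway temporary directory.
workdir="$(mktemp -d)"
cd "$workdir"
touch victim-a victim-b

# Hypothetical directory that does not exist, standing in for a backup
# directory that mkdir failed to create.
missing="$workdir/does-not-exist"

# Unsafe form: cd fails, but ';' runs the next command anyway,
# so ls operates on the current directory ($workdir).
unsafe_listing="$(cd "$missing" 2>/dev/null; ls)"

# Safe form: '&&' skips ls when cd fails; '|| true' only keeps the
# demonstration going after the expected failure.
safe_listing="$(cd "$missing" 2>/dev/null && ls || true)"

echo "unsafe form saw: $unsafe_listing"
echo "safe form saw:   $safe_listing"
```

In the backup script the command after cd is a pipeline that generates rm -rf commands, so the unsafe form deletes entries in the starting directory instead of merely listing them.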

We suggest upgrading to the latest version:

Binary downloads of the latest kk can be found on the Releases page.
Alternatively, download the latest kk with the following command:

curl -sSL https://get-kk.kubesphere.io | sh -

For an existing cluster installed with the KubeKey command (kk), here is a workaround:

  1. open the script for editing:
$ vi /usr/local/bin/kube-scripts/etcd-backup.sh
  2. modify the script as follows:

    1. add set -o errexit, set -o nounset, and set -o pipefail at the beginning of the script
    2. replace the ; after the cd command with && in the last line

    Here is an example:

#!/bin/bash

set -o errexit
set -o nounset
set -o pipefail

ETCDCTL_PATH='/usr/local/bin/etcdctl'
ENDPOINTS='https://192.168.100.3:2379'
ETCD_DATA_DIR="/var/lib/etcd"
BACKUP_DIR="/var/backups/kube_etcd/etcd-$(date +%Y-%m-%d-%H-%M-%S)"
KEEPBACKUPNUMBER='6'
ETCDBACKUPSCIPT='/usr/local/bin/kube-scripts'

ETCDCTL_CERT="/etc/ssl/etcd/ssl/admin-node1.pem"
ETCDCTL_KEY="/etc/ssl/etcd/ssl/admin-node1-key.pem"
ETCDCTL_CA_FILE="/etc/ssl/etcd/ssl/ca.pem"

[ ! -d $BACKUP_DIR ] && mkdir -p $BACKUP_DIR

export ETCDCTL_API=2;$ETCDCTL_PATH backup --data-dir $ETCD_DATA_DIR --backup-dir $BACKUP_DIR

sleep 3

{
export ETCDCTL_API=3;$ETCDCTL_PATH --endpoints="$ENDPOINTS" snapshot save $BACKUP_DIR/snapshot.db \
                                   --cacert="$ETCDCTL_CA_FILE" \
                                   --cert="$ETCDCTL_CERT" \
                                   --key="$ETCDCTL_KEY"
} > /dev/null 

sleep 3

cd $BACKUP_DIR/../ && ls -lt |awk '{if(NR > '$KEEPBACKUPNUMBER'){print "rm -rf "$9}}'|sh
  3. reload the new script:
$ systemctl daemon-reload
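
Beyond the two-line fix above, the cleanup step could be made independent of the current directory altogether. The following is a minimal sketch under stated assumptions, not the script KubeKey ships: prune_backups is a hypothetical helper, it assumes backup directories are named etcd-<timestamp> as in BACKUP_DIR above, and it orders backups by name rather than by modification time as ls -lt does. It also relies on the -mindepth and -maxdepth options of find, which GNU and BSD find provide.

```shell
#!/bin/bash
# Hypothetical helper: delete all but the newest $2 backup directories
# directly under the absolute path $1.
prune_backups() {
    backup_root="$1"
    keep="$2"

    # Refuse to work on an empty, root, or relative path.
    case "$backup_root" in
        ""|/) echo "refusing to prune '$backup_root'" >&2; return 1 ;;
        /*)   ;;
        *)    echo "backup root must be an absolute path" >&2; return 1 ;;
    esac
    if [ ! -d "$backup_root" ]; then
        echo "backup root '$backup_root' does not exist" >&2
        return 1
    fi

    # Directory names embed a sortable timestamp, so reverse lexical
    # order lists the newest backup first. Only etcd-* directories
    # directly under the backup root are ever candidates for removal.
    find "$backup_root" -mindepth 1 -maxdepth 1 -type d -name 'etcd-*' \
        | sort -r \
        | tail -n +"$((keep + 1))" \
        | while IFS= read -r old; do
              rm -rf -- "$old"
          done
}
```

With this shape, a call such as prune_backups /var/backups/kube_etcd 6 never depends on a preceding cd, and a failed mkdir can at worst leave old backups in place.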

Relevant log output

No response

Additional information

No response

@24sama 24sama added the bug Something isn't working label Nov 22, 2022
@24sama 24sama pinned this issue Nov 22, 2022

zjuwyz commented Apr 17, 2023

Unfortunately, we have run into this bug. Our root partition is mounted with the option 'errors=remount-ro', and that remount was accidentally triggered. As a result, mkdir -p failed, cd failed, and the contents of / were deleted.
Our data is shared from a NAS that is mounted under /, and all of it is GONE.
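
A read-only remount like the one described above can be caught before any cleanup runs. The following is a minimal sketch, assuming a hypothetical helper named ensure_backup_dir that is not part of the KubeKey script; it treats both a failed mkdir -p and a failed test write as fatal.

```shell
#!/bin/bash
# Hypothetical helper: make sure the backup directory $1 exists and is
# writable, and return non-zero otherwise.
ensure_backup_dir() {
    dir="$1"

    if ! mkdir -p -- "$dir" 2>/dev/null; then
        echo "cannot create '$dir'; aborting before any cleanup" >&2
        return 1
    fi

    # mkdir -p also succeeds when the directory already exists on a
    # read-only filesystem, so probe with a real write. The subshell
    # keeps a failed redirection from terminating the calling shell.
    probe="$dir/.write-probe.$$"
    if ! ( : > "$probe" ) 2>/dev/null; then
        echo "'$dir' is not writable; aborting before any cleanup" >&2
        return 1
    fi
    rm -f -- "$probe"
}
```

A caller would run something like ensure_backup_dir "$BACKUP_DIR" || exit 1 before taking the snapshot, so a full or read-only disk stops the script instead of reaching the rm -rf step.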

etcdctl version is 3.4.13.

