etcd snapshot cannot run successfully #44

Open
damuji8 opened this issue Dec 21, 2023 · 6 comments

damuji8 commented Dec 21, 2023

I'm using milvus helm 4.1.9; the etcd image is 3.5.5-r2.
In this image, /opt/bitnami/scripts/etcd/snapshot.sh relies on this function in /opt/bitnami/scripts/libetcd.sh:

etcdctl_get_endpoints() {
    echo "$ETCD_INITIAL_CLUSTER" | sed 's/^[^=]\+=http/http/g' | sed 's/,[^=]\+=/,/g'
}

I need to add the env var ETCD_INITIAL_CLUSTER to the cronjob.

Without this env var, the snapshot fails with the error "all etcd endpoints are unhealthy!".
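To make the dependency concrete, here is a minimal sketch of what the one-line function produces. The ETCD_INITIAL_CLUSTER value is a made-up sample, and the sketch uses `sed -E` so that `+` is treated as a quantifier; the image's own script may spell the expression differently.

```shell
#!/usr/bin/env bash
# Hypothetical sample value; real member names and hosts will differ.
ETCD_INITIAL_CLUSTER="etcd-0=http://etcd-0.etcd-headless:2380,etcd-1=http://etcd-1.etcd-headless:2380"

etcdctl_get_endpoints() {
    # Drop the leading "NAME=" and every following ",NAME=" prefix,
    # leaving a comma-separated list of URLs.
    echo "$ETCD_INITIAL_CLUSTER" | sed -E 's/^[^=]+=http/http/; s/,[^=]+=/,/g'
}

etcdctl_get_endpoints
# prints: http://etcd-0.etcd-headless:2380,http://etcd-1.etcd-headless:2380
```

When ETCD_INITIAL_CLUSTER is unset or empty, the function prints an empty string, so the snapshot job has no endpoint list to work with.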

In the Bitnami image etcd:3.5.5-debian-11-r23, /opt/bitnami/scripts/libetcd.sh defines:
etcdctl_get_endpoints() {
local only_others=${1:-false}
local -a endpoints=()
local host domain port

ip_has_valid_hostname() {
    local ip="${1:?ip is required}"
    local parent_domain="${2:?parent_domain is required}"

    # 'getent hosts $ip' can return hostnames in 2 different formats:
    #     POD_NAME.HEADLESS_SVC_DOMAIN.NAMESPACE.svc.cluster.local (using headless service domain)
    #     10-237-136-79.SVC_DOMAIN.NAMESPACE.svc.cluster.local (using POD's IP and service domain)
    # We need to discard the latter to avoid issues when TLS verification is enabled.
    [[ "$(getent hosts "$ip")" = *"$parent_domain"* ]] && return 0
    return 1
}

hostname_has_ips() {
    local hostname="${1:?hostname is required}"
    [[ "$(getent ahosts "$hostname")" != "" ]] && return 0
    return 1
}

# This piece of code assumes this code is executed on a K8s environment
# where etcd members are part of a statefulset that uses a headless service
# to create a unique FQDN per member. Under these circumstances, the
# ETCD_ADVERTISE_CLIENT_URLS env. variable is created as follows:
#   SCHEME://POD_NAME.HEADLESS_SVC_DOMAIN:CLIENT_PORT,SCHEME://SVC_DOMAIN:SVC_CLIENT_PORT
#
# Assuming this, we can extract the HEADLESS_SVC_DOMAIN and obtain
# every available endpoint
read -r -a advertised_array <<<"$(tr ',;' ' ' <<<"$ETCD_ADVERTISE_CLIENT_URLS")"
host="$(parse_uri "${advertised_array[0]}" "host")"
port="$(parse_uri "${advertised_array[0]}" "port")"
domain="${host#"${ETCD_NAME}."}"
# When ETCD_CLUSTER_DOMAIN is set, we use that value instead of extracting
# it from ETCD_ADVERTISE_CLIENT_URLS
! is_empty_value "$ETCD_CLUSTER_DOMAIN" && domain="$ETCD_CLUSTER_DOMAIN"
# Depending on the K8s distro & the DNS plugin, it might need
# a few seconds to associate the POD(s) IP(s) to the headless svc domain
if retry_while "hostname_has_ips $domain"; then
    local -r ahosts="$(getent ahosts "$domain" | awk '{print $1}' | uniq | wc -l)"
    for i in $(seq 0 $((ahosts - 1))); do
        # We use the StatefulSet name stored in MY_STS_NAME to get the peer names based on the number of IPs registered in the headless service
        pod_name="${MY_STS_NAME}-${i}"
        if ! { [[ $only_others = true ]] && [[ "$pod_name" = "$MY_POD_NAME" ]]; }; then
            endpoints+=("${pod_name}.${ETCD_CLUSTER_DOMAIN}:${port:-2380}")
        fi
    done
fi
echo "${endpoints[*]}" | tr ' ' ','

}

The Bitnami Helm template sets the env vars ETCD_CLUSTER_DOMAIN and MY_STS_NAME, so the snapshot runs successfully there.

I think this is the problem.
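The loop in the function above can be reduced to the following sketch, which shows why both variables are needed. The helper name `build_endpoints` and the sample arguments are hypothetical, and the member count is passed in explicitly here instead of being discovered with `getent`.

```shell
#!/usr/bin/env bash
# Sketch only: builds "<sts>-<i>.<domain>:<port>" for each member index.
build_endpoints() {
    local sts_name="${1:?sts_name is required}"   # plays the role of MY_STS_NAME
    local domain="${2:?domain is required}"       # plays the role of ETCD_CLUSTER_DOMAIN
    local count="${3:?count is required}"         # number of members (assumed known)
    local port="${4:-2380}"                       # same fallback as the quoted code
    local -a endpoints=()
    local i
    for ((i = 0; i < count; i++)); do
        endpoints+=("${sts_name}-${i}.${domain}:${port}")
    done
    # Join the array with commas.
    local IFS=','
    echo "${endpoints[*]}"
}

build_endpoints etcd etcd-headless.default.svc.cluster.local 2 2379
# prints: etcd-0.etcd-headless.default.svc.cluster.local:2379,etcd-1.etcd-headless.default.svc.cluster.local:2379
```

If either the StatefulSet name or the domain is missing, every generated hostname is malformed, which is consistent with the endpoints being reported as unhealthy.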

@haorenfsa
Collaborator

Hi @damuji8, thank you for the feedback. We forked the Bitnami etcd Docker image (https://github.com/milvus-io/bitnami-docker-etcd) to fix occasional initialization failures and a scale-out problem.

We haven't used or tested any functions other than running and scaling, so they are very likely broken. We forked the repo at tag 3.4.18-debian-10-r50, so features added after that tag are not supported either.

Collaborator

haorenfsa commented Jan 23, 2024

As for this particular case, I believe Bitnami's way of handling it is far too complicated, so I removed all the logic in etcdctl_get_endpoints() and use ETCD_INITIAL_CLUSTER directly. That is why you see only one line of code in etcdctl_get_endpoints().

I checked the template and my test release: ETCD_CLUSTER_DOMAIN is set, but MY_STS_NAME is not. That's because the etcd chart version we're using is 6.3.3, which is quite old. But it's very stable, and we don't intend to change it.

You can add it by setting env vars in the Helm release values.
A PR adding it to the default values is also very welcome if you have time.

Author

damuji8 commented Jan 23, 2024

Does your etcd have the env var ETCD_INITIAL_CLUSTER? I set this env var in the etcd Helm template myself.

@haorenfsa
Collaborator

@damuji8 Yes, ETCD_INITIAL_CLUSTER is included in the statefulset template.

Author

damuji8 commented Jan 23, 2024

The cronjob YAML does not include the ETCD_INITIAL_CLUSTER env var.

Author

damuji8 commented Jan 23, 2024

Without this env var, the snapshot cannot be taken successfully.
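Until the cronjob template passes the variable, a small guard at the start of the snapshot job could at least turn the misleading "unhealthy endpoints" error into an explicit one. This is only a sketch: `require_env` is a hypothetical helper, not something the image ships.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every named env var that is unset or empty.
require_env() {
    local name missing=0
    for name in "$@"; do
        # ${!name:-} is bash indirect expansion: the value of the variable called $name.
        if [[ -z "${!name:-}" ]]; then
            echo "missing required env var: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage before taking the snapshot:
#   require_env ETCD_INITIAL_CLUSTER || exit 1
```

The function returns non-zero if any listed variable is missing, so the caller decides whether to abort.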


2 participants