Race issue after node reboot #1221

Open
SchSeba opened this issue Feb 1, 2024 · 12 comments · May be fixed by #1213
Comments

@SchSeba
Contributor

SchSeba commented Feb 1, 2024

Hi, it looks like there is a race in Multus after a node reboot that can prevent the pod from starting:

kubectl -n kube-system logs -f kube-multus-ds-ml62q -c install-multus-binary
cp: cannot create regular file '/host/opt/cni/bin/multus-shim': Text file busy

The problem occurs mainly after a reboot: CRI-O calls multus-shim to start pods, but the Multus pod itself cannot start because the init container fails to cp the shim binary.
The copy fails because CRI-O has already invoked the shim, which is stuck waiting to communicate with the Multus pod:

[root@virtual-worker-0 centos]# lsof /opt/cni/bin/multus-shim
COMMAND    PID USER  FD   TYPE DEVICE SIZE/OFF     NODE NAME
multus-sh 8682 root txt    REG  252,1 46760102 46241656 /opt/cni/bin/multus-shim
[root@virtual-worker-0 centos]# ps -ef | grep mult
root        8682     936  0 16:27 ?        00:00:00 /opt/cni/bin/multus-shim
root        9082    7247  0 16:28 pts/0    00:00:00 grep --color=auto mult
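For context, "Text file busy" is ETXTBSY, which Linux returns when a process tries to open a currently-executing binary for writing. A minimal sketch (illustrative paths, not from the issue) that reproduces the same failure on any Linux host:

# set up a dummy binary and keep it running, like the stuck multus-shim above
cp /bin/sleep /tmp/fake-shim
/tmp/fake-shim 300 &
# plain cp opens the existing file for writing and truncates it in place,
# which the kernel refuses while the binary is being executed
cp /bin/sleep /tmp/fake-shim
# cp: cannot create regular file '/tmp/fake-shim': Text file busy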
@SchSeba
Contributor Author

SchSeba commented Feb 1, 2024

[root@virtual-worker-0 centos]# ps -ef | grep 942
root         942       1  5 17:07 ?        00:00:00 /usr/bin/crio
root        1246     942  0 17:07 ?        00:00:00 /opt/cni/bin/multus-shim
root        2745    2395  0 17:08 pts/0    00:00:00 grep --color=auto 942

from crio:

from CNI network \"multus-cni-network\": plugin type=\"multus-shim\" name=\"multus-cni-network\" failed (delete): netplugin failed with no error message: signal: killed"

@SchSeba
Contributor Author

SchSeba commented Feb 1, 2024

Just an update: adding -f to the copy command looks like it fixes the issue.
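For anyone curious why -f is enough, a rough sketch (assuming GNU coreutils cp; paths are illustrative): when the destination cannot be opened for writing, cp -f unlinks it and creates a fresh file, so the already-running shim keeps its old inode and the copy no longer hits ETXTBSY.

cp /bin/sleep /tmp/fake-shim       # destination binary (illustrative path)
/tmp/fake-shim 300 &               # keep it running, like the stuck multus-shim
cp /bin/sleep /tmp/fake-shim       # fails: Text file busy
cp -f /bin/sleep /tmp/fake-shim    # succeeds: -f removes the busy file and writes a new inode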

@rrpolanco

rrpolanco commented Feb 2, 2024

Coincidentally, we also saw this error crop up yesterday with one of our edge clusters after rebooting.

@adrianchiris
Contributor

adrianchiris commented Feb 4, 2024

As an FYI, I see that different deployment YAMLs use different ways to copy the CNI binary in the init container:

The first one [1] uses install_multus, which copies files in an atomic manner; the latter [2] just uses cp.
(install_multus supports both thick and thin plugin types.)

Although I'm not sure that copying the file atomically will solve the above issue.

see:
[1] command: ["/install_multus"]
and
[2] https://github.com/k8snetworkplumbingwg/multus-cni/blob/8e5060b9a7612044b7bf927365bbdbb8f6cde451/deployments/multus-daemonset-thick.yml#L199C9-L204C46

Also, deployments/multus-daemonset-crio.yml does not use an init container.
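For reference, an "atomic" install typically writes to a temporary file on the same filesystem and then renames it over the destination, so readers always see either the old or the new complete binary and the running shim keeps its old inode. A rough shell sketch of that pattern (not the actual install_multus script; paths are illustrative):

# copy to a temporary name on the destination filesystem first
cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/.multus-shim.tmp
# mv within one filesystem is a rename(2), which swaps the directory entry
# atomically and never opens or truncates the old (possibly executing) inode
mv -f /host/opt/cni/bin/.multus-shim.tmp /host/opt/cni/bin/multus-shim

Whether that resolves the stuck shim process itself is a separate question, but it does sidestep the Text file busy error that plain cp hits.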

@dougbtv
Member

dougbtv commented Feb 15, 2024

This should hopefully be addressed with #1213

@kfox1111

Saw this in minikube today. No rebooting, just starting up a new minikube cluster.

@dougbtv
Member

dougbtv commented Apr 2, 2024

I also got a reproduction after rebooting a node and having multus restart.

I mitigated it by deleting /opt/cni/bin/multus-shim, but yeah, I'll retest with the above patch.

[fedora@labkubedualhost-master-1 whereabouts]$ watch -n1 kubectl get pods -A -o wide
[fedora@labkubedualhost-master-1 whereabouts]$ kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-daemon-config created
daemonset.apps/kube-multus-ds created
[fedora@labkubedualhost-master-1 whereabouts]$ watch -n1 kubectl get pods -A -o wide
[fedora@labkubedualhost-master-1 whereabouts]$ kubectl logs kube-multus-ds-fzdcr -n kube-system
Defaulted container "kube-multus" out of: kube-multus, install-multus-binary (init)
Error from server (BadRequest): container "kube-multus" in pod "kube-multus-ds-fzdcr" is waiting to start: PodInitializing

@dustinrouillard

It seems I can make this happen any time I ungracefully restart a node, worker or master: it produces this error and stops pod network sandbox recreation completely on that node.

The fix mentioned above does work, but this likely means a power outage of a node will require manual intervention that would not be needed without Multus; this error should be handled properly.

@kfox1111

+1. This seems like a pretty serious issue. Can we get a fix merged for it soon, please?

@tomroffe

tomroffe commented Jun 7, 2024

I can additionally confirm this behavior. As @dougbtv mentioned, removing /opt/cni/bin/multus-shim works as a workaround.
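For anyone who needs it, a rough sketch of that workaround (the pod name and namespace are illustrative, taken from the logs earlier in this thread):

# on the affected node: remove the busy shim so the init container's cp can succeed
rm -f /opt/cni/bin/multus-shim
# optionally delete the stuck Multus pod so the DaemonSet recreates it right away
# (otherwise the init container should retry on its own after a backoff)
kubectl -n kube-system delete pod kube-multus-ds-ml62q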

@ulbi

ulbi commented Jun 15, 2024

+1, happened to me as well; the cluster did not come up. Any chance of fixing this soon?

@stefb69

stefb69 commented Jun 18, 2024

Same here, on a Kubespray 1.29 cluster.
