CI: Conformance EKS: Installation and Connectivity Test (1.24, ca-west-1): [check-log-errors]: failed to obtain eni link list: interrupted system call #30990

Open
tommyp1ckles opened this issue Feb 26, 2024 · 9 comments
Assignees
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!

Comments

@tommyp1ckles
Contributor

tommyp1ckles commented Feb 26, 2024

This happens fairly often in the CI EKS pipeline.

 [=] Test [check-log-errors] [65/65]
.........
  [-] Scenario [check-log-errors/no-errors-in-logs]
  [.] Action [check-log-errors/no-errors-in-logs/cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-5x8dh (cilium-agent)]
  [.] Action [check-log-errors/no-errors-in-logs/cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-5x8dh (config)]
  [.] Action [check-log-errors/no-errors-in-logs/cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-5x8dh (mount-cgroup)]
  [.] Action [check-log-errors/no-errors-in-logs/cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-5x8dh (apply-sysctl-overwrites)]
  [.] Action [check-log-errors/no-errors-in-logs/cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-5x8dh (mount-bpf-fs)]
  [.] Action [check-log-errors/no-errors-in-logs/cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-5x8dh (clean-cilium-state)]
  [.] Action [check-log-errors/no-errors-in-logs/cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-5x8dh (install-cni-binaries)]
  [.] Action [check-log-errors/no-errors-in-logs/cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-operator-844886dbf6-pxkpv (cilium-operator)]
  [.] Action [check-log-errors/no-errors-in-logs/cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-pctck (cilium-agent)]
  ❌ Found 2 logs in cilium-cilium-8049278400-1.ca-west-1.eksctl.io/kube-system/cilium-pctck (cilium-agent) matching list of errors that must be investigated:
level=error msg="Timed out waiting for ENIs to be attached" attachedENIs="map[]" error="failed to obtain eni link list: interrupted system call" expectedENIs="map[]" subsys=ipam (1 occurrences)
level=error msg="Timed out waiting for ENIs to be attached" attachedENIs="map[]" error="failed to obtain eni link list: interrupted system call" expectedENIs="map[0e:b8:fe:46:45:dc:eni-0a4378a48ca6f8443]" subsys=ipam (1 occurrences)

Example(s):

Sysdump:
cilium-sysdumps(40).zip

@tommyp1ckles tommyp1ckles added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Feb 26, 2024
@tommyp1ckles
Contributor Author

Some preliminary investigation: this is failing in netlink.LinkList(). I presume the EINTR error is coming from netlink's recvfrom on the netlink socket. I'm not sure why this is happening. We've had issues with netlink requests hanging before, so this could be the result of a socket timeout being exceeded?

@tommyp1ckles tommyp1ckles self-assigned this Mar 4, 2024
tommyp1ckles added a commit that referenced this issue Mar 5, 2024
This release includes an improved uninstall/cleanup command, which should alleviate various issues when re-using clusters while re-running e2e tests.

Fixes: #30990 #30991 #30993

Signed-off-by: Tom Hadlaw <[email protected]>
@pchaigno
Member

pchaigno commented Apr 9, 2024

@tommyp1ckles Are you still working on this? If not, let's try to get an assignee.

It's still happening frequently in Conformance EKS.

@jasonaliyetti
Contributor

jasonaliyetti commented Apr 15, 2024

This error is something that I'm seeing in my EKS environments when trying to upgrade to 1.15. I can file a separate issue if preferred, but wanted to mention that it had been seen in the wild. This error occurs during ENI attachment for some nodes and leads to connectivity issues. A restart of the agent seems to recover it until a new ENI is attached.

@pchaigno
Member

@jasonaliyetti Thanks for the heads up! I think it would indeed be best to open a separate issue. This is for the flake and may be addressed differently from an actual user issue.

@jasonaliyetti
Contributor

Just circling back...I think this gets addressed by #32099.

@tommyp1ckles tommyp1ckles removed their assignment Apr 23, 2024
@tommyp1ckles
Contributor Author

Thanks @jasonaliyetti. From what I can tell this should affect all release branches, so let's wait for the PR backports to be done and then we can probably close this issue.

@tommyp1ckles tommyp1ckles self-assigned this Apr 23, 2024
@jasonaliyetti
Contributor

@tommyp1ckles I see CI failing on the new error log (https://github.com/cilium/cilium/actions/runs/8848559648/job/24298646382#step:23:199). Is there any way to suppress this failure for this message since it just indicates a retry occurred? I'm happy to create an MR but it'd save me some time if you could point me in the right direction.
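As a rough illustration of what suppressing a known message could look like, here is a minimal sketch of a log check that skips error lines matching an ignore list. This is not the actual connectivity-test implementation: the function `filterLogErrors`, its parameters, and the idea of matching on substrings are all assumptions made for the example.

```go
package main

import "strings"

// filterLogErrors returns the error-level log lines that do not contain any
// of the ignored substrings. The function and its parameters are
// hypothetical; the real check-log-errors test may work differently.
func filterLogErrors(lines, ignored []string) []string {
	var out []string
	for _, line := range lines {
		// Only error-level lines are candidates for a failure.
		if !strings.Contains(line, "level=error") {
			continue
		}
		skip := false
		for _, s := range ignored {
			if strings.Contains(line, s) {
				skip = true
				break
			}
		}
		if !skip {
			out = append(out, line)
		}
	}
	return out
}
```

With this shape, a message that only reports a retry would be added to the ignore list, while every other error-level line would still fail the check.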

@giorio94
Member

4 participants