ipsec: Safely delete xfrm state
This patch introduces a workaround to avoid a kernel issue when deleting
xfrm states.

Let's start from the kernel issue.

Install two xfrm states on the same host using the commands below
(note the differences in mark, mask, and src):

```
ip x s a src 10.244.1.43 dst 10.244.3.114 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0x2a450d00 mask 0xffff0f00 output-mark 0xd00 mask 0xffffff00 aead 'rfc4106(gcm(aes))' 0x42a89014074c49243219a20cc87cadd7b9c0d7d1 128 sel src 0.0.0.0/0 dst 0.0.0.0/0
ip x s a src 0.0.0.0 dst 10.244.3.114 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0xd00 mask 0xf00 output-mark 0xd00 mask 0xffffff00 aead 'rfc4106(gcm(aes))' 0x42a89014074c49243219a20cc87cadd7b9c0d7d1 128 sel src 0.0.0.0/0 dst 0.0.0.0/0
```

When trying to delete the first xfrm state using the following command,
the Linux kernel instead removes the second xfrm state and keeps the
first:

```
ip x s d src 10.244.1.43 dst 10.244.3.114 proto esp spi 0x00000003 mark 0x2a450d00 mask 0xffff0f00
```

This causes trouble during cilium upgrades.

A real world scenario for cilium upgrade from 1.13.12 to 1.13.14 could
be like:
1. Before upgrade, the node has "old-style" xfrm state to catch mark
   "0xd00/0xf00" for ingress traffic; old bpf programs also set "0xd00"
   mark to ingress skbs;
2. Upgrade begins, bpf programs are reloaded to new version, thereafter
   ingress skbs are marked with "0xXXXX0d00";
3. After a short while, cilium-agent installs new xfrm states to catch
   traffic with specific mark "0xXXXX0d00";

During the window between steps 2 and 3, cilium relies on the "old-style"
xfrm states "0xd00/0xf00" to catch traffic with the specific mark
"0xXXXX0d00".
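
To illustrate why the old-style state still catches the new marks, here is
a minimal Go sketch of value/mask matching. The helper markMatches is
hypothetical (it is not part of cilium or the kernel), and it assumes the
usual rule that only the bits set in the mask are compared. The mark values
are the ones from the commands above.

```go
package main

import "fmt"

// markMatches is a hypothetical helper: it reports whether a packet mark
// is caught by a state whose mark is value/mask, assuming that only the
// bits set in mask take part in the comparison.
func markMatches(pktMark, value, mask uint32) bool {
	return pktMark&mask == value&mask
}

func main() {
	const newStyleMark = 0x2a450d00 // a "0xXXXX0d00" mark set by the new bpf programs

	// The old-style state 0xd00/0xf00 ignores the upper 16 bits, so it
	// still catches packets carrying the new-style mark.
	fmt.Println(markMatches(newStyleMark, 0xd00, 0xf00))

	// The new-style state 0x2a450d00/0xffff0f00 catches the same packet.
	fmt.Println(markMatches(newStyleMark, 0x2a450d00, 0xffff0f00))

	// The reverse does not hold: a packet marked with plain 0xd00 is not
	// caught by the new-style state.
	fmt.Println(markMatches(0xd00, 0x2a450d00, 0xffff0f00))
}
```

Both states accept the same new-style packet mark, which is the overlap
that the delete request later runs into.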

So far so good.

However, in a large-scale cluster it's inevitable to receive
NodeDeletion events during the upgrade due to node churn. On seeing a
NodeDeletion event, cilium-agent removes the xfrm state for the
departed remote node.

Now we hit the aforementioned kernel issue: cilium-agent tries to delete
the xfrm state catching the more specific mark, but the kernel wrongly
removes the one catching the general mark.

This causes traffic disruption until the upgrade completes with all new
xfrm states installed.

This patch provides a low-cost workaround: when cilium-agent wants to
remove an xfrm state catching a specific mark, it temporarily removes the
xfrm state catching the general mark first and adds it back afterwards:
1. Temporarily remove the xfrm states catching the general mark;
2. Remove the xfrm state we really care about;
3. Add back the states temporarily removed in step 1;
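
The three steps can be sketched with a toy in-memory model. Everything in
this sketch is an assumption made for illustration: xfrmState, stateTable
and buggyDelete are hypothetical, and buggyDelete only imitates the
behaviour described above (the first state whose value/mask accepts the
requested mark is removed). It is not the kernel's lookup code and not the
code added by this patch.

```go
package main

import "fmt"

// xfrmState is a toy stand-in that keeps only a name and a mark.
type xfrmState struct {
	name  string
	value uint32
	mask  uint32
}

// stateTable is a hypothetical in-memory list of states.
type stateTable struct {
	states []xfrmState
}

func (t *stateTable) add(s xfrmState) { t.states = append(t.states, s) }

// removeExact removes the state identical to s and reports whether it was found.
func (t *stateTable) removeExact(s xfrmState) bool {
	for i, cur := range t.states {
		if cur == s {
			t.states = append(t.states[:i], t.states[i+1:]...)
			return true
		}
	}
	return false
}

// buggyDelete imitates the issue (assumed model): it removes the first
// state whose own value/mask accepts the requested mark.
func (t *stateTable) buggyDelete(mark uint32) (string, bool) {
	for i, cur := range t.states {
		if mark&cur.mask == cur.value {
			t.states = append(t.states[:i], t.states[i+1:]...)
			return cur.name, true
		}
	}
	return "", false
}

// safeDelete follows the three steps: set the general state aside,
// delete the target, then restore the general state on return.
func safeDelete(t *stateTable, target, general xfrmState) (deleted string) {
	if t.removeExact(general) { // step 1
		defer t.add(general) // step 3 runs when safeDelete returns
	}
	deleted, _ = t.buggyDelete(target.value) // step 2
	return deleted
}

func main() {
	general := xfrmState{name: "general", value: 0xd00, mask: 0xf00}
	specific := xfrmState{name: "specific", value: 0x2a450d00, mask: 0xffff0f00}

	// Naive delete: the general state is hit first and removed by mistake.
	naive := &stateTable{states: []xfrmState{general, specific}}
	name, _ := naive.buggyDelete(specific.value)
	fmt.Println("naive delete removed:", name)

	// Safe delete: the specific state is removed and the general one survives.
	safe := &stateTable{states: []xfrmState{general, specific}}
	fmt.Println("safe delete removed:", safeDelete(safe, specific, general))
	fmt.Println("states left:", len(safe.states), safe.states[0].name)
}
```

The deferred add mirrors the shape of the patch, where the restore function
is deferred so that it runs only after the delete of the specific state.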

There is still a small window between the temporary removal and the
re-add, but our past tests show that it lasts only 200-900µs, short
enough that we shouldn't see many drops.

Suggested-by: Julian Wiedmann <[email protected]>
Signed-off-by: gray <[email protected]>
jschwinger233 committed May 14, 2024
1 parent ab12313 commit 71f722f
Showing 1 changed file with 45 additions and 1 deletion.
46 changes: 45 additions & 1 deletion pkg/datapath/linux/ipsec/ipsec_linux.go
@@ -761,14 +761,58 @@ func ipsecDeleteXfrmState(nodeID uint16) error {

errs := resiliency.NewErrorSet(fmt.Sprintf("failed to delete node (%d) xfrm states", nodeID), len(xfrmStateList))
for _, s := range xfrmStatesToDelete {
- if err := netlink.XfrmStateDel(&s); err != nil {
+ if err := safeDeleteXfrmState(&s, xfrmStateList); err != nil {
errs.Add(fmt.Errorf("failed to delete xfrm state (%s): %w", s.String(), err))
}
}

return errs.Error()
}

// safeDeleteXfrmState deletes the given XFRM state. Specifically, if the
// state is to catch ingress traffic marked with nodeID (0xXXXX0d00), we
// temporarily remove the old XFRM state that matches 0xd00/0xf00. This is to
// work around a kernel issue that prevents us from deleting a specific XFRM
// state (e.g. catching 0xXXXX0d00/0xffff0f00) when there is also a general
// xfrm state (e.g. catching 0xd00/0xf00). When both XFRM states coexist,
// kernel deletes the general XFRM state instead of the specific one, even if
// the deleting request is for the specific one.
func safeDeleteXfrmState(state *netlink.XfrmState, stateList []netlink.XfrmState) (err error) {
if getDirFromXfrmMark(state.Mark) == dirIngress && ipsec.GetNodeIDFromXfrmMark(state.Mark) != 0 {
oldXFRMInMark := &netlink.XfrmMark{
Value: linux_defaults.RouteMarkDecrypt,
Mask: linux_defaults.IPsecMarkBitMask,
}

errs := resiliency.NewErrorSet("failed to delete old xfrm states", len(stateList))

scopedLog := log.WithFields(logrus.Fields{
logfields.SPI: state.Spi,
logfields.SourceIP: state.Src,
logfields.DestinationIP: state.Dst,
logfields.TrafficDirection: getDirFromXfrmMark(state.Mark),
logfields.NodeID: getNodeIDAsHexFromXfrmMark(state.Mark),
})

for _, s := range stateList {
if s.Spi == state.Spi && xfrmIPEqual(s.Dst, state.Dst) && xfrmMarkEqual(s.Mark, oldXFRMInMark) {
err, deferFn := xfrmTemporarilyRemoveState(scopedLog, s, string(dirIngress))
if err != nil {
errs.Add(fmt.Errorf("Failed to remove old XFRM %s state %s: %w", string(dirIngress), s.String(), err))
} else {
defer deferFn()
}
}
}
if err := errs.Error(); err != nil {
scopedLog.WithError(err).Error("Failed to clean up old XFRM state")
return err
}
}

return netlink.XfrmStateDel(state)
}

func ipsecDeleteXfrmPolicy(nodeID uint16) error {
scopedLog := log.WithFields(logrus.Fields{
logfields.NodeID: nodeID,