New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpf: lxc: simplify RevNAT path for loopback replies #32480
Open
julianwiedmann
wants to merge
2
commits into
cilium:main
Choose a base branch
from
julianwiedmann:1.16-bpf-loopback-revnat
base: main
Could not load branches
Branch not found: {{ refName }}
Could not load tags
Nothing to show
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
bpf: lxc: simplify RevNAT path for loopback replies #32480
julianwiedmann
wants to merge
2
commits into
cilium:main
from
julianwiedmann:1.16-bpf-loopback-revnat
+200
−134
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
julianwiedmann
added
release-note/misc
This PR makes changes that have no direct user impact.
kind/tech-debt
Technical debt
area/loadbalancing
Impacts load-balancing and Kubernetes service implementations
sig/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
labels
May 11, 2024
/test |
julianwiedmann
force-pushed
the
1.16-bpf-loopback-revnat
branch
from
May 11, 2024 09:27
5ecbae1
to
a7a8d53
Compare
/test |
julianwiedmann
force-pushed
the
1.16-bpf-loopback-revnat
branch
2 times, most recently
from
May 11, 2024 09:33
c4a430e
to
afd51f7
Compare
/test |
julianwiedmann
force-pushed
the
1.16-bpf-loopback-revnat
branch
from
May 11, 2024 09:59
afd51f7
to
5c1b041
Compare
/test |
julianwiedmann
changed the title
1.16 bpf loopback revnat
bpf: lxc: simplify RevNAT path for loopback replies
May 17, 2024
julianwiedmann
force-pushed
the
1.16-bpf-loopback-revnat
branch
from
May 17, 2024 08:18
5c1b041
to
5e7d3f5
Compare
/test |
julianwiedmann
added
the
dont-merge/preview-only
Only for preview or testing, don't merge it.
label
May 17, 2024
The usual flow for handling service traffic to a local backend is as follows: * requests are load-balanced in from-container. This entails selecting a backend (and caching the selection in a CT_SERVICE entry), DNATing the packet, creating a CT_EGRESS entry for the resulting `client -> backend` flow, applying egress network policy, and local delivery to the backend pod. As part of the local delivery, we also create a CT_INGRESS entry and apply ingress network policy. * replies bypass the backend's egress network policy (because the CT lookup returns CT_REPLY), and pass to the client via local delivery. In the client's ingress path they bypass ingress network policy (the packets match as reply against the CT_EGRESS entry), and we apply RevDNAT based on the `rev_nat_index` in the CT_EGRESS entry. For a loopback connection (where the client pod is selected as backend for the connection) this looks slightly more complicated: * As we can't establish a `client -> client` connection, the requests are also SNATed with IPV4_LOOPBACK. Network policy in forward direction is explicitly skipped (as the matched CT entries have the `.loopback` flag set). * In reply direction, we can't deliver to IPV4_LOOPBACK (as that's not a valid IP for an endpoint lookup). So a reply already gets fully RevNATed by from-container, using the CT_INGRESS entry's `rev_nat_index`. But this means that when passing into the client pod (either via to-container, or via the ingress policy tail-call), the packet doesn't match as reply to the CT_EGRESS entry - and so we don't benefit from automatic network policy bypass. We ended up with two workarounds for this aspect: (1) when to-container is installed, it contains custom logic to match the packet as a loopback reply, and skip ingress policy (see cilium#27798). (2) otherwise we skip the ingress policy tailcall, and forward the packet straight into the client pod. The downside of these workarounds is that we bypass the *whole* ingress program, not just the network policy part. So the CT_EGRESS entry doesn't get updated (lifetime, statistics, observed packet flags, ...), and we have the hidden risk that when we add more logic to the ingress program, it doesn't get executed for loopback replies. This patch aims to eliminate the need for such workarounds. At its core, it detects loopback replies in from-container and overrides the packet's destination IP. Instead of attempting an endpoint lookup for IPV4_LOOPBACK, we can now look up the actual client endpoint - and deliver to the ingress policy program, *without* needing to early-RevNAT the packet. Instead the replies follow the usual packet flow, match the CT_EGRESS entry in the ingress program, naturally bypass ingress network policy, and are *then* RevNATed based on the CT_EGRESS entry's `rev_nat_index`. Consequently we follow the standard datapath, without needing to skip over policy programs. The CT_EGRESS entry is updated for every reply. Thus we can also remove the manual policy bypass for loopback replies, when using per-EP routing. It's no longer needed and in fact the replies will no longer match the lookup logic, as they haven't been RevNATed yet. This effectively reverts e2829a0 ("bpf: lxc: support Pod->Service->Pod hairpinning with endpoint routes"). Signed-off-by: Julian Wiedmann <[email protected]>
The `hairpin_flow` parameter was previously needed so that loopback replies could bypass the ingress network policy. But now that all callers set this parameter to `false`, we can safely remove the corresponding logic. Signed-off-by: Julian Wiedmann <[email protected]>
julianwiedmann
force-pushed
the
1.16-bpf-loopback-revnat
branch
from
May 22, 2024 10:30
5e7d3f5
to
206e9a9
Compare
Rebase to resolve a trivial conflict. |
/test |
ti-mo
approved these changes
May 22, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad the hack is gone. 👍 Can't really give a meaningful comprehensive review, though.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/loadbalancing
Impacts load-balancing and Kubernetes service implementations
dont-merge/preview-only
Only for preview or testing, don't merge it.
kind/tech-debt
Technical debt
release-note/misc
This PR makes changes that have no direct user impact.
sig/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR aims to clean up some long-standing tech debt, in how we handle reply traffic for a loopback connection (a client connecting to itself through a service).
Quoting from the first patch