pkg/endpoint: make state synchronization atomic #32439

Merged
2 commits merged on May 13, 2024

Contributor

@lmb lmb commented May 9, 2024

pkg/endpoint: make state synchronization atomic

BPF regeneration writes state into a new temporary directory. Once it has
succeeded we need to swap the old and new directory. This is currently 
achieved by "backing up" the current state by renaming the directory. This
code has a bunch of corner cases around cleaning up old directories and so
on which are necessary since the synchronization isn't truly atomic.

Instead, use the RENAME_EXCHANGE flag to atomically exchange the two 
existing directories. Also use hard links to retain existing state so that
killing the agent during a synchronization doesn't lead to corruption.

Signed-off-by: Lorenz Bauer <[email protected]>

pkg/endpoint: always copy existing state during synchronization

Endpoint regeneration goes to a lot of trouble to keep track of whether the
on-disk state has changed, only to avoid doing a couple of readdir syscalls.
The behaviour was added in commit f6c4385a43
("pkg/endpoint: Keep BPF object files if compilation is skipped.") and the
message does not indicate that performance was of particular concern.

Do the safe thing and always perform the state copy.

Signed-off-by: Lorenz Bauer <[email protected]>

This is a prerequisite for a larger PR in which I'm reshuffling how the loader does Endpoint templating. The second commit in this PR makes that easier.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label ("The author needs to describe the release impact of these changes.") on May 9, 2024
@lmb lmb added the release-note/misc ("This PR makes changes that have no direct user impact.") and sig/agent ("Cilium agent related.") labels and removed the dont-merge/needs-release-note-label label ("The author needs to describe the release impact of these changes.") on May 9, 2024
Contributor Author

lmb commented May 9, 2024

/test

@lmb lmb force-pushed the pr/lmb/endpoint-use-replace-exchange branch from 1e78f20 to b4a4f46 on May 9, 2024 14:42

lmb commented May 9, 2024

/test

@lmb lmb force-pushed the pr/lmb/endpoint-use-replace-exchange branch from b4a4f46 to b220ee5 on May 10, 2024 08:36

lmb commented May 10, 2024

/test

@lmb lmb changed the title from "pkg/endpoint: simplify state directory synchronization" to "pkg/endpoint: make state synchronization atomic" on May 10, 2024
@lmb lmb marked this pull request as ready for review May 10, 2024 08:57
@lmb lmb requested a review from a team as a code owner May 10, 2024 08:57
@lmb lmb requested a review from nathanjsweet May 10, 2024 08:57

lmb commented May 10, 2024

/test

Member

@nathanjsweet nathanjsweet left a comment

Nice.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge label ("This PR has passed all tests and received consensus from code owners to merge.") on May 13, 2024
@nathanjsweet nathanjsweet added this pull request to the merge queue May 13, 2024
Merged via the queue into cilium:main with commit 900c846 May 13, 2024
64 checks passed
@adamwathieu

Thanks for the fix @lmb! Since issue #15446 can be hit in any version, should we backport this to 1.13, 1.14, and 1.15?
