Idempotent drop_generation #13072
Conversation
Force-pushed from 9ef5ef9 to bd6ba87
Force-pushed from 8991d9f to 7a310c3
@@ -27,25 +27,6 @@ opts() ->
    %%
t_idempotent_store_batch(_Config) ->
Well, this doesn't hold, since writing batches to the old generations breaks all kinds of assumptions about the replay.

@keynslug Is this property important for e.g. the replication layer? Can it replay `add_generation` and `store_batch` out of order?
We had a discussion about that behavior and settled on the approach of "preserve idempotency as much as possible", basically. Initially the approach was "drop batches targeting outdated generations on the floor".

> Can it replay `add_generation` and `store_batch` out of order?

It sort of depends on what "out-of-order" means here. The class of situations this was introduced to cover more or less boils down to the following (sketched below):

1. Replica applied `store_batch`, `add_generation`, `store_batch` to the storage state `SS0`.
2. Storage state becomes `SS1`.
3. Something happened, e.g. `SS1` was transferred as a snapshot to another replica.
4. The replica applies the same sequence (e.g. from the Raft log) to the storage state `SS1`.
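For illustration only, a tiny self-contained sketch of that sequence (hypothetical names and state shape, not the actual replication-layer code):

```erlang
-module(replay_sketch).
-export([demo/0]).

%% Toy storage state: {CurrentGenId, #{GenId => [Batch]}}.
apply_op({store_batch, Batch}, {Gen, Gens}) ->
    {Gen, maps:update_with(Gen, fun(Bs) -> [Batch | Bs] end, [Batch], Gens)};
apply_op(add_generation, {Gen, Gens}) ->
    {Gen + 1, Gens#{Gen + 1 => []}}.

demo() ->
    Log = [{store_batch, b1}, add_generation, {store_batch, b2}],
    SS0 = {1, #{1 => []}},
    %% First application: b1 lands in generation 1, b2 in generation 2.
    SS1 = lists:foldl(fun apply_op/2, SS0, Log),
    %% Re-applying the same entries on top of SS1 (e.g. after a snapshot
    %% transfer) writes b1 into generation 2 and b2 into generation 3,
    %% which is exactly the replay assumption being broken.
    SS2 = lists:foldl(fun apply_op/2, SS1, Log),
    {SS1, SS2}.
```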
Well, `commit_batch` is still idempotent, since `prepare` adds the generation id into the cooked batch. I think splitting the operation should ultimately help with this. There are two obstacles, assuming we run `prepare` in the aux server on the leader:

- ra doesn't provide an API to call the aux server with a timeout (can be worked around);
- there's no API to get the leader from a node that is not part of the shard cluster.
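A rough sketch of the prepare/commit split being described, with hypothetical names standing in for the real storage-layer calls:

```erlang
-module(cooked_batch_sketch).
-export([prepare/2, commit/2]).

%% prepare/2 records which generation the batch was cooked against, so
%% commit/2 always writes into that specific generation. Replaying the
%% same cooked batch (even after an add_generation in between) therefore
%% targets the same generation both times; idempotency of the write
%% itself (e.g. key-value overwrite semantics) is assumed here.
prepare(Shard, Messages) ->
    GenId = generation_current(Shard),
    #{generation => GenId, cooked => Messages}.

commit(Shard, #{generation := GenId, cooked := Cooked}) ->
    case generation_exists(Shard, GenId) of
        true  -> do_commit(Shard, GenId, Cooked);
        false -> ok  %% generation already dropped: nothing left to write to
    end.

%% Stubs standing in for the real storage-layer calls.
generation_current(_Shard) -> 1.
generation_exists(_Shard, GenId) -> GenId >= 1.
do_commit(_Shard, _GenId, Cooked) -> {ok, length(Cooked)}.
```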
Force-pushed from be9617c to dd852d4
Force-pushed from 06fc3df to d70b3f3
@@ -179,8 +179,7 @@ make_delete_iterator(Node, DB, Shard, Stream, TopicFilter, StartTime) ->
    | {ok, end_of_stream}
    | {error, _}.
delete_next(Node, DB, Shard, Iter, Selector, BatchSize) ->
-    emqx_rpc:call(
-        Shard,
+    erpc:call(
Using janky gen_rpc is only justified when forwarding large binaries; that is not the case here.
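For reference, `erpc:call/5` is plain OTP and takes an explicit timeout; the module, function, and arguments below are placeholders, not the actual MFA from this diff:

```erlang
%% erpc:call(Node, Module, Function, Args, Timeout) runs the call on Node
%% and raises an exception on failure, rather than returning {badrpc, _}
%% tuples the way the gen_rpc-based wrappers do.
Result = erpc:call(Node, some_module, some_function, [Arg1, Arg2], 5000).
```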
Force-pushed from 1be35fc to 50e04a0
apps/emqx_durable_storage/src/emqx_ds_replication_layer_shard.erl (outdated review thread, resolved)
@@ -546,13 +561,15 @@ handle_event(_ShardId, State = #s{gvars = Gvars}, Time, tick) ->
            LastWrittenTs = 0
    end,
    case Latch of
-        false when ?EPOCH(State, Time) > ?EPOCH(State, LastWrittenTs) ->
+        false when ?EPOCH(State, Time) > ?EPOCH(State, LastWrittenTs) + 1 ->
Q: why is the `+ 1` needed here?
It delays event emission by one epoch to add a little safety margin.
P.S. Added a comment.
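A small sketch of the arithmetic, assuming `?EPOCH(State, Ts)` boils down to integer division of the timestamp by the epoch length (hypothetical values):

```erlang
-module(epoch_margin_sketch).
-export([should_emit/3]).

%% epoch/2 stands in for ?EPOCH/2: which epoch a timestamp falls into.
epoch(EpochLen, Ts) -> Ts div EpochLen.

%% Without "+ 1" the guard fires as soon as Time crosses the first epoch
%% boundary after LastWrittenTs; with it, emission is delayed by one
%% extra epoch. E.g. for EpochLen = 100 and LastWrittenTs = 199 it fires
%% at Time >= 300 rather than Time >= 200.
should_emit(EpochLen, Time, LastWrittenTs) ->
    epoch(EpochLen, Time) > epoch(EpochLen, LastWrittenTs) + 1.
```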
    %% NOTE
    %% We assume that batches do not span generations. Callers should enforce this.
    ?tp(emqx_ds_storage_layer_prepare_batch, #{
        shard => Shard, messages => Messages, options => Options
    }),
-    {GenId, #{module := Mod, data := GenData}} = generation_at(Shard, Time),
+    GenId = generation_current(Shard),
I suspect this could introduce message duplication under snapshot transfer + log replay scenarios. I.e. imagine the following sequence of entries in the log:

{T1 = 42, Msg1}
{T2 = 43, Msg2}
{T3 = 44, add_generation}
{T4 = 45, Msg3}

If they are reapplied on top of a storage state that already had them applied, couldn't `Msg1` and `Msg2` end up in both the previous and the current generation?
Yes, it will happen. But running `prepare` in the main state machine code is something that should be removed: `prepare` should run on the leader (in the aux callback) and add a cooked batch to the Raft log. Committing a cooked batch to the storage is idempotent.
            %% any more messages here. The iterator reached the end:
            %% the stream has been fully replayed.
            {ok, end_of_stream};
    IsCurrent = GenId =:= generation_current(Shard),
I wonder if this loss of write idempotency could be alleviated by picking the "current" generation according to `Now` here, instead of the most recent storage state? I have a feeling this also relaxes the requirement to read from leaders, but am not 100% sure.

The risk here, I guess, is having `Now` point to a dropped generation (from the storage state perspective), but picking the earliest alive generation in that case should work the same. A minimal sketch of this is below.
Force-pushed from 4a4df95 to 4358077
Reduce volume of logs and crash reports from DS
Make sure storage events originating from generation X are handled in the context of the same generation.
Force-pushed from 4358077 to 59a09fb
Fixes EMQX-12402
Release version: v/e5.7
Summary

Various small, but important fixes.

- `drop_generation` operation can be replayed multiple times by the replication layer, but it's not idempotent. This PR adds a workaround that avoids a crash when `drop_generation` doesn't succeed. In the future, however, we want to make `drop_generation` idempotent in a nicer way.
- `storage_layer:generation_at` function, which has been removed for good.
- `format_status` callback added for several workers to avoid dumping the contents of the entire message buffer to the console.
- `end_of_stream` detection has been moved to the layout CBM. Previously the storage layer used a heuristic: old generations that return an empty batch won't produce more data. This is, obviously, incorrect: for example, the bitfield-LTS layout MAY return an empty batch while waiting for the safe cutoff time.
- `reference` layout has been enabled in the prod build. It could be useful for integration testing.
- `bitfield_lts:handle_event` callback: fixed a bug that led to missed safe cutoff time updates and, effectively, subscribers being unable to fetch messages until a fresh batch was published.

PR Checklist
Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

- `changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md` files

Checklist for CI (.github/workflows) changes

- `changes/` dir for user-facing artifacts update