Fix ConnectionStateHandler logic around leave timer & stashed ops workflow #21110

vladsud · 2024-05-16T06:23:46Z

1. Add a UT case that catches the problem (if no other changes are made).
2. Remove incorrect code that starts the leave timer where it's not required (ConnectionStateHandler.ts).
3. Suggest a possible fix (Container.ts) for the stashed ops workflow. It's a bit ugly. Happy to hear better ideas.

Overall note: I think that's not the only shoe to drop. Exposing clientId from previous client that early in boot sequence, and essentially making some events go backwards in time (addMember / removeMember events in Quorum fire after we have established clientId) may be problematic in other areas, including in consumers of FF. In other words, users of the system may not expect such reverse flow of events. In places where they run into it, they may not understand why and how it happens, and may produce incorrect changes to deal with them, or reduce invariant checks and thus reduce code integrity.

packages/loader/container-loader/src/connectionStateHandler.ts

msfluid-bot · 2024-05-16T23:48:41Z

⯅ @fluid-example/bundle-size-tests: +42 Bytes

Metric Name	Baseline Size	Compare Size	Size Diff
aqueduct.js	453.34 KB	453.34 KB	■ No change
azureClient.js	552.54 KB	552.55 KB	⯅ +12 Bytes
connectionState.js	680 Bytes	680 Bytes	■ No change
containerRuntime.js	257.02 KB	257.02 KB	■ No change
fluidFramework.js	359.88 KB	359.88 KB	■ No change
loader.js	134.34 KB	134.36 KB	⯅ +18 Bytes
map.js	41.53 KB	41.53 KB	■ No change
matrix.js	143.75 KB	143.75 KB	■ No change
odspClient.js	520.9 KB	520.91 KB	⯅ +12 Bytes
odspDriver.js	97.3 KB	97.3 KB	■ No change
odspPrefetchSnapshot.js	42.16 KB	42.16 KB	■ No change
sharedString.js	160.27 KB	160.27 KB	■ No change
sharedTree.js	359.86 KB	359.86 KB	■ No change
Total Size	3.2 MB	3.2 MB	⯅ +42 Bytes

Baseline commit: 7dea35e

Generated by 🚫 dangerJS against d203b90

markfields · 2024-05-22T19:21:53Z

packages/loader/container-loader/src/test/connectionStateHandler.spec.ts

@@ -1139,6 +1143,48 @@ describe("ConnectionStateHandler Tests", () => {
 },
 );

+ it("test 'read' reconnect & races ", async () => {
+ connectionStateHandler = createHandler(
+ false, // connectedRaisedWhenCaughtUp,


Does the test require this? Better to use the default values if possible

It does - I pass readClientsWaitForJoinSignal = true, while default uses false for both arguments

Looks like they default to true to me. In createConnectionStateHandler, for each it is looking for a "Disable" flag, and if not set will yield true.
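A minimal sketch of the defaulting behaviour described in the comment above: an option is on unless its "Disable" flag is explicitly set. The config shape and both flag names are hypothetical, since the real ones in createConnectionStateHandler are not quoted in this thread.

```typescript
// Hypothetical config shape; the real config provider API is not shown in this thread.
type Config = Record<string, boolean | undefined>;

interface HandlerOptions {
	connectedRaisedWhenCaughtUp: boolean;
	readClientsWaitForJoinSignal: boolean;
}

// Each option stays enabled unless its (hypothetical) "Disable" flag is explicitly true.
function resolveHandlerOptions(config: Config): HandlerOptions {
	return {
		connectedRaisedWhenCaughtUp: config["DisableCatchUpBeforeConnected"] !== true,
		readClientsWaitForJoinSignal: config["DisableJoinSignalWait"] !== true,
	};
}
```

With an empty config both options resolve to true, which is why a test that wants false for one of them has to pass it explicitly.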


markfields · 2024-05-22T19:31:59Z

packages/loader/container-loader/src/test/connectionStateHandler.spec.ts

+ connectionStateHandler.receivedConnectEvent(connectionDetails2);
+
+ // Clear Audience
+ connectionStateHandler_receivedLeaveSignalEvent(connectionDetails.clientId);


Is this necessary for the test?

Oh I see, it represents the removal of this from the audience. Of course.

If this line is removed, then connection 2 does need to wait for connection 1 to leave, right?

Ok I'm confused 😅 The redundant Join Signal (next line of code) adds the first clientId back to the Audience. I'll go look at the code to see what's going on (started reviewing with tests first)

I'm a little nervous about the logic there, given the volatility of leave/join signals. It's quite a rigid state machine, and it may not fit anymore.

Rather than maintaining a state machine based on the sequence of the Join/Leave events (could be ops or signals), can we be quicker to check the audience at key points?

Specifically, the places that currently check if we're "waiting for the leave op" should first check if the old clientId is still in the quorum, and if not cancel the timer. I just don't trust that every client will get the full sequence of leaves and joins.

Definitely some FUD here, open to a reasonable explanation of what guarantees we do have with the signals.

I'm simply simulating how ConnectionStateHandler observes the reconnection flow. It will see the Audience being fully cleared and then repopulated with new state. I'm adding a comment to clarify that.

Just to make sure we are on the same page: If client was connected as "read", we never wait for "leave" signal for it. So, number of times it shows up again in Audience (because we reconnect to different front-end) should not matter (that's what this PR fixes).

For "write" connections, we make the determination (whether to start the timer and wait for it) only on disconnect, and we stick with that decision. We do not re-evaluate it, and if the timer is active, we proceed to the next step only when the old clientId disappears from the quorum.

I think we enforce what you are asking for - see applyForConnectedState():

assert(
    !this.waitingForLeaveOp || this.hasMember(this.clientId),
    0x2e2 /* "Must only wait for leave message when clientId in quorum" */,
);

and then there is only one place that cancels timer:

private receivedRemoveMemberEvent(clientId: string) {
    // If the client which has left was us, then finish the timer.
    if (this.clientId === clientId) {
        this.prevClientLeftTimer.clear();
        this.applyForConnectedState("removeMemberEvent");
    }
}

(Ideally it should check also if timer is present, I'll look more into it).

I actually do not see a reason to check at random places if client is still in quorum, as there should be no way for us to miss that event.

Let me know what you think
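The flow described in this comment can be sketched roughly as follows. This is a simplified stand-in, not the actual ConnectionStateHandler: the class, its method names, and the in-memory quorum set are all illustrative, and the new connection's pending clientId is left out.

```typescript
type ConnectionMode = "read" | "write";

// Simplified stand-in for ConnectionStateHandler; all names here are illustrative.
class LeaveWaitSketch {
	private readonly quorum = new Set<string>();
	private clientId: string | undefined;
	private mode: ConnectionMode = "read";
	private waiting = false;

	public get waitingForLeave(): boolean {
		return this.waiting;
	}

	public connected(clientId: string, mode: ConnectionMode): void {
		this.clientId = clientId;
		this.mode = mode;
		if (mode === "write") {
			// Only write clients become quorum members.
			this.quorum.add(clientId);
		}
	}

	// The decision to wait is made once, on disconnect, and is not re-evaluated later.
	public disconnected(): void {
		this.waiting =
			this.mode === "write" &&
			this.clientId !== undefined &&
			this.quorum.has(this.clientId);
	}

	// Only the removal of our own previous clientId ends the wait.
	public removeMember(clientId: string): void {
		this.quorum.delete(clientId);
		if (clientId === this.clientId) {
			this.waiting = false;
		}
	}
}
```

In this sketch a "read" client never enters the waiting state, so however many times its old clientId reappears in the Audience makes no difference to it.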


markfields · 2024-05-22T21:07:11Z

packages/loader/container-loader/src/test/connectionStateHandler.spec.ts

+ connectionStateHandler_receivedJoinSignalEvent(connectionDetails);
+ connectionStateHandler_receivedJoinSignalEvent(connectionDetails2);
+
+ // It should not wait for leave of connectionDetails.clientId


I'm not sure... how do we know we can ignore that duplicate Join? We have no indication it was duplicate.

In this test, the Leave signal triggers this code, right?

this.applyForConnectedState("removeMemberEvent");

So then the concept is that we have cleared that old client, and when the duplicate Join comes in, even though its clientId matches ours, we ignore it, because who cares.

Hmmm maybe it's ok. Weird.

It's not "duplicate" from Audience POV - Audience is cleared before we process "initial signals" on a new connection. But yes, same ID "joins" audience again.

There is nothing to ignore here by ConnectionStateHandler - it is simply indifferent to that event.
It's not really "ours", I think that's the key. ConnectionStateHandler should make right calls on loss of connection (RE wait for leave or not), but after that it operates only with pendingClientId.

That said, yes, I agree that applyForConnectedState() overall looks a bit wrong (in terms of else branch).
It's not just applyForConnectedState("removeMemberEvent"); even something as unrelated as applyForConnectedState("containerSaved") seems like it can trigger a "connectedStateRejected" event, which looks wrong to me.

Ah, containerSaved() cancels timeout before calling applyForConnectedState("containerSaved").
Same for receivedRemoveMemberEvent()
And we produce an error event only for "timeout", so that looks OK (i.e. the "connectedStateRejected" event is informational only, and can show up a number of times for the same connection).
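A rough sketch of the point made in the last few lines, under stated assumptions: callers cancel the timer before applying, and a rejection is logged as an error only when the source is "timeout". The class, the "generic" category name, and the readiness condition are assumptions for illustration; only the source strings and the "connectedStateRejected" event name come from this thread.

```typescript
type ApplySource = "removeMemberEvent" | "containerSaved" | "timeout";

interface LoggedEvent {
	eventName: string;
	category: "error" | "generic"; // "generic" is an assumed name for the informational category
	source: ApplySource;
}

// Illustrative stand-in; not the real ConnectionStateHandler.
class ApplySketch {
	public pendingClientId: string | undefined;
	public connected = false;
	public readonly events: LoggedEvent[] = [];
	private timerRunning = false;

	public startLeaveTimer(): void {
		this.timerRunning = true;
	}

	public containerSaved(): void {
		this.timerRunning = false; // timer is cancelled before applying
		this.apply("containerSaved");
	}

	public removeMember(): void {
		this.timerRunning = false; // same pattern as containerSaved()
		this.apply("removeMemberEvent");
	}

	public timerFired(): void {
		this.timerRunning = false;
		this.apply("timeout");
	}

	private apply(source: ApplySource): void {
		// Assumed readiness condition: a pending connection exists and no leave wait is active.
		if (this.pendingClientId !== undefined && !this.timerRunning) {
			this.connected = true;
			return;
		}
		this.events.push({
			eventName: "connectedStateRejected",
			category: source === "timeout" ? "error" : "generic",
			source,
		});
	}
}
```

Under these assumptions, repeated rejections from "containerSaved" or "removeMemberEvent" only add informational entries; the error category is reserved for the timeout path.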

packages/loader/container-loader/src/container.ts

markfields · 2024-05-22T23:20:59Z

packages/loader/container-loader/src/container.ts

+ // IN other words, if connectionStateHandler has access to Quorum early in load sequence, it will see events (in stashed ops mode)
+ // in the order that is not possible in real life, that it may not expect.
+ // Ideally, we should supply pendingLocalState?.clientId here as well, not in constructor, but it does not matter (at least today)
+ this.connectionStateHandler.initProtocol(this.protocolHandler);


This makes sense, but wanted to think out loud about it too since this kind of move is pretty subtle.

I looked at all the code that runs between the old spot and new spot. The critical piece is obviously replaying stashed ops ("saved ops"), but it also will do ops fetch. Any concerns with delaying this past that point?

What I can think of is that other clients' join/leave ops could come in, but those never matter to ConnectionStateHandler, it is only concerned with its own.

And this function is all about readying the container to be connected, so it makes sense to go here.

Correct. There is some risk involved here, and to be honest - I'd rather look (over time) at keeping this call where it was, but change when / how we communicate pendingLocalState.clientId to connectionStateHandler (make it know about such clientId much later in load sequence). I'll look more into it, maybe it's not that hard - not sure.

…to ReconnectBug

markfields · 2024-05-23T20:21:58Z

packages/loader/container-loader/src/connectionStateHandler.ts

+ // we might wait even if we could avoid such wait.
+ if (
+ this._clientId !== undefined &&
+ protocol.quorum?.getMember(this._clientId) !== undefined


You're checking quorum instead of membership (could be audience), because if this._clientId was a read connection we don't care -- right?

Correct. And it's equivalent to how we do it in the main flow.
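To illustrate why the check goes through the quorum rather than the audience, here is a small sketch. The Member type, the two maps, and the function name are hypothetical stand-ins for the real Quorum and Audience objects. A read client shows up in the audience but not in the quorum, so the check naturally skips it.

```typescript
interface Member {
	mode: "read" | "write";
}

// Hypothetical helper mirroring the shape of the check in the diff above:
// wait only if the previous clientId is still a quorum member.
function shouldWaitForLeave(
	clientId: string | undefined,
	quorum: ReadonlyMap<string, Member>,
): boolean {
	return clientId !== undefined && quorum.get(clientId) !== undefined;
}

// Example membership: the read client is present in the audience only.
const audience = new Map<string, Member>([
	["reader-1", { mode: "read" }],
	["writer-1", { mode: "write" }],
]);
const quorum = new Map<string, Member>([["writer-1", { mode: "write" }]]);

const waitForReader = shouldWaitForLeave("reader-1", quorum);
const waitForWriter = shouldWaitForLeave("writer-1", quorum);
```

Here waitForReader is false even though "reader-1" is still in the audience, while waitForWriter is true.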

markfields · 2024-05-23T20:25:06Z

packages/loader/container-loader/src/connectionStateHandler.ts

+ // This mimicks check in setConnectionState()
+ // Note that we are not consulting this.handler.shouldClientJoinWrite() here


Suggested change

// This mimicks check in setConnectionState()

// Note that we are not consulting this.handler.shouldClientJoinWrite() here

// This mimics check in setConnectionState(), BUT we are not consulting this.handler.shouldClientJoinWrite() here


Port two ContainerStateHandler PRs to RC4: #21223 #21110

Remove incorrect code that starts leave timer where it's not required

fa132ce

vladsud requested review from markfields and jatgarg May 16, 2024 06:23

github-actions bot added area: loader Loader related issues base: main PRs targeted against main branch labels May 16, 2024

Fix UTs

e80feb0

github-actions bot added the area: tests Tests to add, test infrastructure improvements, etc label May 16, 2024

markfields reviewed May 16, 2024


packages/loader/container-loader/src/connectionStateHandler.ts

UT & fix

c431b68

vladsud changed the title ~~Fix ConnectionStateHandler logic around leave timer~~ Fix ConnectionStateHandler logic around leave timer & stashed ops workflow May 16, 2024

vladsud requested a review from anthony-murphy May 16, 2024 23:14

vladsud marked this pull request as ready for review May 17, 2024 15:42