-
Notifications
You must be signed in to change notification settings - Fork 517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ConnectionStateHandler logic around leave timer & stashed ops workflow #21110
Conversation
⯅ @fluid-example/bundle-size-tests: +42 Bytes
Baseline commit: 7dea35e |
@@ -1139,6 +1143,48 @@ describe("ConnectionStateHandler Tests", () => { | |||
}, | |||
); | |||
|
|||
it("test 'read' reconnect & races ", async () => { | |||
connectionStateHandler = createHandler( | |||
false, // connectedRaisedWhenCaughtUp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the test require this? Better to use the default values if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does - I pass readClientsWaitForJoinSignal = true, while default uses false for both arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like they default to true to me. In createConnectionStateHandler
, for each it is looking for a "Disable" flag, and if not set will yield true.
packages/loader/container-loader/src/test/connectionStateHandler.spec.ts
Show resolved
Hide resolved
packages/loader/container-loader/src/test/connectionStateHandler.spec.ts
Show resolved
Hide resolved
connectionStateHandler.receivedConnectEvent(connectionDetails2); | ||
|
||
// Clear Audience | ||
connectionStateHandler_receivedLeaveSignalEvent(connectionDetails.clientId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary for the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, it represents the removal of this from the audience. Of course.
If this line is removed, then connection 2 does need to wait for connection 1 to leave, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I'm confused 😅 The redundant Join Signal (next line of code) adds the first clientId back to the Audience. I'll go look at the code to see what's going on (started reviewing with tests first)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little nervous about the logic there, given the volatility of leave/join signals. It's quite a rigid state machine, and it may not fit anymore.
Rather than maintaining a state machine based on the sequence of the Join/Leave events (could be ops or signals), can we be quicker to check the audience at key points?
Specifically, the places that currently check if we're "waiting for the leave op" should first check if the old clientId is still in the quorum, and if not cancel the timer. I just don't trust that every client will get the full sequence of leaves and joins.
Definitely some FUD here, open to a reasonable explanation of what guarantees we do have with the signals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm simply simulating how ConnectionStateHandler observes reconnection flow. It will see Audience being fully cleared, and then repopulated with new state. I'm adding comment to clarify that.
Just to make sure we are on the same page: If client was connected as "read", we never wait for "leave" signal for it. So, number of times it shows up again in Audience (because we reconnect to different front-end) should not matter (that's what this PR fixes).
For "write" connections, we make determination (to raise timer and wait for it or not) only on disconnect and stick with this decision. We do not re-evaluate it, and if timer is active, we proceed to next step only when old clientId disappears from quorum.
I think we enforce what you are asking for - see applyForConnectedState():
assert(
!this.waitingForLeaveOp || this.hasMember(this.clientId),
0x2e2 /* "Must only wait for leave message when clientId in quorum" */,
);
and then there is only one place that cancels timer:
private receivedRemoveMemberEvent(clientId: string) {
// If the client which has left was us, then finish the timer.
if (this.clientId === clientId) {
this.prevClientLeftTimer.clear();
this.applyForConnectedState("removeMemberEvent");
}
}
(Ideally it should check also if timer is present, I'll look more into it).
I actually do not see a reason to check at random places if client is still in quorum, as there should be no way for us to miss that event.
Let me know what you think
packages/loader/container-loader/src/test/connectionStateHandler.spec.ts
Outdated
Show resolved
Hide resolved
packages/loader/container-loader/src/test/connectionStateHandler.spec.ts
Show resolved
Hide resolved
connectionStateHandler_receivedJoinSignalEvent(connectionDetails); | ||
connectionStateHandler_receivedJoinSignalEvent(connectionDetails2); | ||
|
||
// It should not wait for leave of connectionDetails.clientId |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure... how do we know we can ignore that duplicate Join? We have no indication it was duplicate.
In this test, the Leave signal triggers this code, right?
this.applyForConnectedState("removeMemberEvent");
So then the concept is that we have cleared that old client, and when the duplicate Join comes in, even though its clientId matches ours, we ignore it, because who cares.
Hmmm maybe it's ok. Weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not "duplicate" from Audience POV - Audience is cleared before we process "initial signals" on a new connection. But yes, same ID "joins" audience again.
There is nothing to ignore here by ConnectionStateHandler - it simple is indifferent to that event.
It's not really "ours", I think that's the key. ConnectionStateHandler should make right calls on loss of connection (RE wait for leave or not), but after that it operates only with pendingClientId.
That said, yes, I agree that applyForConnectedState() overall looks a bit wrong (in terms of else branch).
It's not just applyForConnectedState("removeMemberEvent"), even as unrelated as applyForConnectedState("containerSaved") seems like can trigger "connectedStateRejected" event that looks wrong to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, containerSaved() cancels timeout before calling applyForConnectedState("containerSaved").
Same for receivedRemoveMemberEvent()
And we produce error event only for "timeout", so that looks Ok (i.e. "connectedStateRejected" event is information only, and can show up number of times for same connection).
// IN other words, if connectionStateHandler has access to Quorum early in load sequence, it will see events (in stashed ops mode) | ||
// in the order that is not possible in real life, that it may not expect. | ||
// Ideally, we should supply pendingLocalState?.clientId here as well, not in constructor, but it does not matter (at least today) | ||
this.connectionStateHandler.initProtocol(this.protocolHandler); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, but wanted to think out loud about it too since this kind of move is pretty subtle.
I looked at all the code that runs between the old spot and new spot. The critical piece is obviously replaying stashed ops ("saved ops"), but it also will do ops fetch. Any concerns with delaying this past that point?
What I can think of is that other clients' join/leave ops could come in, but those never matter to ConnectionStateHandler, it is only concerned with its own.
And this function is all about readying the container to be connected, so it makes sense to go here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. There is some risk involved here, and to be honest - I'd rather look (over time) at keeping this call where it was, but change when / how we communicate pendingLocalState.clientId to connectionStateHandler (make it know about such clientId much later in load sequence). I'll look more into it, maybe it's not that hard - not sure.
// we might wait even if we could avoid such wait. | ||
if ( | ||
this._clientId !== undefined && | ||
protocol.quorum?.getMember(this._clientId) !== undefined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're checking quorum instead of membership (could be audience), because if this._clientId
was a read connection we don't care -- right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrent. And it's equivalent to how we do it in main flow.
// This mimicks check in setConnectionState() | ||
// Note that we are not consulting this.handler.shouldClientJoinWrite() here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This mimicks check in setConnectionState() | |
// Note that we are not consulting this.handler.shouldClientJoinWrite() here | |
// This mimics check in setConnectionState(), BUT we are not consulting this.handler.shouldClientJoinWrite() here |
…kflow (microsoft#21110) 1. Add UT case that catches the problem (if no other changes are made). 2. Remove incorrect code that starts leave timer where it's not required (ConnectionStateHandler.ts) 3. Suggests possible fix (Container.ts) for stashed ops worfklow. It's a bit ugly. Happy to hear better ideas. Overall note: I think that's not the only shoe to drop. Exposing clientId from previous client that early in boot sequence, and essentially making some events go backwards in time (addMember / removeMember events in Quorum fire after we have established clientId) may be problematic in other areas, including in consumers of FF. In other words, users of the system may not expect such reverse flow of events. In places where they run into it, they may not understand why and how it happens, and may produce incorrect changes to deal with them, or reduce invariant checks and thus reduce code integrity.
Overall note: I think that's not the only shoe to drop. Exposing clientId from previous client that early in boot sequence, and essentially making some events go backwards in time (addMember / removeMember events in Quorum fire after we have established clientId) may be problematic in other areas, including in consumers of FF. In other words, users of the system may not expect such reverse flow of events. In places where they run into it, they may not understand why and how it happens, and may produce incorrect changes to deal with them, or reduce invariant checks and thus reduce code integrity.