Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ConnectionStateHandler logic around leave timer & stashed ops workflow #21110

Merged
merged 6 commits into from
May 23, 2024

Conversation

vladsud
Copy link
Contributor

@vladsud vladsud commented May 16, 2024

  1. Add UT case that catches the problem (if no other changes are made).
  2. Remove incorrect code that starts leave timer where it's not required (ConnectionStateHandler.ts)
  3. Suggests possible fix (Container.ts) for stashed ops worfklow. It's a bit ugly. Happy to hear better ideas.

Overall note: I think that's not the only shoe to drop. Exposing clientId from previous client that early in boot sequence, and essentially making some events go backwards in time (addMember / removeMember events in Quorum fire after we have established clientId) may be problematic in other areas, including in consumers of FF. In other words, users of the system may not expect such reverse flow of events. In places where they run into it, they may not understand why and how it happens, and may produce incorrect changes to deal with them, or reduce invariant checks and thus reduce code integrity.

@github-actions github-actions bot added area: loader Loader related issues base: main PRs targeted against main branch labels May 16, 2024
@github-actions github-actions bot added the area: tests Tests to add, test infrastructure improvements, etc label May 16, 2024
@vladsud vladsud changed the title Fix ConnectionStateHandler logic around leave timer Fix ConnectionStateHandler logic around leave timer & stashed ops workflow May 16, 2024
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented May 16, 2024

@fluid-example/bundle-size-tests: +42 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 453.34 KB 453.34 KB No change
azureClient.js 552.54 KB 552.55 KB +12 Bytes
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 257.02 KB 257.02 KB No change
fluidFramework.js 359.88 KB 359.88 KB No change
loader.js 134.34 KB 134.36 KB +18 Bytes
map.js 41.53 KB 41.53 KB No change
matrix.js 143.75 KB 143.75 KB No change
odspClient.js 520.9 KB 520.91 KB +12 Bytes
odspDriver.js 97.3 KB 97.3 KB No change
odspPrefetchSnapshot.js 42.16 KB 42.16 KB No change
sharedString.js 160.27 KB 160.27 KB No change
sharedTree.js 359.86 KB 359.86 KB No change
Total Size 3.2 MB 3.2 MB +42 Bytes

Baseline commit: 7dea35e

Generated by 🚫 dangerJS against d203b90

@vladsud vladsud marked this pull request as ready for review May 17, 2024 15:42
@@ -1139,6 +1143,48 @@ describe("ConnectionStateHandler Tests", () => {
},
);

it("test 'read' reconnect & races ", async () => {
connectionStateHandler = createHandler(
false, // connectedRaisedWhenCaughtUp,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the test require this? Better to use the default values if possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does - I pass readClientsWaitForJoinSignal = true, while default uses false for both arguments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like they default to true to me. In createConnectionStateHandler, for each it is looking for a "Disable" flag, and if not set will yield true.

connectionStateHandler.receivedConnectEvent(connectionDetails2);

// Clear Audience
connectionStateHandler_receivedLeaveSignalEvent(connectionDetails.clientId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary for the test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, it represents the removal of this from the audience. Of course.

If this line is removed, then connection 2 does need to wait for connection 1 to leave, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'm confused 😅 The redundant Join Signal (next line of code) adds the first clientId back to the Audience. I'll go look at the code to see what's going on (started reviewing with tests first)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little nervous about the logic there, given the volatility of leave/join signals. It's quite a rigid state machine, and it may not fit anymore.

Rather than maintaining a state machine based on the sequence of the Join/Leave events (could be ops or signals), can we be quicker to check the audience at key points?

Specifically, the places that currently check if we're "waiting for the leave op" should first check if the old clientId is still in the quorum, and if not cancel the timer. I just don't trust that every client will get the full sequence of leaves and joins.

Definitely some FUD here, open to a reasonable explanation of what guarantees we do have with the signals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm simply simulating how ConnectionStateHandler observes reconnection flow. It will see Audience being fully cleared, and then repopulated with new state. I'm adding comment to clarify that.

Just to make sure we are on the same page: If client was connected as "read", we never wait for "leave" signal for it. So, number of times it shows up again in Audience (because we reconnect to different front-end) should not matter (that's what this PR fixes).

For "write" connections, we make determination (to raise timer and wait for it or not) only on disconnect and stick with this decision. We do not re-evaluate it, and if timer is active, we proceed to next step only when old clientId disappears from quorum.

I think we enforce what you are asking for - see applyForConnectedState():

		assert(
			!this.waitingForLeaveOp || this.hasMember(this.clientId),
			0x2e2 /* "Must only wait for leave message when clientId in quorum" */,
		);

and then there is only one place that cancels timer:

	private receivedRemoveMemberEvent(clientId: string) {
		// If the client which has left was us, then finish the timer.
		if (this.clientId === clientId) {
			this.prevClientLeftTimer.clear();
			this.applyForConnectedState("removeMemberEvent");
		}
	}

(Ideally it should check also if timer is present, I'll look more into it).

I actually do not see a reason to check at random places if client is still in quorum, as there should be no way for us to miss that event.

Let me know what you think

connectionStateHandler_receivedJoinSignalEvent(connectionDetails);
connectionStateHandler_receivedJoinSignalEvent(connectionDetails2);

// It should not wait for leave of connectionDetails.clientId
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure... how do we know we can ignore that duplicate Join? We have no indication it was duplicate.

In this test, the Leave signal triggers this code, right?

this.applyForConnectedState("removeMemberEvent");

So then the concept is that we have cleared that old client, and when the duplicate Join comes in, even though its clientId matches ours, we ignore it, because who cares.

Hmmm maybe it's ok. Weird.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not "duplicate" from Audience POV - Audience is cleared before we process "initial signals" on a new connection. But yes, same ID "joins" audience again.

There is nothing to ignore here by ConnectionStateHandler - it simple is indifferent to that event.
It's not really "ours", I think that's the key. ConnectionStateHandler should make right calls on loss of connection (RE wait for leave or not), but after that it operates only with pendingClientId.

That said, yes, I agree that applyForConnectedState() overall looks a bit wrong (in terms of else branch).
It's not just applyForConnectedState("removeMemberEvent"), even as unrelated as applyForConnectedState("containerSaved") seems like can trigger "connectedStateRejected" event that looks wrong to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, containerSaved() cancels timeout before calling applyForConnectedState("containerSaved").
Same for receivedRemoveMemberEvent()
And we produce error event only for "timeout", so that looks Ok (i.e. "connectedStateRejected" event is information only, and can show up number of times for same connection).

// IN other words, if connectionStateHandler has access to Quorum early in load sequence, it will see events (in stashed ops mode)
// in the order that is not possible in real life, that it may not expect.
// Ideally, we should supply pendingLocalState?.clientId here as well, not in constructor, but it does not matter (at least today)
this.connectionStateHandler.initProtocol(this.protocolHandler);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense, but wanted to think out loud about it too since this kind of move is pretty subtle.

I looked at all the code that runs between the old spot and new spot. The critical piece is obviously replaying stashed ops ("saved ops"), but it also will do ops fetch. Any concerns with delaying this past that point?

What I can think of is that other clients' join/leave ops could come in, but those never matter to ConnectionStateHandler, it is only concerned with its own.

And this function is all about readying the container to be connected, so it makes sense to go here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. There is some risk involved here, and to be honest - I'd rather look (over time) at keeping this call where it was, but change when / how we communicate pendingLocalState.clientId to connectionStateHandler (make it know about such clientId much later in load sequence). I'll look more into it, maybe it's not that hard - not sure.

// we might wait even if we could avoid such wait.
if (
this._clientId !== undefined &&
protocol.quorum?.getMember(this._clientId) !== undefined
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're checking quorum instead of membership (could be audience), because if this._clientId was a read connection we don't care -- right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrent. And it's equivalent to how we do it in main flow.

@vladsud vladsud merged commit 7cba8d9 into microsoft:main May 23, 2024
30 checks passed
@vladsud vladsud deleted the ReconnectBug branch May 23, 2024 20:24
Comment on lines +740 to +741
// This mimicks check in setConnectionState()
// Note that we are not consulting this.handler.shouldClientJoinWrite() here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This mimicks check in setConnectionState()
// Note that we are not consulting this.handler.shouldClientJoinWrite() here
// This mimics check in setConnectionState(), BUT we are not consulting this.handler.shouldClientJoinWrite() here

vladsud added a commit to vladsud/FluidFramework that referenced this pull request May 23, 2024
…kflow (microsoft#21110)

1. Add UT case that catches the problem (if no other changes are made).
2. Remove incorrect code that starts leave timer where it's not required (ConnectionStateHandler.ts)
3. Suggests possible fix (Container.ts) for stashed ops worfklow. It's a bit ugly. Happy to hear better ideas.

Overall note: I think that's not the only shoe to drop. Exposing clientId from previous client that early in boot sequence, and essentially making some events go backwards in time (addMember / removeMember events in Quorum fire after we have established clientId) may be problematic in other areas, including in consumers of FF. In other words, users of the system may not expect such reverse flow of events. In places where they run into it, they may not understand why and how it happens, and may produce incorrect changes to deal with them, or reduce invariant checks and thus reduce code integrity.
vladsud added a commit that referenced this pull request May 24, 2024
Port two ContainerStateHandler PRs to RC4:

#21223
#21110
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: loader Loader related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants