Run full GC only for auto recovery scenarios #21134

agarwal-navin · 2024-05-16T23:38:06Z

Currently, GC runs full GC more often than necessary, which also forces summarize to run with fullTree, and that is expensive. This PR updates the logic so that full GC / full tree summary happens only when needed. It also removes the option to enable / disable the GC mark phase via feature flags.
Here is the list of scenarios where full GC / full tree summary runs today, and how this PR changes each one:

Scenario	Before this change	With this change
A container's base snapshot doesn't have any GC state and GC is enabled.	Full GC and full tree summary would run.	GC cannot be enabled for documents that had it disabled.
A container has GC enabled via metadata, but it is disabled via RunGC test config.	Full GC and full tree summary would run.	GC cannot be disabled for documents that had it enabled.
A container has GC disabled via metadata, but it is enabled via RunGC test config.	Full GC and full tree summary would run.	GC cannot be enabled for documents that had it disabled.
Base snapshot's GC version is newer than current GC version	Full GC and full tree summary would run.	GC data will be re-generated. Summary will be regenerated only for nodes whose reference state changed.
Base snapshot's GC version is older than current GC version	Full GC and full tree summary would run. GC would be disabled.	GC data will be re-generated. Summary will be regenerated only for nodes whose reference state changed.
Auto-recovery for tombstone errors	Full GC and full tree summary would run.	GC data will be re-generated. Summary will be regenerated only for nodes whose reference state changed.
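
The "With this change" column can be read as a small decision function. The sketch below is a simplified model of that column only. Every name in it (`BaseSnapshotInfo`, `GCPlan`, `planGC`, `regenerateGCData`) is hypothetical and not taken from the container-runtime source, and it assumes that a version mismatch in either direction is treated the same way, as the table suggests.

```typescript
// Hypothetical, simplified model of the "With this change" column above.
// None of these names come from the Fluid Framework source.
interface BaseSnapshotInfo {
	// Whether the document was created with GC enabled (assumed to live in metadata).
	gcEnabledInMetadata: boolean;
	// GC version recorded in the base snapshot, if any.
	gcVersion: number | undefined;
}

interface GCPlan {
	runGC: boolean;
	// True when GC data should be regenerated instead of reusing the base snapshot's data.
	regenerateGCData: boolean;
}

function planGC(
	base: BaseSnapshotInfo,
	currentGCVersion: number,
	autoRecoveryRequested: boolean,
): GCPlan {
	// Enablement is fixed by the document; a session-level config cannot flip it.
	if (!base.gcEnabledInMetadata) {
		return { runGC: false, regenerateGCData: false };
	}
	// Newer or older base version, or a tombstone auto-recovery request:
	// regenerate GC data. No full tree summary is forced in any of these cases.
	const versionMismatch = base.gcVersion !== currentGCVersion;
	return { runGC: true, regenerateGCData: versionMismatch || autoRecoveryRequested };
}
```

In this model nothing ever requests a full tree summary; which nodes get resummarized is left to the per-node reference-state check discussed in the review threads.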

AB#8037

msfluid-bot · 2024-05-20T22:56:22Z

⯆ @fluid-example/bundle-size-tests: -3.41 KB

Metric Name	Baseline Size	Compare Size	Size Diff
aqueduct.js	453.96 KB	453.11 KB	⯆ -872 Bytes
azureClient.js	551.21 KB	550.36 KB	⯆ -872 Bytes
connectionState.js	680 Bytes	680 Bytes	■ No change
containerRuntime.js	257.82 KB	256.97 KB	⯆ -871 Bytes
fluidFramework.js	357 KB	357 KB	■ No change
loader.js	132.91 KB	132.91 KB	■ No change
map.js	41.43 KB	41.43 KB	■ No change
matrix.js	143.66 KB	143.66 KB	■ No change
odspClient.js	519.75 KB	518.9 KB	⯆ -872 Bytes
odspDriver.js	97.3 KB	97.3 KB	■ No change
odspPrefetchSnapshot.js	42.16 KB	42.16 KB	■ No change
sharedString.js	160.18 KB	160.18 KB	■ No change
sharedTree.js	356.98 KB	356.98 KB	■ No change
Total Size	3.19 MB	3.19 MB	⯆ -3.41 KB

Baseline commit: 7dcb473

Generated by 🚫 dangerJS against c6e105b

markfields · 2024-05-21T03:33:00Z

packages/runtime/container-runtime/src/gc/garbageCollection.ts

@@ -374,7 +371,7 @@ export class GarbageCollector implements IGarbageCollector {
 return;


I think initializeOrUpdateGCState should early return if it's not enabled, to avoid all the logging on "connected"

(this comment is pinned to a random line, not related to this point)

initializeOrUpdateGCState is called from setConnectionState which already has this check.

markfields · 2024-05-21T03:34:37Z

packages/runtime/container-runtime/src/gc/garbageCollection.ts

@@ -855,7 +849,7 @@ export class GarbageCollector implements IGarbageCollector {
 trackState: boolean,
 telemetryContext?: ITelemetryContext,
 ): ISummarizeResult | undefined {
- if (!this.configs.shouldRunGC || this.gcDataFromLastRun === undefined) {
+ if (!this.shouldRunGC || this.gcDataFromLastRun === undefined) {


This is fine (more future-proof) but I think we only need to check this.gcDataFromLastRun, right?

Yes, you are right.

markfields · 2024-05-21T03:38:20Z

packages/runtime/container-runtime/src/summary/summarizerNode/summarizerNodeWithGc.ts

@@ -515,7 +513,7 @@ export class SummarizerNodeWithGC extends SummarizerNode implements IRootSummari
 * was previously used and became unused (or vice versa), its used state has changed.
 */
 private hasUsedStateChanged(): boolean {
- // If GC is disabled, we are not tracking used state, return false.
+ // If GC is disabled, it should not affect summary state, return false.


Nice, this got back to always false.

markfields · 2024-05-21T03:44:30Z

packages/runtime/container-runtime/src/test/gc/gcConfigs.spec.ts

 assert(gc.configs.sweepEnabled, "sweepEnabled incorrect");
 assert.equal(gc.configs.shouldRunSweep, "NO", "shouldRunSweep incorrect");
 assert(
 gc.configs.sessionExpiryTimeoutMs === undefined,
 "sessionExpiryTimeoutMs incorrect",
 );
 assert(gc.configs.tombstoneTimeoutMs === undefined, "tombstoneTimeoutMs incorrect");
- assert.equal(


Want to add an assert on gc.configs.gcVersionInBaseSnapshot instead?

Added to this and other places where I removed this check.

markfields · 2024-05-21T03:45:52Z

packages/runtime/container-runtime/src/test/gc/gcConfigs.spec.ts

- );
- });
-
- it("shouldRunGC should be false when gcVersionInEffect is older than gcVersionInBaseSnapshot", () => {


Shouldn't we leave this and just flip false to true?

You are right. Added it back.

markfields · 2024-05-21T03:46:09Z

packages/runtime/container-runtime/src/test/gc/gcConfigs.spec.ts

@@ -762,72 +682,72 @@ describe("Garbage Collection configurations", () => {
 });
 describe("shouldRunSweep", () => {
 const testCases: {
- shouldRunGC: boolean;
+ gcEnabled: boolean;


nit: rename to gcEnabled_doc

markfields · 2024-05-21T03:48:42Z

packages/runtime/container-runtime/src/test/gc/gcConfigs.spec.ts

 sweepEnabled_doc: true,
 sweepEnabled_session: true,
 disableDataStoreSweep: "viaConfigProvider",
 expectedShouldRunSweep: "ONLY_BLOBS",
 },
 {
- shouldRunGC: true,
+ gcEnabled: true,
 sweepEnabled_doc: true,
 sweepEnabled_session: true,
 shouldRunSweep: true,


Would you mind renaming this to runSweepOverride or runSweep_config? It's very confusing that this input is called the same name as the output property being examined.

markfields · 2024-05-21T03:50:34Z

packages/runtime/container-runtime/src/test/gc/gcSummaryStateTracker.spec.ts

@@ -22,164 +22,13 @@ type GCSummaryStateTrackerWithPrivates = Omit<GCSummaryStateTracker, "latestSumm
 };

 describe("GCSummaryStateTracker tests", () => {
- describe("Summary state reset", () => {


Love this red. Really nice that the summarizer node hasChanged logic is sufficient for all these cases.

markfields · 2024-05-21T03:54:13Z

packages/test/test-end-to-end-tests/src/test/gc/gcVersionUpdate.spec.ts

@@ -63,6 +64,7 @@ describeCompat("GC version update", "NoCompat", (getTestObjectProvider, apis) =>
 let dataStore1Id: string;
 let dataStore2Id: string;
 let dataStore3Id: string;
+ let baseGCDetailsSpy: Sinon.SinonSpy;


markfields · 2024-05-21T03:58:03Z

packages/runtime/container-runtime/src/test/summarizerNodeWithGc.spec.ts

+ const baseGCDetails: IGarbageCollectionDetailsBase = {
+ gcData: {
+ gcNodes: {},
+ },


maybe explicitly say usedRoutes: undefined

markfields · 2024-05-21T03:59:59Z

packages/test/test-end-to-end-tests/src/test/gc/gcVersionUpdate.spec.ts

+ */
+ async function summarizeAndValidateGCStateReset(summarizer: ISummarizer) {
+ const containerRuntime = (summarizer as any).runtime as IContainerRuntimeWithPrivates;
+ const spy = sandbox.spy(containerRuntime.garbageCollector, "getBaseGCDetails");


What does the sandbox object do exactly? Just lets you spy on stuff?

I think so. That among other things. I just copied the pattern from other tests.
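
For readers with the same question: the `Sinon.SinonSpy` type in this test suggests `sandbox` is a Sinon sandbox, which groups spies and stubs so they can all be restored together. The sketch below is a hand-rolled stand-in that illustrates the idea. It is not Sinon's implementation or API, and `MiniSandbox` and `MethodSpy` are hypothetical names.

```typescript
// Hand-rolled stand-in for illustration only; this is not Sinon's implementation or API.
interface MethodSpy {
	callCount: number;
	returnValues: unknown[];
	restore(): void;
}

class MiniSandbox {
	private readonly spies: MethodSpy[] = [];

	// Wraps target[method] so calls are recorded while the original behavior is preserved.
	public spy<T extends object>(target: T, method: keyof T): MethodSpy {
		const originalValue = target[method];
		if (typeof originalValue !== "function") {
			throw new Error(`${String(method)} is not a function`);
		}
		const originalFn = originalValue as unknown as (...args: unknown[]) => unknown;
		const record: MethodSpy = {
			callCount: 0,
			returnValues: [],
			restore: () => {
				// Put the untouched original back on the object.
				target[method] = originalValue;
			},
		};
		const wrapper = function (this: unknown, ...args: unknown[]): unknown {
			record.callCount++;
			const result = originalFn.apply(this, args);
			record.returnValues.push(result);
			return result;
		};
		target[method] = wrapper as unknown as T[keyof T];
		this.spies.push(record);
		return record;
	}

	// Restores every spied method in one call, typically from an afterEach hook.
	public restore(): void {
		for (const spy of this.spies) {
			spy.restore();
		}
		this.spies.length = 0;
	}
}
```

The recorded return values are what a test like this one can inspect to check what `getBaseGCDetails` produced, without changing how the method behaves.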

markfields · 2024-05-21T04:09:13Z

packages/test/test-end-to-end-tests/src/test/gc/gcVersionUpdate.spec.ts

- dataStoresAsHandles,
- true /* gcEnabled */,
- );
+ // Validate that the GC state is reset in the base GC details.


I remember you talking about this. Let me make sure I got it -

The tests used to validate that no handles were used, because fullGC/fullTree summary was expected. Now we certainly don't force fullGC/fullTree, trusting that any node with a change to used routes will be resummarized based on summarizerNodeWithGc.hasChanged.

Now, since we don't force all nodes to regenerate their summary tree, you're asserting the more direct outcome that the GC state was reset in the new Summary (by forcing it to summarize to trigger GC init and then checking what baseGCDetails in GC was).

Is it true that you also could assert that whichever dataStores had usedRoutes before would be Trees and the rest would be Handles? Not suggesting it, just checking my understanding.

Please correct me where I'm wrong!

Yes, that is correct. Basically, there are 2 main points that ensure the right things will happen:

1. GC must re-run - For this, base GC state should be initialized as empty. This means all nodes start with empty GC data. So, next time GC runs, it will have to regenerate the GC data.

2. Summary for nodes with incorrect reference state must be regenerated - After step 1 above, GC data is regenerated. If any node has its used routes (aka reference state) changed, its summary will be regenerated because hasChanged will return true.

The tests validate point 1 above, and point 2 is validated in unit tests. It's very hard to validate point 2 here because it would require changing the GC data generated by a node between 2 summaries without actually sending any ops (otherwise that will trigger regeneration), because that's the real world scenario we are targeting. I could hack around and do it but it's not necessary.
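
The second point can be illustrated with a small model: compare each node's used routes from the last summary with the ones a fresh GC run produces, and resummarize only the nodes where they differ. This is a sketch under assumptions. `usedStateChanged` and `nodesToResummarize` are hypothetical names, and the real check in `SummarizerNodeWithGC` may compare reference state differently.

```typescript
// Hypothetical, simplified model; not the actual SummarizerNodeWithGC logic.
type UsedRoutesByNode = Record<string, readonly string[]>;

// A node's used state changed if its set of used routes differs between the two runs.
function usedStateChanged(
	before: readonly string[] | undefined,
	after: readonly string[] | undefined,
): boolean {
	const a = [...(before ?? [])].sort();
	const b = [...(after ?? [])].sort();
	return a.length !== b.length || a.some((route, i) => route !== b[i]);
}

// Returns the ids of nodes whose summary would need to be regenerated.
// Every other node could keep being summarized as a handle to its previous summary.
function nodesToResummarize(
	lastSummary: UsedRoutesByNode,
	regenerated: UsedRoutesByNode,
): string[] {
	const ids = Array.from(
		new Set([...Object.keys(lastSummary), ...Object.keys(regenerated)]),
	);
	return ids.filter((id) => usedStateChanged(lastSummary[id], regenerated[id])).sort();
}
```

If regenerated GC data matches what the last summary recorded, the list is empty and no node is forced to resummarize, which is the saving over a full tree summary.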

markfields · 2024-05-21T04:10:06Z

packages/test/test-end-to-end-tests/src/test/gc/gcVersionUpdate.spec.ts

@@ -98,14 +100,27 @@ describeCompat("GC version update", "NoCompat", (getTestObjectProvider, apis) =>
 return summaryResult.summaryVersion;
 }

+ /**
+ * Generates a summary and validates that the GC state is reset (empty) in the base GC details. This will ensure


in the base GC details

This is about the snapshot loaded from, not the result of the summary here, right? The summarize just serves to ensure GC is initialized?

Yes, updated the comment to make that clear.

agarwal-navin requested a review from markfields May 16, 2024 23:38

github-actions bot added the area: runtime, public api change, and base: main labels May 16, 2024

markfields reviewed May 17, 2024

markfields reviewed May 17, 2024

agarwal-navin commented May 17, 2024

agarwal-navin added 3 commits May 17, 2024 22:40

Run full GC only for auto recovery scenarios

fe4c4ec

Handle gc disabled scenario

ad22c0e

Summarizer node tests

d95d791

agarwal-navin requested review from pragya91 and a team May 17, 2024 22:56

agarwal-navin force-pushed the fixFullGC branch from a114dd5 to d95d791 Compare May 17, 2024 23:19

Update end-to-end test

7423aa2

github-actions bot added the area: tests label May 20, 2024

markfields reviewed May 20, 2024

markfields reviewed May 20, 2024

Remove option to disable GC

59ab6db

markfields reviewed May 21, 2024

markfields approved these changes May 21, 2024

PR comments 3

c6e105b

agarwal-navin requested review from tyler-cai-microsoft and a team May 21, 2024 18:15

agarwal-navin merged commit cc3ac0a into microsoft:main May 21, 2024
30 checks passed
