Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run full GC only for auto recovery scenarios #21134

Merged
merged 6 commits into from
May 21, 2024

Conversation

agarwal-navin
Copy link
Contributor

@agarwal-navin agarwal-navin commented May 16, 2024

Currently, GC runs will fullGC a little often which also results in running summarize with fullTree which is expensive. This PR updates the logic to ensure that full GC / full tree summary will only happen if needed. It also removes the option to enable / disable GC mark phase via feature flags.
Here is the list of scenarios where full GC / full summary runs today and what is the change in this PR:

Scenario Before this change With this change
A container's base snapshot doesn't have any GC state and GC is enabled. Full GC and full tree summary would run. GC cannot be enabled for documents that had it disabled.
A container has GC enabled via metadata, but it is disabled via RunGC test config. Full GC and full tree summary would run. GC cannot be disabled for documents that had it enabled.
A container has GC disabled via metadata, but it is enabled via RunGC test config. Full GC and full tree summary would run. GC cannot be disabled for documents that had it enabled.
Base snapshot's GC version is newer than current GC version Full GC and full tree summary would run. GC data will be re-generated. Summary will be regenerated only for nodes whose reference state changed.
Base snapshot's GC version is older than current GC version Full GC and full tree summary would run. GC would be disabled. GC data will be re-generated. Summary will be regenerated only for nodes whose reference state changed.
Auto-recovery for tombstone errors Full GC and full tree summary would run. GC data will be re-generated. Summary will be regenerated only for nodes whose reference state changed.

AB#8037

@github-actions github-actions bot added area: runtime Runtime related issues public api change Changes to a public API base: main PRs targeted against main branch labels May 16, 2024
@github-actions github-actions bot added the area: tests Tests to add, test infrastructure improvements, etc label May 20, 2024
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented May 20, 2024

@fluid-example/bundle-size-tests: -3.41 KB
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 453.96 KB 453.11 KB -872 Bytes
azureClient.js 551.21 KB 550.36 KB -872 Bytes
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 257.82 KB 256.97 KB -871 Bytes
fluidFramework.js 357 KB 357 KB No change
loader.js 132.91 KB 132.91 KB No change
map.js 41.43 KB 41.43 KB No change
matrix.js 143.66 KB 143.66 KB No change
odspClient.js 519.75 KB 518.9 KB -872 Bytes
odspDriver.js 97.3 KB 97.3 KB No change
odspPrefetchSnapshot.js 42.16 KB 42.16 KB No change
sharedString.js 160.18 KB 160.18 KB No change
sharedTree.js 356.98 KB 356.98 KB No change
Total Size 3.19 MB 3.19 MB -3.41 KB

Baseline commit: 7dcb473

Generated by 🚫 dangerJS against c6e105b

@@ -374,7 +371,7 @@ export class GarbageCollector implements IGarbageCollector {
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think initializeOrUpdateGCState should early return if it's not enabled, to avoid all the logging on "connected"

(this comment is pinned to a random line, not related to this point)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initializeOrUpdateGCState is called from setConnectionState which already has this check.

@@ -855,7 +849,7 @@ export class GarbageCollector implements IGarbageCollector {
trackState: boolean,
telemetryContext?: ITelemetryContext,
): ISummarizeResult | undefined {
if (!this.configs.shouldRunGC || this.gcDataFromLastRun === undefined) {
if (!this.shouldRunGC || this.gcDataFromLastRun === undefined) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine (more future-proof) but I think we only need to check this.gcDataFromLastRun, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right.

@@ -515,7 +513,7 @@ export class SummarizerNodeWithGC extends SummarizerNode implements IRootSummari
* was previously used and became unused (or vice versa), its used state has changed.
*/
private hasUsedStateChanged(): boolean {
// If GC is disabled, we are not tracking used state, return false.
// If GC is disabled, it should not affect summary state, return false.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this got back to always false.

assert(gc.configs.sweepEnabled, "sweepEnabled incorrect");
assert.equal(gc.configs.shouldRunSweep, "NO", "shouldRunSweep incorrect");
assert(
gc.configs.sessionExpiryTimeoutMs === undefined,
"sessionExpiryTimeoutMs incorrect",
);
assert(gc.configs.tombstoneTimeoutMs === undefined, "tombstoneTimeoutMs incorrect");
assert.equal(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to add an assert on gc.configs.gcVersionInBaseSnapshot instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to this and other places where I removed this check.

);
});

it("shouldRunGC should be false when gcVersionInEffect is older than gcVersionInBaseSnapshot", () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we leave this and just flip false to true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Added it back.

@@ -762,72 +682,72 @@ describe("Garbage Collection configurations", () => {
});
describe("shouldRunSweep", () => {
const testCases: {
shouldRunGC: boolean;
gcEnabled: boolean;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename to gcEnabled_doc

sweepEnabled_doc: true,
sweepEnabled_session: true,
disableDataStoreSweep: "viaConfigProvider",
expectedShouldRunSweep: "ONLY_BLOBS",
},
{
shouldRunGC: true,
gcEnabled: true,
sweepEnabled_doc: true,
sweepEnabled_session: true,
shouldRunSweep: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind renaming this to runSweepOverride or runSweep_config? It's very confusing that this input is called the same name as the output property being examined.

@@ -22,164 +22,13 @@ type GCSummaryStateTrackerWithPrivates = Omit<GCSummaryStateTracker, "latestSumm
};

describe("GCSummaryStateTracker tests", () => {
describe("Summary state reset", () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this red. Really nice that the summarizer node hasChanged logic is sufficient for all these cases.

@@ -63,6 +64,7 @@ describeCompat("GC version update", "NoCompat", (getTestObjectProvider, apis) =>
let dataStore1Id: string;
let dataStore2Id: string;
let dataStore3Id: string;
let baseGCDetailsSpy: Sinon.SinonSpy;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

const baseGCDetails: IGarbageCollectionDetailsBase = {
gcData: {
gcNodes: {},
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe explicitly say usedRoutes: undefined

*/
async function summarizeAndValidateGCStateReset(summarizer: ISummarizer) {
const containerRuntime = (summarizer as any).runtime as IContainerRuntimeWithPrivates;
const spy = sandbox.spy(containerRuntime.garbageCollector, "getBaseGCDetails");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the sandbox object do exactly? Just lets you spy on stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. That among other things. I just copied the pattern from other tests.

dataStoresAsHandles,
true /* gcEnabled */,
);
// Validate that the GC state is reset in the base GC details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember you talking about this. Let me make sure I got it -

The tests used to validate that no handles were used, because fullGC/fullTree summary was expected. Now we certainly don't force fullGC/fullTree, trusting that any node with a change to used routes will be resummarized based on summarizerNodeWithGc.hasChanged.

Now since we don't force all nodes to regenerate summary tree, you're asserting the more direct outcome that the GC state was reset in the new Summary (by forcing it to summarize to trigger GC init and then checking was baseGCDetails in GC was).

Is it true that you also could assert that whichever dataStores had usedRoutes before would be Trees and the rest would be Handles? Not suggesting it, just checking my understanding.

Please correct me where I'm wrong!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. Basically, there are 2 main points that ensure right things will happen:

  1. GC must re-run - For this, base GC state should be initialized as empty. This means all nodes start with empty GC data. So, next time GC runs, it will have to regenerate the GC data.
  2. Summary for nodes with incorrect reference state must be regenerated - After step 1 above, GC data is regenerated. If any node has its used routes (aka reference state) changed, it's summary will be regenerated because hasChanged will return true.

The tests validate # 1 above and # 2 is validated in unit tests. It's very hard to validate # 2 here because it will require changing the GC data generated by a node between 2 summaries without actually sending any ops (or that will trigger regeneration) because that's the real world scenario we are targeting. I could hack around and do it but it's not necessary.

@@ -98,14 +100,27 @@ describeCompat("GC version update", "NoCompat", (getTestObjectProvider, apis) =>
return summaryResult.summaryVersion;
}

/**
* Generates a summary and validates that the GC state is reset (empty) in the base GC details. This will ensure
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the base GC details

This is about the snapshot loaded from, not the result of the summary here, right? The summarize just serves to ensure GC is initailized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, updated the comment to make that clear.

@agarwal-navin agarwal-navin requested review from tyler-cai-microsoft and a team May 21, 2024 18:15
@agarwal-navin agarwal-navin merged commit cc3ac0a into microsoft:main May 21, 2024
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch public api change Changes to a public API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants