[Question] Drop in performance after updating to 9.1.1799.9590 #1441

flower7434 · 2023-06-14T06:52:14Z

After updating from 9.0.1028.9590 to 9.1.1799.9590 we have seen a huge drop in performance with persisted actors located on data disks. IOPS usage increased roughly tenfold. A rollback solved the problem. We have seen this on more than one cluster.

Any suggestions on how to solve this?

flower7434 · 2023-06-20T11:37:38Z

This seems to have started in version 9.0.1048.9590. The comment for the Failover Manager cache bug fix sounds suspicious.

Fix: Add cleanup logic that purges all stale entries keeping the load cache small and predictable. https://github.com/microsoft/service-fabric/blob/master/release_notes/Service_Fabric_ReleaseNotes_90CU2.md

flower7434 · 2023-10-18T17:04:56Z

So it seems like the last working version of Service Fabric in Azure is about to expire. No ideas on this? The attached graphs show the cluster before the upgrade, after the upgrade, and after the rollback. The upgrade breaks the cluster completely.

mfmadsen · 2023-10-19T23:18:13Z

We use this for production and have not experienced issues like this as far as I know. I will do some investigation to double check though. Might be that we just haven’t noticed although based on your graphs it seems like something we would notice as performance seems to take quite a hit.

mfmadsen · 2023-10-19T23:20:49Z

Alright, we are using version 9.1.1583.9590 in production. Our dev environment is using 10.x so will investigate further.

mfmadsen · 2023-10-19T23:23:41Z

@FredrikDahlberg do you know whether this is reproducible on a dev cluster?
Are you running a Windows or Linux cluster?

mfmadsen · 2023-10-20T00:22:38Z

Hi again @FredrikDahlberg. We are not using data disks, so this issue most likely is not impacting us.

flemmade · 2023-11-21T12:46:54Z

Hi, sorry if this comment comes too late (and also typing this from a phone so there might be some typos and ugly formatting)

We had a similar issue when migrating our cluster from version 9.0.1017.9590 to 9.1.1833.9590: CPU and disk usage on the cluster skyrocketed.
It was only observed in services where we used SF actors.

After a lot of exchanges with Microsoft, we were finally able to pinpoint that they had added a way to configure the actors' defragmentation frequency and, in the process, drastically changed its default value.

It went from once per day to once every 30 minutes, and the documentation did not call out how big a change that is.
https://github.com/microsoft/service-fabric/blob/master/release_notes/Service_Fabric_ReleaseNotes_91CU3.md#service-fabric-feature-and-bug-fixes

Adding a new parameter MaxDefragFrequencyInMinutes in our actors' Settings.xml under a <MyActorServiceLocalConfigStore> section, and setting its value to 1440 (value for versions < 9.1) solved the problem for us.
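
For anyone who wants to try this, a minimal sketch of what the override might look like is below. It assumes the standard Service Fabric Settings.xml layout (`Section` / `Parameter` elements). The section name is copied from the description above, where `MyActorService` stands in for your own actor service name, so check it against the release notes for your version before relying on it.

```xml
<?xml version="1.0" encoding="utf-8"?>
<Settings xmlns="http://schemas.microsoft.com/2011/01/fabric">
  <!-- Section name as reported in this comment; "MyActorService" is a
       placeholder for the actual actor service name (assumption). -->
  <Section Name="MyActorServiceLocalConfigStore">
    <!-- 1440 minutes = once per day, the frequency reported above for
         versions before 9.1. The newer default is reported as 30. -->
    <Parameter Name="MaxDefragFrequencyInMinutes" Value="1440" />
  </Section>
</Settings>
```

If the actor's Settings.xml already has other sections (replicator config, for example), the new section goes alongside them inside the same `Settings` element.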

The symptoms are not exactly the same so it might be something else, but hopefully it can lead you or somebody else encountering the issue towards a solution.

flower7434 added the type-code-defect (Something isn't working) label on Jun 14, 2023
