Recommendations to Improve Resiliency during an Availability Zone Outage #379

Open
markti opened this issue May 14, 2024 · 0 comments
Open
Labels
enhancement New feature or request

Comments

@markti
Copy link
Contributor

markti commented May 14, 2024

In our efforts to test Azure Availability Zone resiliency failure modes, we utilized the .NET reliable web application's codebase. We modified the solution to host it on Azure Kubernetes Service (AKS), keeping the same .NET code framework but adding enhanced telemetry to provide deeper insight into how and where requests failed. The testing involved both Redis and SQL Database, and we extended our evaluation to include SQL Managed Instance and Cosmos DB. This document details the findings related to the .NET code implementation and the design changes we made during this process, offering insights into improving the resilience and reliability of the application.

Redis Cache

The ‘AddStackExchangeRedisCache’ extension method configures the IConnectionMultiplexer so that it can be injected when the application uses the IDistributedCache. We have a repository component that is responsible for saving and retrieving cart items from this Distributed Cache.

For the DistributedCache to be connected to Redis, this extension method needs to be executed during Startup’s ConfigureServices stage.
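Roughly, the registration looks like the following (a minimal sketch; the "Redis" connection string name and the "cart:" instance name are placeholders, not the application's actual values):

```csharp
// Startup.ConfigureServices — a minimal sketch. "Redis" and "cart:" are
// placeholder names; Configuration is the usual IConfiguration on Startup.
public void ConfigureServices(IServiceCollection services)
{
    services.AddStackExchangeRedisCache(options =>
    {
        options.Configuration = Configuration.GetConnectionString("Redis");
        options.InstanceName = "cart:";
    });
}
```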

The IConnectionMultiplexer is created using the static Connect method, which takes a ConfigurationOptions object containing the Redis connection configuration.

We currently have AbortOnConnectFail set to false.

There are two types of retry policies that we can assign: Linear and Exponential.
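As an illustrative sketch (the endpoint and timing values below are placeholders), the connection configuration and the choice between the two reconnect retry policies look roughly like this:

```csharp
using StackExchange.Redis;

var options = new ConfigurationOptions
{
    EndPoints = { "example-cache.redis.cache.windows.net:6380" }, // placeholder endpoint
    Ssl = true,
    AbortOnConnectFail = false,                  // keep retrying in the background instead of failing at startup
    ReconnectRetryPolicy = new ExponentialRetry(5000) // or: new LinearRetry(5000)
};

IConnectionMultiplexer multiplexer = ConnectionMultiplexer.Connect(options);
```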

We also have four health events we can capture from the Redis Connection:

  1. Connection Failed
  2. Connection Restored
  3. Error Message
  4. Service Maintenance Event

These additional events help detect and respond to Redis operation failures that occur asynchronously.
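A minimal sketch of wiring these four events to logging is shown below; the multiplexer and logger instances are assumed to come from dependency injection, and the ServerMaintenanceEvent requires a recent StackExchange.Redis version:

```csharp
multiplexer.ConnectionFailed += (_, e) =>
    logger.LogWarning("Redis connection failed: {FailureType} on {EndPoint}", e.FailureType, e.EndPoint);

multiplexer.ConnectionRestored += (_, e) =>
    logger.LogInformation("Redis connection restored on {EndPoint}", e.EndPoint);

multiplexer.ErrorMessage += (_, e) =>
    logger.LogError("Redis server error: {Message}", e.Message);

// Surfaces Azure Cache for Redis maintenance notifications (recent client versions).
multiplexer.ServerMaintenanceEvent += (_, e) =>
    logger.LogWarning("Redis maintenance event: {RawMessage}", e.RawMessage);
```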

The built-in capabilities of the retry policy provide a good starting point by supporting an exponential backoff strategy. Making use of an additional framework, Polly, to introduce features such as jitter can further enhance the solution; however, this requires additional development to create explicit retry mechanisms around all Redis caching operations.
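As a sketch of what that could look like for the cart repository (the class name, retry count, and delays are illustrative, not the existing implementation):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Polly;
using Polly.Retry;
using StackExchange.Redis;

// Wraps Redis reads in a Polly retry policy: exponential backoff plus random jitter.
public class CartRepository
{
    private static readonly Random Jitter = new Random();

    private readonly IDistributedCache _cache;
    private readonly AsyncRetryPolicy _retryPolicy;

    public CartRepository(IDistributedCache cache)
    {
        _cache = cache;
        _retryPolicy = Policy
            .Handle<RedisConnectionException>()
            .Or<RedisTimeoutException>()
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: attempt =>
                    TimeSpan.FromSeconds(Math.Pow(2, attempt))          // exponential backoff
                    + TimeSpan.FromMilliseconds(Jitter.Next(0, 250)));  // jitter
    }

    public Task<string> GetCartAsync(string cartId) =>
        _retryPolicy.ExecuteAsync(() => _cache.GetStringAsync(cartId));
}
```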
SQL Server

Entity Framework constructs the connection to the SQL Server database using a retry policy that has two parameters: Maximum Retry Count and Maximum Retry Delay. This is a linear retry mechanism. Consider using Polly to introduce an exponential backoff strategy and jitter to the retry policy.
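For reference, this style of configuration looks roughly like the following; the context name, connection string variable, and values are placeholders:

```csharp
using System;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

// Sketch only: CatalogDbContext, connectionString, and the retry values are
// placeholders for the application's actual context and settings.
services.AddDbContext<CatalogDbContext>(options =>
    options.UseSqlServer(connectionString, sqlOptions =>
        sqlOptions.EnableRetryOnFailure(
            maxRetryCount: 5,
            maxRetryDelay: TimeSpan.FromSeconds(30),
            errorNumbersToAdd: null)));
```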

Design Guidance
Retry mechanisms should be put in place only where appropriate. Certain types of transactions should not be retried; these include client errors, especially authentication and authorization failures.

To avoid Thundering Herd situations, the use of exponential backoff and jitter within retry policies is important to prevent resource exhaustion. It’s also important to set a maximum number of retries, based on the expected service level agreements of the underlying infrastructure. These retry policies need to be synchronized with client and server response times in order to avoid increased latency and reduced throughput.
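A sketch of the delay calculation only, showing exponential backoff capped at a maximum, random jitter so clients do not retry in lockstep, and a hard retry budget (all values are illustrative):

```csharp
using System;

static class RetryDelay
{
    private static readonly Random Jitter = new Random();

    // Returns the delay before the given retry attempt, or throws once the
    // retry budget is exhausted. Base delay, cap, and budget are illustrative.
    public static TimeSpan For(int attempt, int maxAttempts = 5)
    {
        if (attempt > maxAttempts)
            throw new InvalidOperationException("Retry budget exhausted.");

        var seconds = Math.Min(Math.Pow(2, attempt), 30);   // exponential backoff, capped at 30s
        var jitterMs = Jitter.Next(0, 500);                  // up to 500 ms of jitter
        return TimeSpan.FromSeconds(seconds) + TimeSpan.FromMilliseconds(jitterMs);
    }
}
```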

Inconsistent Backoff Strategies
If the server uses an exponential backoff strategy in conjunction with jitter while the client uses a constant backoff interval, then during an outage many clients may end up retrying in a synchronized fashion, creating unpredictable spikes of load on the server. This can reduce throughput or even overwhelm the system in a thundering herd situation.

Misaligned Retry Policy
If the client is configured to retry 10 times but the server is configured to retry 3 times, the server may have already given up while the client keeps retrying. This reduces overall throughput by wasting system resources and can potentially lead to service failures through resource exhaustion.

In conclusion, the .NET reliable web application should consider introducing Polly to add exponential backoff and jitter to its retry mechanisms. Additionally, it is essential to evaluate and synchronize retry thresholds and limits between the client and server components of the architecture. This approach will prevent Thundering Herd situations, ensure resource efficiency, and maintain more consistent throughput during Zone Outages, thereby enhancing the overall resilience and performance of the application.

@rspott @dave-read @Jerryp11 @gitforkash @shashwatchandra 👀

@markti markti added the enhancement New feature or request label May 14, 2024