~/codewithstu

3 Ways to Increase the Reliability of Your Applications with Polly

Transcript

In this video we're going to take a look at the different ways you can make your .NET applications more stable by using three different patterns. In order to create highly resilient applications, we must embrace the fact that applications will fail, often at inopportune times. Failures can come in many forms such as temporary loss of service, complete service failure, or timeouts. It's up to us as developers to decide how we're going to respond to each and every service failure and which patterns we're going to use to do this.

The first pattern we're going to look at is retries with decorrelated jitter. When we write a retry policy, we generally write it anywhere there is a network operation, such as writing to a remote file system or calling a third-party API. When we retry, we generally wait for a period of time, backing off to allow external systems to recover. But if we have a high number of concurrent operations and they all back off and retry at the same time, we can potentially overload the system again. To counteract this, we add some randomness to the retry delays, which is known as jitter. Jitter has been shown to massively decrease the total operation duration in a failure scenario.

Our library of choice for implementing retry policies is Polly. The team behind Polly and its many contributors have put a lot of effort into finding an efficient way of using retry policies with jitter. Without jitter, our retry policies will be correlated, something like the diagram shown here. As you can see, with jitter the total time and effort wasted are greatly reduced, and our service is more productive overall.

In the sample application here, we simulate multiple threads running and what happens when we have different versions of retry. To start off with, we have no jitter in place and just a standard exponential retry. As you can see, everything starts and finishes in bulk.
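A plain exponential retry like the one in the sample can be sketched as follows. This is a minimal illustration, assuming Polly v7; the `HttpRequestException` handler and the retry count are placeholder choices, not the video's exact code.

```csharp
// Sketch of a plain exponential-backoff retry policy (no jitter), assuming Polly v7.
using System;
using System.Net.Http;
using Polly;
using Polly.Retry;

public static class RetryPolicies
{
    public static AsyncRetryPolicy CreateExponentialRetry()
    {
        // 2^attempt seconds: 2s, 4s, 8s. Every caller backs off on the
        // same schedule, so retries arrive in correlated bursts.
        return Policy
            .Handle<HttpRequestException>()
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));
    }
}
```

Because every caller computes the same delays, all the waiting threads wake up together, which is exactly the bunching effect visible in the demo.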

Now let's define a new method so that we can add jitter to our retry policy. We're going to use the DecorrelatedJitterBackoffV2 method from the Backoff class in the Polly.Contrib.WaitAndRetry library. The first parameter that we supply is the median delay to target before the first retry. The second parameter is how many times in total we want to retry. Finally, we define which exceptions we want the policy to handle, and then call WaitAndRetry, passing in the new delay parameters. Here I'm using an additional overload which allows us to perform an action each time we retry. Then we just need to quickly change the policy that we're working with at the top and we can rerun our program. As we scroll back through the output, you can see that we no longer perform the operations in big blocks; they're now more spread out over time. This is a massive improvement over what we previously had for the service without any jitter.
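The jittered version of the policy can be sketched like this, assuming Polly v7 and the Polly.Contrib.WaitAndRetry package; the console logging in the retry callback is illustrative, not the video's exact code.

```csharp
// Sketch of a retry policy using decorrelated jitter, assuming Polly v7
// plus the Polly.Contrib.WaitAndRetry package.
using System;
using System.Net.Http;
using Polly;
using Polly.Contrib.WaitAndRetry;
using Polly.Retry;

public static class JitteredRetryPolicies
{
    public static AsyncRetryPolicy CreateJitteredRetry()
    {
        // First parameter: median delay to target before the first retry.
        // Second parameter: total number of retries.
        var delays = Backoff.DecorrelatedJitterBackoffV2(
            medianFirstRetryDelay: TimeSpan.FromSeconds(1),
            retryCount: 5);

        return Policy
            .Handle<HttpRequestException>()
            .WaitAndRetryAsync(
                delays,
                // Extra overload: perform an action on each retry.
                onRetry: (exception, delay, attempt, context) =>
                    Console.WriteLine($"Retry {attempt} after {delay.TotalMilliseconds:F0} ms"));
    }
}
```

Each caller draws its own randomized delay sequence, so concurrent retries decorrelate instead of hammering the recovering service in lockstep.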

The next pattern that we're going to take a look at is the bulkhead pattern. A bulkhead is used to control access to a common resource by multiple threads to avoid overloading it and causing cascading failures within a system. It does this by placing limits on what a system can process with a fixed length queue of pending requests. Once the pending request queue is full, then any subsequent request is rejected and returned to the caller. This helps leave system resources for requests that the system can actually process and doesn't overload system resources such as CPU and memory.

We can find the maximum number of concurrent requests by stress testing our target service, so we know exactly at which point the system begins to fail. To figure out the number of pending requests, I usually use two to four times the maximum load as a starting point, then run a series of different load tests against the system to see how it reacts and responds, and adjust this value accordingly. A bulkhead can be placed either on the client side or the server side, although it is more common to see it on the server side, as that's where the expensive computation usually occurs.

Now let's take a look at how we can implement a bulkhead method. We define our method with two parameters: the capacity and the maximum queue length. We then look on the Policy class for the BulkheadAsync method, and pass in the capacity and the queue length. There's also an optional overload for doing something with the context, which we're simply going to use to write out a rejected-call message. Once we use our new method at the top and pass in the correct parameters, we can then execute our application. As you can see from the screen, we have a few rejected-call messages appearing, meaning that we have reached the limits of this system.
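The bulkhead method described above can be sketched as follows, assuming Polly v7; the parameter values are placeholders to be tuned via the load testing discussed earlier, and the rejection message is illustrative.

```csharp
// Sketch of a bulkhead policy, assuming Polly v7. Capacity and queue
// length are placeholder values to tune through load testing.
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Bulkhead;

public static class BulkheadPolicies
{
    public static AsyncBulkheadPolicy Create(int capacity, int maxQueueLength)
    {
        return Policy.BulkheadAsync(
            maxParallelization: capacity,      // concurrent executions allowed
            maxQueuingActions: maxQueueLength, // pending requests allowed to wait
            // Optional overload: runs when a call is rejected because
            // both the execution slots and the queue are full.
            onBulkheadRejectedAsync: context =>
            {
                Console.WriteLine("Rejected call");
                return Task.CompletedTask;
            });
    }
}
```

Calls beyond the queue limit fail fast with a `BulkheadRejectedException` rather than piling up and exhausting CPU and memory.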

The next pattern we're going to take a look at is the circuit breaker pattern. A circuit breaker tracks the number of faults in calls placed through it and blocks calls when a configurable fault threshold is exceeded. For example, say you are calling an API that is continuously returning 500 status code results because of a failure condition. In this case, the circuit breaker would trip and prevent calls from being forwarded to that service, giving it the opportunity to recover. Then, after a period of time, the circuit breaker would automatically allow calls through again to the hopefully recovered service.

A circuit breaker has three states: closed, open, and half open. In the open state, no requests are forwarded to the target system as the circuit breaker has detected it's in an unhealthy state. In the closed state, requests are forwarded to the target system as normal. Whilst in the half open state, each request is treated as an experiment to see whether or not the target system has recovered, and places the circuit breaker into the open or closed state depending on the result of the operation.

Polly makes implementing this pattern really easy by providing us an advanced circuit breaker. This is an improvement on their original circuit breaker implementation, as it reacts to a proportion of failures measured over a duration of time. It also ensures that we have a minimum amount of throughput before it starts monitoring for failures.

So to implement the advanced circuit breaker pattern, we're going to define a new policy. We're going to declare two variables first: a sample period to say how long the circuit breaker is going to monitor for failures, and the reset period, which is how long the circuit breaker stays open before sampling resets. Then we call Policy.Handle, handling all exceptions, followed by the AdvancedCircuitBreaker extension method. We pass in the proportion of failed calls after which the circuit breaker will open, in this case 50% or 0.5; then the sample period, the minimum throughput, and the reset period. We're also going to supply two additional callbacks. The first tells us when the circuit has broken so we can perform an action, and the second fires when the circuit breaker resets.
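Putting those pieces together, the policy can be sketched like this, assuming Polly v7; the specific durations and the minimum throughput value are illustrative choices, and the console messages stand in for whatever action you want on break and reset.

```csharp
// Sketch of an advanced circuit breaker policy, assuming Polly v7.
// Durations and minimum throughput are illustrative values.
using System;
using Polly;
using Polly.CircuitBreaker;

public static class CircuitBreakerPolicies
{
    public static AsyncCircuitBreakerPolicy Create()
    {
        var samplePeriod = TimeSpan.FromSeconds(10); // window over which failures are measured
        var resetPeriod = TimeSpan.FromSeconds(5);   // how long the circuit stays open

        return Policy
            .Handle<Exception>() // handle all exceptions
            .AdvancedCircuitBreakerAsync(
                failureThreshold: 0.5,           // open at >= 50% failures...
                samplingDuration: samplePeriod,  // ...measured over this window...
                minimumThroughput: 8,            // ...once at least this many calls occurred
                durationOfBreak: resetPeriod,
                onBreak: (exception, breakDelay) =>
                    Console.WriteLine($"Circuit broken for {breakDelay.TotalSeconds}s"),
                onReset: () => Console.WriteLine("Circuit reset"));
    }
}
```

While the circuit is open, calls fail immediately with a `BrokenCircuitException` instead of being forwarded to the struggling service.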

If we go ahead and use the circuit breaker policy and run the application, we'll be able to see that the circuit breaker breaks a couple of times before resetting.

Whilst all of the policies that we've seen here today can be game changers for our applications, if we combine them then they become even more powerful. I typically implement both the decorrelated jitter retry and the bulkhead together to constrain the resources used by any given service. To do this, Polly offers a simple extension method called Wrap. This takes multiple policies and executes them, in the order specified, as a brand new policy.
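Combining the two policies can be sketched as below, assuming Polly v7 and the Polly.Contrib.WaitAndRetry package; the capacities and delays are placeholder values, not the video's exact code.

```csharp
// Sketch of combining a jittered retry with a bulkhead via Policy.WrapAsync,
// assuming Polly v7 and Polly.Contrib.WaitAndRetry. Values are placeholders.
using System;
using Polly;
using Polly.Contrib.WaitAndRetry;

public static class CombinedPolicies
{
    public static IAsyncPolicy Create()
    {
        var retry = Policy
            .Handle<Exception>()
            .WaitAndRetryAsync(Backoff.DecorrelatedJitterBackoffV2(
                medianFirstRetryDelay: TimeSpan.FromMilliseconds(200),
                retryCount: 4));

        var bulkhead = Policy.BulkheadAsync(
            maxParallelization: 10,
            maxQueuingActions: 20);

        // Policies execute outermost-first: the retry wraps the bulkhead,
        // so a call rejected by the bulkhead can be retried after a jittered delay.
        return Policy.WrapAsync(retry, bulkhead);
    }
}
```

Ordering matters here: putting the retry on the outside means bulkhead rejections are themselves retried, whereas wrapping in the opposite order would count each retry attempt against the bulkhead's capacity.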

If you enjoyed this video, consider subscribing to the YouTube channel for more content like this.
