Our applications live in an imperfect world: they communicate over unreliable networks and call resources that come with no guarantees. Lately, with the rise of service-based architectures, even more points of failure have appeared. Source code often contains more defensive code than actual business logic. I’d like to write about one kind of failure: an unsuccessful service call.
Failures differ in nature:
Transient – For instance,
– 503 Service Unavailable – when the service is overloaded or temporarily disabled for maintenance;
– 504 Gateway Timeout – when a proxy server doesn’t get a response from the backend servers in time;
– Any timeout, when there is no response from the server at all.
These are transient errors and might resolve after some time.
Permanent – An incorrect password error, for instance, will never resolve with time.
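The distinction above can be sketched as a small helper. This is only an illustration: the name `is_transient`, its parameters and the exact set of status codes are assumptions made for this example, not part of any particular library.

```python
# Status codes treated as transient in this sketch (taken from the list above).
TRANSIENT_STATUS_CODES = {503, 504}


def is_transient(status_code=None, timed_out=False):
    """Return True if the failure might resolve by itself and is worth retrying."""
    if timed_out:
        # No response from the server at all.
        return True
    return status_code in TRANSIENT_STATUS_CODES
```

Anything the helper does not recognise, such as a 401 returned for an incorrect password, is treated as permanent and is not retried.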
When we determine that the error is transient, we can start retrying every few seconds. There is just one issue: if we hit a timeout because the server is overloaded, our retries will make the request queue even longer and prevent the service from recovering. There are a few design patterns to tackle this problem:
You might have noticed that when you have Gmail open in the browser and the internet connection drops, a notification comes up: “Connecting in 1s…”. It retries first after 1 second, then after 2 seconds, then 4, then 8, increasing the delay exponentially. Sometimes the delay even reaches hours.
This is not limited to browser-server communication. Such problematic connections can also exist entirely on the server side, between different components. Sometimes ‘randomness’ is introduced for better performance. For instance, both methods are used in the Amazon AWS architecture: Exponential Backoff and Jitter.
By randomness I mean that instead of a fixed 4-second delay, the delay could be X seconds, where X is a random number between 1 and 4.
I use these numbers for the sake of explanation. Clearly, we will need several constants: a base delay, a maximum number of attempts and a maximum delay.
In a similar fashion, when a service invocation fails, we can wait longer and longer between retries instead of retrying at a fixed short interval, and so let the service recover.
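A minimal sketch of exponential backoff with jitter, using the three constants mentioned above. The constant values, the `TransientError` exception and the function names are hypothetical and chosen only for this example; the jitter picks a random delay between the base delay and the current exponential ceiling, as in the "between 1 and 4" description.

```python
import random
import time

BASE_DELAY = 1.0   # seconds before the first retry (assumed value)
MAX_DELAY = 60.0   # upper bound for a single delay (assumed value)
MAX_ATTEMPTS = 5   # total number of calls before giving up (assumed value)


class TransientError(Exception):
    """Hypothetical exception the wrapped call raises for retryable failures."""


def backoff_delay(attempt, jitter=True):
    """Delay before retry number `attempt` (0-based): 1, 2, 4, 8... capped."""
    ceiling = min(MAX_DELAY, BASE_DELAY * 2 ** attempt)
    if jitter:
        # Random delay between the base delay and the exponential ceiling.
        return random.uniform(BASE_DELAY, ceiling)
    return ceiling


def call_with_retries(call, sleep=time.sleep):
    """Run `call`, retrying on TransientError with growing delays."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call()
        except TransientError:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # out of attempts, let the caller see the error
            sleep(backoff_delay(attempt))
```

Passing `sleep` as a parameter is a design choice that makes the loop easy to test without actually waiting.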
The exponential backoff algorithm is also used in the Ethernet protocol. When two machines on the same network try to send a packet simultaneously, a collision happens. If both repeated the action after the same delay, they would collide again and again forever. Therefore each machine picks its delay at random, and the formula looks roughly like this:
0 ≤ r < 2^k where k = min (n, 10)
Here n is the number of collisions so far, k is that number capped at 10, and r is selected randomly from 0 up to (but not including) 2^k. So the more collisions happen, the higher the upper limit grows (exponentially), and the less likely it becomes that two machines pick the same delay.
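The formula can be written as a few lines of code. This is a sketch of the formula only, with a hypothetical function name; it returns r as a plain number and leaves out how the protocol turns that number into an actual waiting time.

```python
import random


def ethernet_backoff(n):
    """Pick r with 0 <= r < 2**k, where k = min(n, 10) and n is the collision count."""
    k = min(n, 10)
    return random.randrange(2 ** k)
```

After 3 collisions r is one of 0 to 7; from the 10th collision onwards the range stays at 0 to 1023.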
The Circuit Breaker pattern works much like the electric circuit breaker we have at home. An intermediary object is placed on the client side, between the client and the server, and acts as a protector of the service. All requests are sent through this object. When it notices a high rate of failed responses, it trips to an ‘Open’ state and stops passing client requests through; instead it responds to them itself with a failure.
An electric breaker has to be switched back manually, but we can’t do that here. So after some time interval the object should automatically switch to a “Half-Open” state and let one request pass to the service. Based on the response, it either returns to the “Open” state or moves to the “Closed” state and passes all requests again.
Photo is taken from Martin Fowler’s blog
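The three states can be sketched as a small class. This is a simplified illustration under stated assumptions: it counts consecutive failures instead of measuring a failure rate, it is not thread-safe, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold` and `reset_timeout` are invented for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised by the breaker itself while it refuses to pass calls through."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Answer with a failure without touching the service.
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # let one trial request through
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        # Success: reset the counter and pass all requests again.
        self.failures = 0
        self.state = "closed"
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A client would wrap each service call, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` stands for whatever function actually performs the request.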
Some frameworks (especially those that communicate with a database) already implement these kinds of algorithms.
If you have a public API and want unknown clients to stick to a better retry mechanism, i.e. not to completely kill your service when it is in trouble, you can write client libraries for several languages yourself and let users call the API through them.