To retry or not to retry – Exponential backoff, Cirtcuit breaker

Our applications live in an imperfect world, communicate using unstable network and call unguaranteed resources. Lately, with the rise of service world, even more failure points appeared. The source code often contains more of insurance code than the actual business logic. I’d like to write about one of their kind – when the service call is unsuccessful.

The fail result might differ in nature:
Transient – For instance,
– 503 Service unavailable – when the service is overloaded or temporary disabled for maintenance;
– 504 Gateway Timeout – when proxy servers don’t get response from back servers in time;
– Also any timeout, when there is no response from server at all.
These are transient errors and might resolve after some time.

Permanent – Incorrect password error will never resolve with time.

When we determine that we are getting a transient error, we can start retrying every few seconds. Just one issue: If we hit timeout because of a server overload, our retries will increase the request queue even more and prevent service from recovering. There are few design patterns to tackle this problem:


Exponential backoff

You might have noticed, when you have GMail opened in the browser and internet goes off, the notification comes up “Connecting in 1s…”. At first it will retry in 1 second, then in 2 seconds, then in 4; then 8 and increases the delay time exponentially like that. Sometimes it even reaches hours.
The case is not only with the browser-server communication. This kind of problematic connections can be present totally on the server side – among different components. Sometimes ‘randomness’ is introduced for better performance. For instance, both methods are used in Amazon AWS architecture: Exponential Backoff and Jitter.
In randomness I mean that instead of having 4 seconds delay, there could be X seconds, where X is random number between 1 and 4.

I use these numbers for the sake of explanation. Clearly, we will need several constants:
Base delay time, maximal number of attempts and maximal delay time.

In a similar fashion, when the invocation is unsuccessful, we can make more and more delay instead of retrying every other second and let the service recover.

Exponential backoff algorithm is also used in the Ethernet protocol. When two machines on the same network try to send the packet simultaneously, collision happens. If they repeat the action after the same delay, they will collide again and again forever. Consequently, the delay formula roughly looks like this:
0 ≤ r < 2^k where k = min (n, 10)
n is the number of collisions and r is selected randomly between 0 and 2^k. k is number of collisions topped by 10. So, the more collisions happen, the more top limit is increased (exponentially). Then there is even less probability to get same delay time randomly.


Circuit breaker

This algorithm is very much like an electric circuit breaker, which we have at home. An intermediary object is placed (on client side) between the client and a server, which serves as a service protector. The requests are sent through this object. When it notices a high rate of failed responses, it will trip to an ‘Open’ state and won’t pass client requests, but rather respond to them itself with a failure.

In case of an electric one, we have to manually switch it back to the initial state, but we can’t do it here, so after some time interval, this object should automatically switch into a “Half-Open” state and let one request pass to the service. Based on the response, it will either stay in the “Open” state, or move to “Closed” one and will pass all requests.

After several timeout results, the protector will trip and start replying to the client

Photo is taken from Martin Fowler’s blog

In some frameworks (especially while communicating with the database), these kinds of algorithms are already implemented.

If you have a public API and wish that unknown clients stick to a better mechanism of retry – i.e not to totally kill your service during troubles, then you can write client libraries yourself for several languages and users will use them instead.

Memory optimization with multidimensional arrays in Java

Yesterday I hit a memory limit while submitting a problem solution to an online judge. All because I used a wrong declaration of a multidimensional array. In case of C or C++ language this would not be a problem, but due to the fact that in Java everything is an object (despite primitive types), we should calculate allocated memory differently for multidimensional arrays.

For instance, in C++ the size of this array is 96 bytes (assuming we have 4-byte int):
int arr[2][3][4];
2 * 3 * 4 = 24. Plus 4 bytes for each due to int type and we get total of 24 * 4 = 96.
If we declare the array like this, result would be the same: int arr[4][3][2]

Let’s compare these two declarations in Java:
int arr[][][] = new int[40000][200][2];
int arr[][][] = new int[2][200][40000];

There is not much different from the data storage viewpoint, however the first one requires much more memory. Here is why:
1. Array in Java is an object itself. Every object (not near the reference, but in the heap) holds few additional bytes for the object header information. This header contains data necessary for the VM, which later uses it for the garbage collector or some other purposes. As far as I know, generally it takes 8 bytes in case of a 32-bit machine and 16 bytes in case of a 64-bit one. Apart from this, an object holds information about the array size – that’s additional 4 bytes. Also padding in memory might take few more bytes. So, I won’t start precise calculations and let’s assume that each array object (excluding array elements) takes X bytes.

2. int[a][b] – In Java language this is ‘a’ number of arrays which contain ‘b’ number of elements. I.e. we have a+1 number of objects instead of just one.
int[a][b][k] In case of a three-dimensional one a * b + a + 1 number of object, etc.

Now let’s calculate and compare sizes of these arrays:
(40,000 * 200 + 40,000 + 1) * X + (40,000 * 200 * 2 * 4)
(2 * 200 + 2 + 1) * X + (40,000 * 200 * 2 * 4)

Clearly, the second part, which calculates the total size of elements based on int type, will be the same. But according to the first part we will have 8,039,598 excessive objects in case of the first array, consequently it will take considerable larger memory.

By the way, I cannot see if this number is real with profiler, do you have any idea how to check this?

Dynamic Array

Dynamic array is a data structure with variable length. One can insert, retrieve or delete an element using random access – i.e. it needs a fixed time to reach any element and this does not depend on whole size of the array.

You might come across this structure by different names – dynamic array, growable array, resizable array, dynamic table, mutable array, array list.

To guarantee random access, it is necessary to allocate a continuous memory for the array, i.e. its elements should be stored next to each other.

In case of a static array, this is not a problem, as we define the length in advance. But sometimes we don’t yet know the length. So, how much continuous memory should we allocate?

Clearly, there is no point for allocating huge memory just in case. We should not reserve million bytes in advance just to store 20 elements.

This is where dynamic array comes in. In many programming languages this problem is solved this way:
The dynamic array is backed by a static array, which is created in small size. When this static array is filled and users will try to add more elements, a larger static array will be created behind the scenes. Existing elements will be copied into it and the old one will be deleted. Consequently, insertion of some elements in the dynamic array will take more time than the others.

With this solution in mind, we should answer to an important question: By what factor should we increase the static array length?

It should be noted, that again we are searching for a balance between performance and memory waste. If we increase the array length with only one element, rewriting whole array on each new element will take too much time. But if we increase by a large factor, we might end up with a large empty array.

Optimal number is 2. This number might be slightly altered corresponding to requirements.

You might find its different versions in various programming languages – e.g. Vector in C++ is increased 2 times. Vector from Java standard library also has a factor 2, however you can change it by passing arguments. A static array of ArrayList in Java is increased by 3/2 times. A static array of HashMap – by 2 times. In the C implementation of Python, the number is a bit odd – approximately 9/8. Here is the source.. And here is an explanation.

If the programmer knows approximate size of an array in advance, they can configure the dynamic array correspondingly. E.g. Vector in C++ has a function reserve, which will reserve memory of given size.

ArrayList and HashMap classes in Java have a constructor parameter initialCapacity. In case of HashMap, not only static array is rewritten in the background, but the hashes are also regenerated.

If performance is critical, this parameter can be used. I carried out several experiments and saw the difference, however, in ordinary tasks this difference is not noticeable. Even in case of factor 2 and million elements, the arrays are rewritten only for 20 times.

In the beginning, I mentioned that some element insertion might take time from O(1) to O(n), where n is a total number of elements. Despite of this and based on amortized analysis, insertion time in a dynamic array is defined as O(1).

The idea of amortized analysis is to consider both, slow and fast operations of the algorithm. They might balance each other. While estimating an algorithm performance, we generally reach for the worst case scenario, but sometimes it is possible to calculate the ratio of expensive operations.

Let’s calculate the time for filling a dynamic array of n elements:
If we increase the length of an array by 2 times, we can estimate the number of rewrite like this:

Let’s start from the end. In the end it will need to rewrite all elements. Before that, only half of elements. Before that, quarter of elements, etc.
n + n/2 + n/4 + n/8 + … = n (1 + 1/2 + 1/4 + 1/8 + …) = 2n

Also, let’s add an insertion time of a new element and we get 3n. If we take an average, we will get O(3) = O(1) time for inserting an element.