Several months ago, thanks to a common interest, I met two brilliant people – Erekle Magradze and Vazha Pirtskhalaishvili. Soon the idea was born to strengthen this common interest and find others like us: to build a community. So we started preparing a tech meetup, which eventually turned into a two-day conference. Anyone reading my blog has probably already attended or heard about it. I just wanted to keep an article here as a memory :))
In my opinion, the Georgian tech world is facing an important decision. The DevOps revolution started around 10 years ago and has only just reached our community. Very few Georgian businesses harness the benefits of development process automation. If we don’t catch up, we’ll be left out of the global market.
Most of the conference speakers shared experience they had gained with foreign companies. I’m not exaggerating when I say that I really enjoyed every talk and listened to each of them several times. I’m happy that the participants found the time and gave us a productive weekend. You can watch the video presentations (in Georgian) on our FB page or YouTube: DevOps Con Tbilisi 2018 – Tech talks
I was always curious how people write enormous, successful applications without bugs. Take, for instance, the Mars rover “Curiosity”, which flew to Mars, covered 350 million miles in 8 months, landed within a 6-mile radius, drives over unknown terrain and sends data back to Earth.
How the Rover was sent to Mars
One of the rare technical talks about the Mars rover is available on the internet. Gerard Holzmann, one of the lead scientists at NASA JPL (Jet Propulsion Laboratory), describes how they managed the software development process and tried to minimize problems.
I will briefly mention a few facts from the video:
Approximately 4 million lines of code, 100+ modules, 120 threads on one processor (+1 backup). Five years and 40 developers. And all of this was for one client and one-time use.
The project contained more code than all previous Mars missions combined. At this scale, human code reviews were no longer effective.
They created a coding standard to reduce risk and constantly checked the code against it using automated tools. Since people tend to ignore documents with hundreds of rules, they ran a poll and selected the ten most important ones: Power of ten
E.g. don’t use recursion, goto or other complex flow constructs; variable scope must be minimal; all loops must have a fixed upper bound; pointer usage must be limited to a single dereference, etc.
Every night the code was built and checked with static analysis and various automated tests. This ran at night because the analysis took 15 hours.
Whoever broke the build received a penalty: they had to put up a Britney Spears poster in their cubicle. And whoever left many warnings in their code was listed on the “Wall of Shame”. It seems that people need motivation, even at NASA 😀
Nobody was allowed to write code before completing specific training and certification.
They required 100% code coverage. If you have ever done this, you know how heavy a task it is, as it means testing impossible cases, too.
Code review was not a long group meeting. They met only to discuss disagreements. Notes and directions were exchanged via a special application, which also showed the status of the code after the previous night’s checks.
The number of warnings from the compiler and static analysis had to be zero. This turned out to be a difficult task and took a lot of time. The correlation between this requirement and the project’s success is unknown, but this was the cleanest code compared to their previous missions.
Of all the publicly known cases, the most expensive code was written for the Space Shuttle: $1,000 per line. But in 2013 Toyota lost in court, and if we calculate from the compensation amount, one line comes out at $1,200. Toyota’s unintended acceleration was in the news several times because of car accidents and complaints. They recalled floor mats, then accelerator pedals, but it was not enough. Then a NASA team checked the software of a Toyota car against their standard, and even though they found 243 violations, they could not confirm that the software caused the problems. The court had invited an external expert, who criticized the Toyota software for recursion, stack overflow and much more.
It turns out that we, software developers, take too many risks. 🙂 We trust the OS and external libraries, and we don’t check the validity of values returned from functions. We filter user input, but do we do the same when communicating with various internal services? In my opinion, this is natural. Checking everything is very time-consuming and, consequently, expensive. You can look at some defensive programming strategies.
There are applications that almost never make mistakes, but when they do, the loss is huge. Similarly, there are applications that have more bugs, but fixing them is simple and cheap. It is probably the same with cars: it’s expensive to repair a BMW, which rarely breaks down. And during the war, the US had Willys MB jeeps, which could be repaired very quickly. They would simply disassemble the car to relocate it. There is even a video where Canadian soldiers disassemble and reassemble a Jeep in under 4 minutes.
Our applications live in an imperfect world, communicate over unstable networks and call resources that come with no guarantees. Lately, with the rise of services, even more points of failure have appeared. The source code often contains more safety-net code than actual business logic. I’d like to write about one such case: when a service call is unsuccessful.
The failure might differ in nature:
Transient – for instance,
– 503 Service Unavailable – when the service is overloaded or temporarily disabled for maintenance;
– 504 Gateway Timeout – when a proxy server doesn’t get a response from the backend servers in time;
– any timeout, when there is no response from the server at all.
These are transient errors and might resolve after some time.
Permanent – an incorrect password error will never resolve with time.
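The distinction above can be sketched as a small helper. This is only an illustration: the function name `is_transient` and the set of status codes are my assumptions, not a standard list, and a real client would probably treat more cases.

```python
# Status codes treated as transient in this sketch (an assumption, not a standard list).
TRANSIENT_STATUSES = {503, 504}


def is_transient(status=None, timed_out=False):
    """Return True if the failure might go away by itself and is worth retrying."""
    if timed_out:
        # No response from the server at all: treated as transient here.
        return True
    return status in TRANSIENT_STATUSES
```

For example, `is_transient(503)` and `is_transient(timed_out=True)` are true, while a 401 returned for an incorrect password is not, so retrying it is pointless.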
When we determine that we are getting a transient error, we can start retrying every few seconds. There is just one issue: if we hit a timeout because the server is overloaded, our retries will make the request queue even longer and prevent the service from recovering. There are a few design patterns to tackle this problem:
You might have noticed that when you have Gmail open in the browser and the internet goes down, a notification comes up: “Connecting in 1s…”. At first it retries in 1 second, then in 2 seconds, then in 4, then 8, increasing the delay exponentially. Sometimes it even reaches hours. This pattern is called exponential backoff.
This does not apply only to browser-server communication. Such problematic connections can exist entirely on the server side, among different components. Sometimes ‘randomness’ is introduced for better performance. For instance, both methods are used in the Amazon AWS architecture: Exponential Backoff and Jitter.
By randomness I mean that instead of a 4-second delay there could be a delay of X seconds, where X is a random number between 1 and 4.
I use these numbers for the sake of explanation. Clearly, we will need several constants:
Base delay time, maximum number of attempts and maximum delay time.
In a similar fashion, when an invocation is unsuccessful, we can wait longer and longer instead of retrying every second, and let the service recover.
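Here is a minimal sketch of retrying with exponential backoff and jitter, using the three constants mentioned above. The names `TransientError`, `backoff_delay` and `call_with_retry` and the default values are hypothetical and chosen only for illustration. The jitter picks a random delay between the base delay and the exponential ceiling, which mirrors the "between 1 and 4" example; other variants exist.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical exception the caller raises for failures worth retrying."""


def backoff_delay(attempt, base=1.0, max_delay=60.0, jitter=True):
    """Delay in seconds before retry number `attempt` (0-based)."""
    # The ceiling doubles on every attempt but never exceeds max_delay.
    ceiling = min(max_delay, base * 2 ** attempt)
    if jitter:
        # Random delay between the base delay and the exponential ceiling.
        return random.uniform(base, ceiling)
    return ceiling


def call_with_retry(operation, max_attempts=5, base=1.0, max_delay=60.0,
                    sleep=time.sleep):
    """Call operation(); on TransientError wait and retry up to max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller handle the failure
            sleep(backoff_delay(attempt, base, max_delay))
```

Passing `sleep` as a parameter is just a convenience so the function can be tried out without actually waiting.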
The exponential backoff algorithm is also used in the Ethernet protocol. When two machines on the same network try to send a packet simultaneously, a collision happens. If they both repeat the action after the same delay, they will collide again and again forever. So the delay is randomized, and the formula roughly looks like this: 0 ≤ r < 2^k, where k = min(n, 10). Here n is the number of collisions so far, k is that number capped at 10, and r is chosen randomly from the range 0 to 2^k. So the more collisions happen, the higher the upper limit grows (exponentially), and the less likely it becomes that both machines randomly pick the same delay.
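The formula is easy to try out in a few lines. This is only a rough sketch of the rule as described here, not of a real network stack; the function name `ethernet_backoff` is made up, and the result r is a multiplier of whatever time unit the protocol uses.

```python
import random


def ethernet_backoff(collisions):
    """Pick a random integer r with 0 <= r < 2**k, where k = min(collisions, 10)."""
    k = min(collisions, 10)  # the exponent stops growing after 10 collisions
    return random.randrange(2 ** k)
```

After 3 collisions r is picked from 0..7, and after 10 or more from 0..1023, so two machines are less and less likely to choose the same value.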
The circuit breaker pattern is very much like the electric circuit breaker we have at home. An intermediary object is placed (on the client side) between the client and the server, and it acts as a protector of the service. Requests are sent through this object. When it notices a high rate of failed responses, it trips to an ‘Open’ state and no longer passes client requests on, but responds to them itself with a failure.
With an electric breaker, we have to switch it back to the initial state manually, but we can’t do that here. So after some time interval, the object should automatically switch into a “Half-Open” state and let one request through to the service. Based on the response, it either returns to the “Open” state or moves to the “Closed” one and passes all requests again.
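The three states can be sketched in a small class. This is a simplified illustration under my own assumptions: it counts consecutive failures instead of measuring a failure rate, it is not thread-safe, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold` and `reset_timeout` are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised by the breaker itself when it refuses to pass a request through."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        if self.state == "open":
            # Answer with a failure ourselves and leave the service alone.
            raise CircuitOpenError("service is protected, try again later")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial request in half-open, or too many failures
            # while closed, (re)opens the breaker and restarts the timer.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success: go back to the closed state.
        self.failures = 0
        self.opened_at = None
        return result
```

The client wraps every request as `breaker.call(lambda: send_request())`, where `send_request` stands for whatever function actually talks to the service.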
In some frameworks (especially for communicating with a database), these kinds of algorithms are already implemented.
If you have a public API and want unknown clients to stick to a better retry mechanism, i.e. not to kill your service completely when it is in trouble, you can write client libraries for several languages yourself, so that users use them instead.