I was always curious how people write enormous successful applications without bugs, for instance, the Mars Rover “Curiosity”, which flew to Mars, covered 350 million miles in 8 months, landed within 6 mile radius, walks on unknown land and sends data to the Earth.
How Rover was sent to Mars
There is one of the rare technical speeches about Mars Rover on the internet. Gerard Holzmann, one of the lead scientists of NASA JPL (Jet Propulsion Laboratory), describes how they were trying to manage software development processes and minimize the problems.
I will shortly mention few facts from the video:
- Approximately 4 million lines of code, 100+ modules, 120 threads on one processor (+1 backup). Five years and 40 developers. And all of this was for one client and one-time use.
- The project contained more code, that all previous Mars missions combined. Because of this scale, human code reviews were not effective any more.
- They created standard to reduce risks and constantly checked code against it using automatic tools. As people used to ignore documents with hundreds of rules, they made a poll and selected out ten most important rules: Power of ten
E.g. don’t use recursion, goto and other complex flow constructs; variable scope must be minimal; all loops must have fixed edges; pointer usage must be limited to a single dereference, etc. - Every night the code was built, checked with static analysis and various automatic tests. It was at night, as analysis took 15 hours.
- The one who broke the build, would receive a penalty, so they would have to put up a Britney Spears poster in their cubicle. And the one who would leave many warnings in their code, would be printed on the “Wall of Shame”. It seems, that people need motivation, even in NASA 😀
- Person was not allowed to code until a specific training and certification.
- They required 100% code coverage. If you have ever done this, you should know how heavy task this is, as they would need to test impossible cases, too.
- Code review was not a long group meeting. They met only to discuss disagreements. Notes and directions were exchanged via a special application, which also showed the code status after previous night checkings.
- The warnings from a compiler and static analysis had to be zero. This turned out to be a difficult task and took lots of time. The correlation of this requirement and the project success is unknown, but this was the cleanest code in comparison to their previous missions.
- In critical parts they followed the MISRA programming standard – which is used in engines and life-critical equipment.
- They performed logical verification of critical subsystems – mathematically proved correctness of the algorithm.
If you are interested in this mission, there is one more speech (although I liked the first one more): CppCon 2014: Mark Maimone “C++ on Mars: Incorporating C++ into Mars Rover Flight Software”
About violating the standard
Of all the cases known on the internet, the most expensive code was written for Space Shuttle – 1000$/line. But in 2013 Toyota lost in court and if we calculate the compensation amount, the cost of one line would turn out as 1200$. The unintended acceleration of Toyta was in news for several times due to car accidents and complaints. They revoked pads, then acceleration pedals, but it was not enough. Then NASA team checked the software of a Toyota car against their standard and even though they found 243 violations, they could not confirm that software caused problems. Court had invited an external expert, who critisized Toyota software because of recursion, stack overflow and much more.
Whole case is here: A Case Study of Toyota Unintended Acceleration and Software Safety
We, mere mortal developers
It turns out the we, software developers, risk too much. 🙂 We trust an OS, external libraries, don’t check the validity of value returned from function. Although we filter the user input, do we do the same while communicating to various inner services? In my opinion, this is natural. Checking everything is very timeconsuming and, consequently, expensive. You can look at some Defensive programming strategies.
There are applications, which almost never make mistakes, but when it does, it brings huge loss. Similarly, there are applications, which has more bugs, but correcting them is simple and cheap. Probably the same as with cars – it’s expensive to recover a BMW, which is rarely broken. And during the war, US had Willys MB jeeps, which were recovered very quickly. They would simply disassemble the car to relocate. There even is a video where Canadian soldiers disassemble and assemble Jeep in under 4 minutes.
I think, most of our applications are in the later category. Important thing is to be able to make changes quickly and with minimal expenses.