Unit Testing, TDD and the Shuttle Disaster

I was reading Feynman's report on the Shuttle disaster, "Appendix F - Personal observations on the reliability of the Shuttle", and I was freaked out by the similarities between military engine development and bottom-up, test-driven development. There is a small passage in the report about how military engines are built:

The usual way that such engines are designed (for military or civilian aircraft) may be called the component system, or bottom-up design. First it is necessary to thoroughly understand the properties and limitations of the materials to be used (for turbine blades, for example), and tests are begun in experimental rigs to determine those. With this knowledge larger component parts (such as bearings) are designed and tested individually. As deficiencies and design errors are noted they are corrected and verified with further testing. Since one tests only parts at a time these tests and modifications are not overly expensive. Finally one works up to the final design of the entire engine, to the necessary specifications. There is a good chance, by this time that the engine will generally succeed, or that any failures are easily isolated and analyzed because the failure modes, limitations of materials, etc., are so well understood. There is a very good chance that the modifications to the engine to get around the final difficulties are not very hard to make, for most of the serious problems have already been discovered and dealt with in the earlier, less expensive, stages of the process.

This sounds a lot like Unit Testing to me: writing a small part of an application, testing that part, then integrating it. And even if this is not strictly TDD (is test-first even possible with hardware?), it sounds similar, and it is the opposite of writing all the code first and the tests last.
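To make the analogy concrete, here is a minimal sketch of the bottom-up approach in Python. Everything in it is my own illustration and not from the report: the functions `parse_price`, `apply_discount` and `checkout_total` are hypothetical, and prices are assumed to be simple non-negative strings such as "12.50". The two small parts are tested on their own first, and the integrated function is built only from parts that already passed.

```python
import unittest


# Hypothetical small part 1: turn a price string such as "12.50" into cents.
def parse_price(text: str) -> int:
    whole, _, fraction = text.strip().partition(".")
    # Pad or cut the fraction to exactly two digits ("5" -> "50", "" -> "00").
    return int(whole) * 100 + int((fraction + "00")[:2])


# Hypothetical small part 2: reduce an amount in cents by a whole percentage.
def apply_discount(cents: int, percent: int) -> int:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return cents * (100 - percent) // 100


# Integration: assembled only from the two parts tested below.
def checkout_total(prices: list[str], percent: int) -> int:
    return apply_discount(sum(parse_price(p) for p in prices), percent)


class PartsTest(unittest.TestCase):
    """Cheap tests of the individual parts, like the experimental rigs."""

    def test_parse_price(self):
        self.assertEqual(parse_price("12.50"), 1250)
        self.assertEqual(parse_price("3"), 300)

    def test_apply_discount(self):
        self.assertEqual(apply_discount(1250, 10), 1125)
        with self.assertRaises(ValueError):
            apply_discount(1250, 101)


class IntegrationTest(unittest.TestCase):
    """The assembled 'engine', tested after its parts are understood."""

    def test_checkout_total(self):
        self.assertEqual(checkout_total(["12.50", "3"], 10), 1395)
```

Saved as a module, the tests can be run with `python -m unittest`. If the integration test fails while both part tests pass, the fault is most likely in how the parts are wired together, which is the "easily isolated" failure Feynman describes.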

Compare this approach with the way NASA designed the Shuttle Main Engine:

The Space Shuttle Main Engine was handled in a different manner, top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the material and components. Then when troubles are found in the bearings, turbine blades, coolant pipes, etc., it is more expensive and difficult to discover the causes and make changes. For example, cracks have been found in the turbine blades of the high pressure oxygen turbopump. Are they caused by flaws in the material, the effect of the oxygen atmosphere on the properties of the material, the thermal stresses of startup or shutdown, the vibration and stresses of steady running, or mainly at some resonance at certain speeds, etc.? How long can we run from crack initiation to crack failure, and how does this depend on power level? Using the completed engine as a test bed to resolve such questions is extremely expensive. One does not wish to lose an entire engine in order to find out where and how failure occurs. Yet, an accurate knowledge of this information is essential to acquire a confidence in the engine reliability in use. Without detailed understanding, confidence can not be attained.

A further disadvantage of the top-down method is that, if an understanding of a fault is obtained, a simple fix, such as a new shape for the turbine housing, may be impossible to implement without a redesign of the entire engine.

This sounds a lot like traditional, up-front software development, with the same problems. When errors occur, "are they caused by flaws in the material [...]" or where do they come from? In a complex system it's hard to decide which component is the root cause of an error. Astonishingly, Feynman sees a second disadvantage of top-down compared with bottom-up that also has a counterpart in software: the problems that arise may be too big to fix in a conventional way, so the engine architecture needs to be redesigned. This happens with software too. If you do too much up-front architecture, you may end up with an architecture that doesn't fit your problems, which usually means a long and difficult rewrite (something you should only do as a last resort). Going bottom-up, ideally with Test Driven Development (TDD), you are much less likely to end up with the wrong architecture, provided you keep making merciless small refactorings and path adjustments along the way. And an architecture that was driven by unit testing is usually flexible enough to let you react to the changes you meet on the way (scalability, performance etc.).
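The root-cause problem can be shown with a small sketch. This is again my own hypothetical example in Python, not something from the report: a three-stage word counter in which I have planted a bug in one stage on purpose. Testing only the assembled pipeline tells you that the result is wrong; checking each component separately tells you where.

```python
# Hypothetical three-stage pipeline with one deliberately planted bug.

def normalize(text: str) -> str:
    return text.strip().lower()


def tokenize(text: str) -> list[str]:
    # Planted bug for the demonstration: splits on commas, not whitespace.
    return text.split(",")


def count_words(tokens: list[str]) -> int:
    return len(tokens)


def word_count(text: str) -> int:
    """The 'completed engine': all three stages run together."""
    return count_words(tokenize(normalize(text)))


# One cheap check per component, each with its own known input and output.
COMPONENT_CHECKS = [
    ("normalize", lambda: normalize("  Hello World ") == "hello world"),
    ("tokenize", lambda: tokenize("hello world") == ["hello", "world"]),
    ("count_words", lambda: count_words(["hello", "world"]) == 2),
]


def failing_components() -> list[str]:
    """Return the names of the components whose own check fails."""
    return [name for name, check in COMPONENT_CHECKS if not check()]


if __name__ == "__main__":
    # Top-down view: we only learn that the whole thing is wrong.
    print("end-to-end correct:", word_count("  Hello World ") == 2)
    # Bottom-up view: the faulty part is named directly.
    print("failing components:", failing_components())
```

With the planted bug, the end-to-end check fails without saying why, while the component checks point straight at `tokenize`. In a real system the stages are bigger and the bug is not planted, but the bookkeeping is the same.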

Compared side by side, the success of conventional engine development and the problems of the Shuttle engine show convincingly how developing in small steps, component by component and with merciless testing, results in easy-to-debug components with a low error rate. You should test more.

Thanks for listening. As ever, please do share your thoughts and additional tips in the comments below, or on your own blog (I have trackbacks enabled).

Stephan Schmidt Administrator
CTO Coach, svese
Stephan is a CTO coach. He has been a coder since the early 80s, has founded several startups and has worked in small and large companies as CTO. After he sold his latest startup he took up CTO coaching. You can find him on LinkedIn or follow him on Twitter.