Fail Early – a Mantra for Business Continuity Test Scenarios

In a business climate that used to be orientated firmly towards success, the notion of constructive failure has changed attitudes, opening up new possibilities for progress by freeing organisations from the idea that all failure is bad. There’s a message for business continuity test scenarios as well: the “fail early, fail cheaply” mantra of entrepreneurs and innovators, who know that it takes a few false starts to home in on the winning formula. One company, Netflix, has taken this to heart with an in-house tool called the “Chaos Monkey”.

Netflix is a US-based provider of on-demand streaming media over the web (it also sends out DVDs by post), with over 23 million subscribers as of April 2011. Also in 2011, the company was the biggest generator of North American Internet traffic, responsible for over 24% of it. Amazon Web Services (AWS) forms part of the infrastructure for Netflix, which naturally enough has a vested interest in making sure that its service to subscribers is continually and satisfactorily available. Yet, in line with the “fail early” mantra, the company’s approach to business continuity test scenarios is that “the best way to avoid failure is to fail constantly”.

To that end, the “Chaos Monkey” randomly terminates processes and servers within the AWS infrastructure that Netflix uses. The goal is to maintain satisfactory service even in the event of component failure. For instance, even if the personalised film choices cannot be displayed, the systems will at least give subscribers a list of popular titles. The Chaos Monkey wreaks its havoc on real, live systems too, because that’s where Netflix sees the most useful (or even the only useful) business continuity test scenarios. By constantly and deliberately provoking failures early, the company gives itself a better chance of “business as usual” if something other than the Chaos Monkey messes with the infrastructure.
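
The idea of random termination paired with a graceful fallback can be illustrated with a minimal Python sketch. It is not Netflix's actual tool: the names RecommendationService, chaos_monkey, titles_for and POPULAR_TITLES are all hypothetical, and a simple "alive" flag stands in for a real server or process being shut down.

```python
import random

# Hypothetical fallback list, shown when personalisation is unavailable.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


class ServiceUnavailable(Exception):
    """Raised when a terminated service is called."""


class RecommendationService:
    """Hypothetical stand-in for a personalisation backend."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def personalised_titles(self, subscriber_id):
        if not self.alive:
            raise ServiceUnavailable(f"{self.name} is down")
        return [f"Pick {n} for {subscriber_id}" for n in (1, 2, 3)]


def chaos_monkey(services, rng=random):
    """Terminate one randomly chosen service and return it."""
    victim = rng.choice(services)
    victim.alive = False
    return victim


def titles_for(subscriber_id, service):
    """Return personalised titles, degrading to popular titles on failure."""
    try:
        return service.personalised_titles(subscriber_id)
    except ServiceUnavailable:
        return POPULAR_TITLES
```

Calling titles_for before and after chaos_monkey has terminated the service shows the degradation: the subscriber first gets personalised picks, then the list of popular titles, but never an error. Running that kind of check continually, rather than once a year, is the point of the "fail constantly" approach.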