A Theorem for IT Disaster Recovery – But With Practical Application

If you look through the literature on disaster recovery (DR), you’ll probably see that practical ideas, recommendations and methods abound, but that theory is in rather shorter supply. This makes sense: all those IT systems and networks are running now, so if they break, you’ll want some good ‘cookbooks’ or ‘how-tos’ for mending them rapidly. However, with DR management comes DR planning, which is the chance to step back and better understand the key principles that govern effective DR. The CAP theorem for distributed IT systems is one such principle. Better still, it’s simple to grasp and has immediate practical application.

In a nutshell, the CAP theorem says that a distributed system cannot fully guarantee all three of consistency (C: every read sees the most recent write), availability (A: every request gets a response) and partition tolerance (P: the system keeps working when the network links between its nodes fail). When a partition does occur, you have to trade consistency against availability. Conventional relational or SQL databases, for example, typically do well on the first two (C and A). However, they have a reputation for being difficult to scale out, especially when you try to run such a database over a distributed network of systems. They therefore have lower partition tolerance, meaning they are less able to keep operating if their nodes lose contact with one another. Your disaster recovery planning in this case is likely to start off without a distributed architecture for that database, because to use one would be to invite disaster.
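The trade-off can be seen in a toy model. The sketch below is illustrative only and does not reflect how any real database is built: the `TwoNodeStore` class, its `mode` setting and the `run_partition_demo` helper are hypothetical names invented for this example. It keeps two replicas in step while the link between them is up, then shows what each choice costs once the link is cut.

```python
class TwoNodeStore:
    """Toy store with two replicas and a network link that can be cut."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replica_a = {}
        self.replica_b = {}
        self.link_up = True

    def write_to_a(self, key, value):
        """Write via replica A; return True if the write was accepted."""
        if self.link_up:
            # Normal operation: both replicas receive the write.
            self.replica_a[key] = value
            self.replica_b[key] = value
            return True
        if self.mode == "CP":
            # Partitioned, consistency first: refuse the write.
            return False
        # Partitioned, availability first: accept locally, replicas diverge.
        self.replica_a[key] = value
        return True

    def is_consistent(self):
        return self.replica_a == self.replica_b


def run_partition_demo(mode):
    """Write once normally, cut the link, write again; report the outcome."""
    store = TwoNodeStore(mode)
    store.write_to_a("balance", 100)
    store.link_up = False  # simulate the network partition
    accepted = store.write_to_a("balance", 50)
    return accepted, store.is_consistent()


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        accepted, consistent = run_partition_demo(mode)
        print(mode, "write accepted:", accepted, "replicas agree:", consistent)
```

In ‘CP’ mode the store turns the second write away and the replicas still agree; in ‘AP’ mode the write goes through but the two copies now differ and will need reconciling once the link is restored. That reconciliation step is exactly the kind of thing a DR plan needs to spell out.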

Are there other systems that do better for partition tolerance? Yes: many of the newer generation of databases known as NoSQL (standing for ‘Not Only SQL’) can offer much higher tolerance to partitioning (P). But you’ll have to give up some capability in terms of either consistency (C) or availability (A). However, Eric Brewer, who formulated the CAP theorem back in 2000, has pointed out that it is more about a continuum than about ‘all or nothing’ states. You can choose how you balance C, A and P at system, subsystem and even data level. So while you can’t escape the CAP theorem, its insights can give you considerable flexibility in planning your system architecture and your disaster recovery.
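One common way to make that choice at data level is with replica quorums. The sketch below is a simplified illustration, not the configuration of any particular product: `QuorumPolicy`, the `POLICIES` table and the data class names ‘payments’ and ‘page_views’ are all assumptions made up for this example. It relies on the standard quorum rule that, with N replicas, W write acknowledgements and R read replies, reads and writes are guaranteed to overlap on at least one replica when R + W > N.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumPolicy:
    n: int  # total number of replicas
    w: int  # acknowledgements required before a write succeeds
    r: int  # replies required before a read succeeds

    def reads_see_latest_write(self):
        # Read and write quorums overlap on at least one replica.
        return self.r + self.w > self.n

    def can_write(self, reachable):
        return reachable >= self.w

    def can_read(self, reachable):
        return reachable >= self.r


# Hypothetical per-data-class choices: favour consistency for payments,
# favour availability for page-view counters.
POLICIES = {
    "payments": QuorumPolicy(n=3, w=2, r=2),
    "page_views": QuorumPolicy(n=3, w=1, r=1),
}

if __name__ == "__main__":
    reachable = 1  # a partition leaves this client with one replica in reach
    for name, policy in POLICIES.items():
        print(
            name,
            "consistent reads:", policy.reads_see_latest_write(),
            "writable during partition:", policy.can_write(reachable),
        )
```

With only one replica in reach, the ‘payments’ policy blocks writes rather than risk conflicting records, while ‘page_views’ stays writable at the cost of possibly stale reads. Settings like these can be reviewed data class by data class as part of DR planning.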