Is that Maximum Tolerable Outage, Maximum Tolerable Disruption, or What?

Maximum tolerable outage (MTO) is a common measure in both disaster recovery and business continuity. It is the maximum amount of time a system or resource can remain unavailable before its loss starts to have an unacceptable impact on the goals or the survival of an organisation. The measure treats a system as either on or off: if it is off, it has to be put back on before the defined MTO expires. Yet a system may continue to work in a degraded mode, which is neither on nor off: it runs more slowly or less efficiently, produces more errors, or suffers from similar problems. Where does MTO stand when something is neither fully on nor fully off?

The BS 25999 standard offers a partial answer. It uses the name “Maximum Tolerable Period of Disruption” (MTPOD) rather than maximum tolerable outage. Taken literally, this covers not just systems that have stopped working, but also systems that are not working as they normally should. The distinction matters, because organisations can cease to function gradually over months or even years, just as they can cease to function abruptly within hours. For example, IT systems that fail to control product shipments properly can lead to customer dissatisfaction, loss of revenue and, ultimately, company failure.

Should the concept be taken further? To avoid “death by a thousand cuts”, it may be useful to define a “Maximum Tolerable Degradation”, combining a quantitative performance threshold with a defined time interval. Such an “MTDEG” measure would limit how long a warehouse could operate with shipment quality below a calculated minimum, or how long a school could operate with room temperatures 2 or 3 degrees outside accepted norms. The point is to avoid the trap of assuming that, just because something isn’t fully off, it must be running as normal.
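To make the difference concrete, here is a minimal sketch contrasting an on/off MTO check with the proposed MTDEG check. MTDEG has no standard definition, so everything here is an assumption for illustration: performance is sampled over time as a fraction of normal (1.0 = normal, 0.0 = fully off), a breach occurs when a continuous run below a threshold lasts longer than the allowed interval, and the names `Sample`, `longest_run`, `mto_breached` and `mtdeg_breached`, as well as the 95% floor and 14-day limit, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    hour: float         # time of observation, in hours (hypothetical unit)
    performance: float  # fraction of normal performance: 1.0 = normal, 0.0 = fully off


def longest_run(samples, is_bad):
    """Longest continuous span (in hours) during which is_bad(sample) holds.

    A run starts at the first bad sample and ends at the next good sample,
    or at the last sample if it never recovers.
    """
    longest = 0.0
    start = None
    for s in samples:
        if is_bad(s):
            if start is None:
                start = s.hour
        elif start is not None:
            longest = max(longest, s.hour - start)
            start = None
    if start is not None:
        longest = max(longest, samples[-1].hour - start)
    return longest


def mto_breached(samples, mto_hours):
    # Classic MTO view: only a complete outage (performance == 0) counts.
    return longest_run(samples, lambda s: s.performance == 0.0) > mto_hours


def mtdeg_breached(samples, min_performance, mtdeg_hours):
    # Hypothetical MTDEG view: any period below the agreed floor counts.
    return longest_run(samples, lambda s: s.performance < min_performance) > mtdeg_hours


# A warehouse running at 80% shipment quality for 30 days, sampled daily, never fully off.
samples = [Sample(float(h), 0.8) for h in range(0, 721, 24)]
print(mto_breached(samples, mto_hours=48))                               # False
print(mtdeg_breached(samples, min_performance=0.95, mtdeg_hours=24 * 14))  # True
```

The MTO check never fires because the warehouse is never fully off, while the MTDEG check flags the month of sustained degradation: the “death by a thousand cuts” the on/off view misses.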