High Availability vs Fault Tolerance vs Disaster Recovery
HA is about minimizing downtime when failures occur. FT is about minimizing downtime and continuing to operate through failures.
High Availability (HA)
- Aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period
- Instead of diagnosing a failed component in place, have a process ready to replace it, so that service is restored quickly and, ideally, automatically.
- Spare infrastructure is kept ready so that customers can be switched over to it when a failure occurs, minimizing downtime.
- User disruption is not ideal, but is allowed
- The user might have a small disruption or might need to log back in.
- Maximizing a system’s uptime
- 99.9% (three 9’s) = 8.76 hours of downtime per year.
- 99.999% (five 9’s) = 5.26 minutes of downtime per year.
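The downtime figures above follow directly from the availability percentage. A minimal Python sketch of the arithmetic, assuming a 365-day year (the function and constant names are illustrative, not from any library):

```python
# Yearly downtime allowed by an availability target.
# Assumption: a 365-day year, i.e. 8,760 hours.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(target)
        print(f"{target}% -> {hours:.2f} hours ({hours * 60:.2f} minutes) per year")
```

Each additional 9 cuts the allowed downtime by a factor of ten, which is why every extra 9 tends to be much harder and more expensive to achieve than the one before it.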
Fault-Tolerance (FT)
- The system can continue operating properly when one or more of its components fail.
- Fault tolerance is much more complicated and more expensive than high availability. Outages must be minimized and the system needs multiple levels of redundancy.
- An airplane is an example of a system that needs fault tolerance. It has more engines than it needs so it can operate through failure.
Example: A patient is undergoing life-saving surgery under anesthesia. While the patient is being monitored, the life support system is dosing medicine. It is not enough for this type of system to be highly available: even a moment of interruption could be deadly.
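To see why redundancy lets a system operate through failure, here is a rough Python sketch of the combined availability of several redundant components. It is a simplified model under two stated assumptions: components fail independently, and one working component is enough to keep the system running. The function name is illustrative.

```python
def combined_availability(component_availability: float, redundant_copies: int) -> float:
    """Availability of a group of redundant components.

    Assumptions (simplifications for illustration):
    - failures are independent of each other
    - a single working component keeps the whole system running
    """
    if not 0 <= component_availability <= 1:
        raise ValueError("component_availability must be between 0 and 1")
    if redundant_copies < 1:
        raise ValueError("redundant_copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    probability_all_fail = (1 - component_availability) ** redundant_copies
    return 1 - probability_all_fail


if __name__ == "__main__":
    for copies in (1, 2, 3):
        print(copies, "copies ->", combined_availability(0.99, copies))
```

Under these assumptions, two components that are each 99% available give roughly 99.99% combined. Real systems rarely have fully independent failures, so actual gains are usually smaller.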
Disaster Recovery (DR)
- Set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
- DR can largely be automated to reduce recovery time and the risk of human error.