Ability of a system to stay online despite failures at the infrastructure level.
Improves the reliability of the system and keeps downtime to a minimum.
Why do systems go down?
- Software crashes
- Hardware failures
- Human errors, such as misconfiguration
- Planned downtime
- System level: When many redundant nodes are deployed, there is no single point of failure. If a node goes down, redundant nodes take its place, so the system as a whole remains unaffected.
- Application level: Bottlenecks are the single points of failure.
Ability of a system to stay up, possibly at a reduced level of service, rather than go down entirely when internal failures occur.
- To achieve HA at the application level, a large monolithic service is broken down into smaller, loosely coupled services.
- These microservices each have a single responsibility; this ensures that even if a few services go down, the application as a whole is still up.
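A minimal sketch of running at a reduced level when one service is down. The function names (`fetch_recommendations`, `render_home_page`) and the page layout are hypothetical, chosen only for illustration; the failing service is simulated by raising `ConnectionError`.

```python
def fetch_recommendations(user_id: str) -> list[str]:
    """Hypothetical call to a separate recommendation service (simulated as down)."""
    raise ConnectionError("recommendation service is unavailable")


def render_home_page(user_id: str) -> dict:
    """Build the page; fall back to an empty section if a dependency fails."""
    page = {"user": user_id, "feed": ["post-1", "post-2"], "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except ConnectionError:
        # The core feed is still served; only the optional section is dropped.
        page["recommendations"] = []
        page["degraded"] = True
    return page


page = render_home_page("user-42")
```

The page is still returned with its core feed, which is the point of keeping the services loosely coupled: a failure in one of them reduces functionality instead of taking the whole application down.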
Duplicating the components or instances and keeping them on standby to take over in case the active instances go down.
Also known as Active-Passive HA mode.
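Active-Passive failover can be sketched roughly as follows. The `Instance` and `ActivePassivePair` classes are hypothetical, and health is a simple flag here; a real setup would rely on health checks and would also have to keep the standby's state in sync.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True


class ActivePassivePair:
    """Hypothetical helper: one active instance plus one standby."""

    def __init__(self, active: Instance, standby: Instance) -> None:
        self.active = active
        self.standby = standby

    def route(self) -> Instance:
        """Return the instance that should serve traffic, failing over if needed."""
        if not self.active.healthy and self.standby.healthy:
            # Promote the standby; the failed instance becomes the new standby.
            self.active, self.standby = self.standby, self.active
        if not self.active.healthy:
            raise RuntimeError("no healthy instance available")
        return self.active


pair = ActivePassivePair(Instance("primary"), Instance("standby"))
first = pair.route().name      # traffic goes to the active instance
pair.active.healthy = False    # simulate a crash of the active instance
second = pair.route().name     # the standby has taken over
```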
Duplicating the components or instances and running them together, sharing workload.
There are no standby or passive instances.
When one or a few nodes go down, the remaining nodes bear the load of the service.
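A rough sketch of this mode, assuming a simple round-robin policy (the notes do not prescribe one). The `ActiveActivePool` class and the node names are hypothetical.

```python
class ActiveActivePool:
    """Hypothetical pool in which every healthy node serves traffic."""

    def __init__(self, nodes: list[str]) -> None:
        self.health = {node: True for node in nodes}
        self._counter = 0

    def mark_down(self, node: str) -> None:
        self.health[node] = False

    def pick(self) -> str:
        """Choose the next healthy node in round-robin order."""
        live = [node for node, ok in self.health.items() if ok]
        if not live:
            raise RuntimeError("no healthy nodes left")
        node = live[self._counter % len(live)]
        self._counter += 1
        return node


pool = ActiveActivePool(["a", "b", "c"])
before = [pool.pick() for _ in range(3)]  # all three nodes share the work
pool.mark_down("b")                       # simulate a node failure
after = [pool.pick() for _ in range(4)]   # remaining nodes absorb the load
```

Because no node is idle, the surviving nodes need enough spare capacity to carry the extra load after a failure.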
A cluster is a set of nodes that run in conjunction with each other, with the remaining nodes taking over if one node fails.
State across nodes is maintained with the help of shared memory.
Nodes are connected by a private network, called the heartbeat network, which continuously monitors the health and status of each node. A coordination service such as Apache ZooKeeper can be used to track node liveness in a similar way.
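Heartbeat-based health monitoring can be sketched as follows. The `HeartbeatMonitor` class, the 3-second timeout, and the node names are assumptions made for illustration; timestamps are passed in explicitly so the example is deterministic, whereas a real monitor would read a clock.

```python
class HeartbeatMonitor:
    """Hypothetical monitor: a node is considered failed once it stays silent too long."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat received from a node at time `now` (seconds)."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list[str]:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.beat("node-1", now=0.0)
monitor.beat("node-2", now=0.0)
monitor.beat("node-1", now=4.0)           # node-2 has gone silent
failed = monitor.failed_nodes(now=5.0)
```

Once a node shows up in `failed_nodes`, the rest of the cluster can take over its work.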