High Availability

The ability of a system to stay online and keep serving requests despite failures at the infrastructure level.
High availability improves the reliability of the system by keeping downtime to a minimum.
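
To make "minimum downtime" concrete, an availability target can be converted into a yearly downtime budget. The sketch below is only an illustration: the function name `downtime_hours_per_year` and the sample targets are assumptions, not part of these notes, and a year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in hours) allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


# Illustrative targets only; real targets depend on the service.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why higher targets become progressively harder to meet.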

Why do systems go down?

  • Software crashes
  • Hardware failures
  • Human errors, such as misconfiguration
  • Planned downtime

Achieving HA

Eliminate single points of failure

  • System level: Deploy redundant nodes so that no single node is a point of failure.
    If a node goes down, a redundant node takes its place.
    Thus, the system as a whole remains unaffected.
  • Application level: Bottlenecks act as single points of failure, so they must be identified and removed.
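
The effect of redundant nodes can be estimated with simple arithmetic. The sketch below assumes that nodes fail independently and that the system is up as long as at least one node is up; both are simplifying assumptions, and `combined_availability` is a hypothetical helper name.

```python
def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of `nodes` redundant nodes, assuming independent failures.

    The system is considered down only when every node is down at once.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("at least one node is required")
    return 1.0 - (1.0 - node_availability) ** nodes


# Illustrative input: each node is assumed to be up 99% of the time.
for n in (1, 2, 3):
    print(n, "node(s):", round(combined_availability(0.99, n), 6))
```

In practice failures are often correlated (shared power, network, or deployment mistakes), so the real gain is usually smaller than this estimate.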

Fault Tolerance

The ability of a system to keep running, possibly at a reduced level, instead of going down entirely when internal failures occur.

  • To achieve HA at the application level, a large monolithic service is broken down into smaller, loosely coupled services.
  • Each of these microservices has a single responsibility, so even if a few services go down, the application as a whole stays up.
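
Working at a reduced level can be sketched as a page that is assembled from several independent services and still renders when one of them fails. The service names and the `render_page` helper below are hypothetical, and the outage is simulated by raising `ConnectionError`.

```python
from typing import Callable, Dict


def fetch_catalog() -> str:
    # Hypothetical critical service that is currently healthy.
    return "catalog: 3 items"


def fetch_recommendations() -> str:
    # Hypothetical non-critical service; the outage is simulated.
    raise ConnectionError("recommendations service is down")


def render_page(services: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Call every service and degrade gracefully when one is unreachable."""
    page = {}
    for name, call in services.items():
        try:
            page[name] = call()
        except ConnectionError:
            # The failed section is replaced by a placeholder; the rest still renders.
            page[name] = "unavailable"
    return page


page = render_page({
    "catalog": fetch_catalog,
    "recommendations": fetch_recommendations,
})
print(page)
```

Because the services are loosely coupled, the failure is contained to one section of the response instead of failing the whole request.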

Redundancy (Active-Passive HA)

Duplicating components or instances and keeping the duplicates on standby so that they can take over if the active instances go down.
Also known as Active-Passive HA mode.
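
A minimal sketch of active-passive failover follows. The `Node` and `ActivePassivePair` classes are hypothetical, and a node failure is simulated with a `healthy` flag rather than a real network error.

```python
class Node:
    """Hypothetical service instance whose failure is simulated with a flag."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ActivePassivePair:
    """Send all traffic to the active node; promote the standby on failure."""

    def __init__(self, active: Node, standby: Node) -> None:
        self.active = active
        self.standby = standby

    def handle(self, request: str) -> str:
        try:
            return self.active.handle(request)
        except ConnectionError:
            # Fail over: the standby becomes active and the failed node
            # becomes the standby. If both are down, the error propagates.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


primary, backup = Node("primary"), Node("standby")
pair = ActivePassivePair(primary, backup)
print(pair.handle("r1"))
primary.healthy = False  # simulate a crash of the active node
print(pair.handle("r2"))
```

The standby does no work until the failover, which keeps the design simple at the cost of idle capacity.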

Replication (Active-Active HA)

Duplicating the components or instances and running them together, sharing workload.
There are no standby or passive instances.
When a single or a few nodes go down, the remaining nodes bear the load of the service.
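
The sketch below illustrates active-active operation with a simple round-robin over the nodes that are currently healthy. `ActiveActivePool` and its methods are hypothetical names, and health is tracked by hand rather than by real health checks.

```python
from itertools import count
from typing import Iterable


class ActiveActivePool:
    """Spread requests over all healthy nodes; there is no idle standby."""

    def __init__(self, names: Iterable[str]) -> None:
        self.healthy = {name: True for name in names}
        self._counter = count()

    def mark_down(self, name: str) -> None:
        self.healthy[name] = False

    def route(self) -> str:
        """Return the name of the node that should serve the next request."""
        live = [name for name, ok in self.healthy.items() if ok]
        if not live:
            raise RuntimeError("no healthy nodes left")
        return live[next(self._counter) % len(live)]


pool = ActiveActivePool(["a", "b", "c"])
print([pool.route() for _ in range(3)])
pool.mark_down("b")  # simulate a node failure
print([pool.route() for _ in range(4)])
```

After a node is marked down, its share of the traffic is absorbed by the remaining nodes, so each of them must have enough spare capacity to carry the extra load.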

HA Clustering (Fail-over cluster)

A cluster is a set of nodes that run in conjunction with each other, with the remaining nodes taking over if one node fails.
State across nodes is maintained with the help of shared memory.
Nodes are connected by a private network, called the heartbeat network, which continuously monitors the health and status of each node. A coordination service such as Apache ZooKeeper can be used to track which nodes are alive.
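
Heartbeat-based failure detection can be sketched as follows: each node reports in periodically, and a node that stays silent for longer than a timeout is treated as failed. `HeartbeatMonitor` is a hypothetical class, the timeout value is arbitrary, and timestamps are passed in explicitly to keep the example deterministic; this is not how ZooKeeper itself is implemented.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Track the last heartbeat of each node and report silent nodes."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.beat("node-a", now=0.0)
monitor.beat("node-b", now=0.0)
monitor.beat("node-a", now=4.0)  # node-b has gone silent
print(monitor.failed_nodes(now=5.0))
```

Once a node is reported as failed, the cluster can trigger a failover so that a surviving node takes over its work.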