High Availability and Disaster Recovery Questions

This topic covers designing systems that remain available and recoverable through infrastructure failures, outages, and disasters. Candidates should be able to define and reason about Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and translate service level agreement goals, from 99.9 percent to 99.999 percent availability, into architecture choices. Core topics include redundancy strategies such as N+1 and N+2, active-active and active-passive deployment patterns, multi-availability-zone and multi-region topologies, and the trade-offs between same-region high availability and cross-region disaster recovery. Discuss load balancing and traffic shaping, redundant load balancer design, and algorithms such as round robin, least connections, and consistent hashing. Explain failover detection, health checks, automated versus manual failover, convergence and recovery timing, and the orchestration of failover and rerouting. Cover backup, snapshot, and restore strategies; replication and consistency trade-offs for stateful components; leader election and split-brain mitigation; runbooks and recovery playbooks; disaster recovery testing and drills; and cost and operational trade-offs. Include capacity planning, autoscaling, network redundancy, and security and infrastructure hardening, so that identity, key management, and logging remain available and recoverable. Emphasize monitoring, observability, alerting on availability signals, and validation through chaos engineering and regular failover exercises.

Easy, Technical

Explain these load-balancing algorithms: round robin, least connections, and consistent hashing. For each algorithm describe a scenario where it is optimal and one where it causes problems, particularly for stateful backends or user session affinity.
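
As a starting point for an answer, the three algorithms can be sketched in a few lines of Python. This is a minimal illustration, not a production load balancer: the backend names, the `ConsistentHashRing` class, the `pick_least_connections` helper, and the choice of 100 virtual nodes per backend are all assumptions made for the example. It shows the property that matters for session affinity: when a backend is removed from a consistent-hash ring, only the keys that backend owned are remapped.

```python
import bisect
import hashlib
from itertools import cycle


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Hypothetical ring with virtual nodes to smooth the key distribution."""

    def __init__(self, nodes, vnodes=100):
        ring = [(_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)]
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._nodes = [n for _, n in ring]

    def pick(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._nodes[idx]


def pick_least_connections(active: dict) -> str:
    """Choose the backend with the fewest in-flight connections."""
    return min(active, key=active.get)


backends = ["app-1", "app-2", "app-3"]

# Round robin: rotate through the backends regardless of load or key.
rr = cycle(backends)
round_robin_picks = [next(rr) for _ in range(4)]

# Consistent hashing: count how many session keys move when app-3 is removed.
full = ConsistentHashRing(backends)
reduced = ConsistentHashRing(["app-1", "app-2"])
keys = [f"session-{i}" for i in range(500)]
moved = sum(1 for k in keys if full.pick(k) != reduced.pick(k))
owned_by_removed = sum(1 for k in keys if full.pick(k) == "app-3")
print(f"{moved} of {len(keys)} keys moved; {owned_by_removed} belonged to app-3")
```

Round robin and least connections ignore the request key, so a stateful backend sees a given user's requests scattered across instances unless sessions are stored externally or the balancer adds stickiness. Consistent hashing keeps a key on one backend but can concentrate load when a few keys are very hot.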

Easy, Technical

Explain error budgets and the relationship between SLOs, SLIs, and SLAs. Show with a concrete calculation how you derive an error budget for a monthly SLO and describe operational actions you should take when the error budget is being consumed rapidly.
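
The calculation this question asks for can be sketched as follows. A 99.9 percent SLO over a 30-day window allows 0.1 percent of 43,200 minutes, which is about 43.2 minutes of unavailability. The function names, the 30-day window, and the time-based (rather than request-based) budget are assumptions for illustration; many teams measure the budget in failed requests instead.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a time-based SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def burn_rate(bad_minutes: float, elapsed_days: float,
              slo_percent: float, window_days: int = 30) -> float:
    """Fraction of budget consumed divided by fraction of window elapsed.

    A value of 1.0 means the budget runs out exactly at the end of the window;
    values above 1.0 mean it will be exhausted early if the trend continues.
    """
    consumed = bad_minutes / error_budget_minutes(slo_percent, window_days)
    elapsed = elapsed_days / window_days
    return consumed / elapsed


for slo in (99.9, 99.99, 99.999):
    print(f"SLO {slo}%: {error_budget_minutes(slo):.2f} minutes per 30 days")

# Example: 21.6 bad minutes after 3 days against a 99.9% monthly SLO.
print(f"burn rate: {burn_rate(21.6, 3, 99.9):.1f}x")
```

In the example, half of the budget is gone after a tenth of the window, so the burn rate is roughly 5. A sustained rate like that is the kind of signal that usually justifies paging, pausing risky releases, and prioritizing reliability work.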

Medium, Technical

Write a health-check script in Python that performs three checks: 1) an HTTP GET to /health that expects a 200 response, 2) a quick ping or query against a local cache or Redis endpoint, and 3) a check that disk utilization is under 80 percent. The script should exit 0 when healthy and non-zero when unhealthy. Describe how its use differs between liveness and readiness probes.
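
One possible shape for such a script, using only the standard library, is sketched below. The URL, port numbers, disk path, timeout, and the `PROBES` mapping are assumptions for illustration, and the Redis check assumes an unauthenticated local instance that accepts an inline `PING`. The probe mode is taken from the first command-line argument, for example `python healthcheck.py readiness` (the file name is hypothetical).

```python
import shutil
import socket
import sys
import urllib.request

HEALTH_URL = "http://127.0.0.1:8080/health"  # assumed application endpoint
REDIS_ADDR = ("127.0.0.1", 6379)             # assumed local, no-auth Redis
DISK_PATH = "/"                              # assumed volume to watch
DISK_LIMIT = 0.80                            # unhealthy at or above 80% used
TIMEOUT_S = 2.0                              # keep probes fast


def check_http(url: str = HEALTH_URL) -> bool:
    """True only if GET returns HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            return resp.status == 200
    except Exception:  # any failure (refused, timeout, 4xx/5xx) is unhealthy
        return False


def check_redis(addr=REDIS_ADDR) -> bool:
    """Send an inline PING over a raw socket and expect +PONG."""
    try:
        with socket.create_connection(addr, timeout=TIMEOUT_S) as sock:
            sock.sendall(b"PING\r\n")
            return sock.recv(16).startswith(b"+PONG")
    except OSError:
        return False


def check_disk(path: str = DISK_PATH, limit: float = DISK_LIMIT) -> bool:
    """True if the used fraction of the filesystem is under the limit."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < limit


def run_checks(checks: dict) -> int:
    """Run each named check; return 0 if all pass, 1 otherwise."""
    failed = [name for name, check in checks.items() if not check()]
    for name in failed:
        print(f"UNHEALTHY: {name}", file=sys.stderr)
    return 1 if failed else 0


# Liveness stays shallow; readiness also covers dependencies and disk.
PROBES = {
    "liveness": {"http": check_http},
    "readiness": {"http": check_http, "redis": check_redis, "disk": check_disk},
}

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_checks(PROBES[sys.argv[1]]))
```

The split reflects a common convention rather than a rule: in Kubernetes, a failed liveness probe restarts the container, so it should not fail merely because a shared dependency such as the cache is down, whereas a failed readiness probe only removes the instance from traffic until it recovers.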

Easy, Technical

Explain N+1 and N+2 redundancy strategies. For compute, load balancers, and critical network devices, give examples of how you'd size capacity, what failures each strategy protects against, and the cost and maintenance trade-offs.
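
The sizing arithmetic behind an answer can be sketched as below. N is the number of units needed to carry peak load, and the spares are added on top: N+1 tolerates one failed unit at peak, while N+2 tolerates a failure that happens while another unit is out for maintenance. The function names and the example figures (9,000 requests per second at peak, 1,000 per instance, three zones) are assumptions chosen for illustration.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float, spares: int) -> int:
    """N units to serve peak load, plus the requested number of spares."""
    n = math.ceil(peak_rps / per_instance_rps)
    return n + spares


def per_zone_capacity(peak_rps: float, zones: int, zones_lost: int = 1) -> float:
    """Capacity each zone must hold so the survivors can still carry peak load."""
    return peak_rps / (zones - zones_lost)


peak, per_instance = 9000, 1000
print("N+1 instances:", instances_needed(peak, per_instance, spares=1))
print("N+2 instances:", instances_needed(peak, per_instance, spares=2))

# Surviving the loss of one of three zones means each zone holds half of peak,
# so total provisioned capacity is 150% of peak.
zone_rps = per_zone_capacity(peak, zones=3)
print("per-zone capacity:", zone_rps, "total:", zone_rps * 3)
```

The cost trade-off falls out of the same numbers: spare units are cheap relative to N when N is large, but for devices deployed in pairs, such as load balancers or core switches, N+1 means doubling the hardware.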

Easy, Technical

List common health check types used to detect service failure (for example TCP probe, HTTP probe, application-specific checks). Provide example checks for an API that depends on a database and cache, and explain ways to avoid false positives and negatives.
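
One common way to reduce flapping from a single bad probe is to require several consecutive results before changing state, similar in spirit to the rise and fall thresholds that many load balancers expose. The sketch below is a hypothetical illustration of that idea; the class name and the default thresholds of three failures and two successes are assumptions.

```python
class ThresholdedHealth:
    """Flip state only after enough consecutive contradicting probe results."""

    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall = fall      # consecutive failures needed to mark unhealthy
        self.rise = rise      # consecutive successes needed to mark healthy
        self.healthy = True
        self._streak = 0      # consecutive results that disagree with state

    def observe(self, ok: bool) -> bool:
        """Record one probe result and return the current health state."""
        if ok == self.healthy:
            self._streak = 0  # result agrees with current state
            return self.healthy
        self._streak += 1
        threshold = self.rise if ok else self.fall
        if self._streak >= threshold:
            self.healthy = ok
            self._streak = 0
        return self.healthy


tracker = ThresholdedHealth(fall=3, rise=2)
probe_results = [True, False, True, False, False, False, True, True]
states = [tracker.observe(r) for r in probe_results]
print(states)
```

A single failed probe in the sequence does not take the instance out of rotation; three in a row do, and two successes bring it back. Higher thresholds reduce false positives at the cost of slower detection, which is the trade-off the question asks about.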
