Fault Tolerance and System Resilience Questions

This topic covers designing systems that anticipate, tolerate, contain, and recover from component and network failures while minimizing customer impact and preserving correctness. Topics include identifying common failure modes and single points of failure; redundancy and isolation patterns at the hardware, service, and geographic levels; and failover strategies, including active-active and active-passive. Cover retry policies with exponential backoff, timeouts, the circuit breaker and bulkhead patterns, graceful degradation, rate limiting, and backpressure techniques that protect systems during overload. Discuss orchestration of node rejoin and state rebuild, replication strategies and consistency trade-offs, leader election and consensus implications, and techniques to avoid and mitigate split brain. Explain monitoring, health checks, alerting, and metrics such as mean time to recovery (MTTR) and mean time between failures (MTBF) that guide operational improvements. Include resilience testing through chaos engineering and fault injection, handling of flaky components in test environments, analysis of past failures and refactoring for resiliency, and operational practices that reduce blast radius and speed recovery.
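As one illustration of the retry policies mentioned above, here is a minimal sketch of retry with capped exponential backoff and full jitter. The function name `retry_with_backoff`, its parameters, and the default set of retryable exceptions are assumptions made for this example, not part of any particular library. The `sleep` and `rng` arguments are injectable only so the behavior can be checked without real waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.1,   # first delay bound in seconds (assumed value)
    cap: float = 5.0,    # upper limit on any single delay (assumed value)
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call op(), retrying transient failures with exponential backoff.

    The delay bound doubles on each attempt (base * 2**attempt, capped),
    and the actual sleep is a random fraction of that bound ("full
    jitter") so that many clients do not retry in lockstep.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            bound = min(cap, base * (2 ** attempt))
            sleep(rng() * bound)
    raise AssertionError("unreachable")
```

Only exceptions listed in `retryable` are retried; anything else propagates immediately, which keeps non-transient errors (for example, validation failures) from being retried pointlessly. In practice this would usually be paired with a per-call timeout and a circuit breaker so that retries do not amplify an overload.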

Medium | System Design

Design a quota and rate-limiting system for submission of distributed training jobs to a shared GPU cluster. Include per-team quotas, priority preemption rules, delay queues, backpressure feedback to submitters, and fair-share algorithms to prevent cluster overload.
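One possible starting point for the fair-share part of this question is max-min fairness: split capacity evenly among teams, and redistribute whatever a team does not need to the teams that still want more. The sketch below is illustrative only; the function name `max_min_fair_share`, the use of whole GPUs as the unit, and the name-order tie-break are assumptions, and it ignores priorities, preemption, and weights.

```python
from typing import Dict


def max_min_fair_share(capacity: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Allocate whole GPUs to teams using max-min fairness.

    Each round splits the remaining capacity evenly among teams whose
    demand is not yet met. Teams that need less than an even share keep
    only what they asked for, and the leftover is shared in later rounds.
    """
    alloc = {team: 0 for team in demands}
    remaining = capacity
    unsatisfied = {team for team, d in demands.items() if d > 0}

    while remaining > 0 and unsatisfied:
        share = remaining // len(unsatisfied)
        if share == 0:
            # Fewer GPUs left than waiting teams: give one each,
            # in name order (an arbitrary tie-break for this sketch).
            for team in sorted(unsatisfied)[:remaining]:
                alloc[team] += 1
            break
        for team in sorted(unsatisfied):
            give = min(share, demands[team] - alloc[team])
            alloc[team] += give
            remaining -= give
        unsatisfied = {t for t in unsatisfied if alloc[t] < demands[t]}

    return alloc
```

For example, with 10 GPUs and demands of 2, 4, and 10, the small requests are met in full and the largest team receives the remainder, so no team can gain without taking from a team that has less.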

Hard | System Design

Architect a globally distributed serving platform for a large language model (tens of GBs) that must handle 1M requests/min with 200ms P95 latency across three geographic regions. Cover model sharding/replication, GPU autoscaling, inference caching, cold-start strategies, multi-region failover, personalization data consistency, and privacy constraints.

Medium | Technical

Design observability for a model-serving platform: list concrete SLIs and SLOs you would track (latency percentiles, error rates, model quality metrics, input distribution drift, and feature coverage). Explain alert thresholds, dashboards, and automated mitigations to reduce toil and MTTR.
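When discussing alert thresholds for an availability or error-rate SLO, one common approach is to alert on error-budget burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a simple request-count SLI; the function names and the paging threshold of 14.4 are illustrative assumptions, and requiring both a long and a short window to exceed the threshold is one way to avoid paging on a spike that has already ended.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive exactly at the rate the SLO
    allows; higher values mean the budget will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, so no budget consumed
    allowed_error_rate = 1.0 - slo_target
    return (errors / total) / allowed_error_rate


def should_page(
    long_window: float,
    short_window: float,
    threshold: float = 14.4,  # assumed paging threshold for this sketch
) -> bool:
    """Page only if both windows burn faster than the threshold."""
    return long_window >= threshold and short_window >= threshold
```

For a 99.9% SLO, 200 errors out of 10,000 requests in a window gives a burn rate of roughly 20, which would exceed the assumed threshold if the short window agrees.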

Hard | System Design

For personalized recommendations requiring low latency and high availability across regions, evaluate consistency models (strong, causal, eventual). Propose a hybrid architecture that balances freshness and availability, describing how writes, caches, and conflict resolution would operate.

Medium | System Design

Design a distributed rate-limiter for an inference API that must support 100k RPS globally, enforce per-tenant and global limits, ensure fair sharing, and avoid single points of failure. Describe algorithm choices (token bucket, leaky bucket, sliding window), storage/backends, and how to handle bursts and clock skew.
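Of the algorithms named in the question, the token bucket is often the easiest to reason about for bursts: tokens refill at a steady rate up to a maximum, and each request spends one. The following is a minimal single-process sketch, not a distributed design; the class name `TokenBucket` and its parameters are assumptions for illustration. It uses a monotonic clock by default and ignores backward time steps, which addresses local clock jumps but not skew between nodes.

```python
import time
from typing import Callable


class TokenBucket:
    """Single-process token bucket: `rate` tokens/second, up to `burst`."""

    def __init__(
        self,
        rate: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return whether the request may proceed."""
        now = self.clock()
        if now > self.last:
            # Refill for the elapsed time, never exceeding the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
        # If the clock appears to have gone backward, skip the refill
        # rather than crediting or debiting tokens.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A per-tenant limiter could keep one bucket per tenant ID plus a shared bucket for the global limit. In a distributed deployment, the bucket state would typically live in a shared store or be approximated with local buckets that periodically sync, which trades strict accuracy for the absence of a single point of failure.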
