Cloud & Infrastructure Topics
Cloud platform services, infrastructure architecture, Infrastructure as Code, environment provisioning, and infrastructure operations. Covers cloud service selection, infrastructure provisioning patterns, container orchestration (Kubernetes), multi-cloud and hybrid architectures, infrastructure cost optimization, and cloud platform operations. For CI/CD pipeline and deployment automation, see DevOps & Release Engineering. For cloud security implementation, see Security Engineering & Operations. For data infrastructure design, see Data Engineering & Analytics Infrastructure.
Cloud Platform Experience
Personal account of hands-on experience using public cloud providers and the concrete results delivered. Candidates should describe the specific services and patterns they used for compute, storage, networking, managed databases, serverless, and eventing, and explain their role in architecture decisions, deployments, automation and infrastructure-as-code practices, continuous integration and continuous delivery pipelines, container orchestration, scaling and performance tuning, monitoring and incident response, and cost management. Interviewees should quantify outcomes where possible with metrics such as latency reduction, cost savings, availability improvements, or deployment frequency, and note any formal training or certifications. This topic evaluates depth of practical experience, ownership, and the ability to operate and improve cloud systems in production.
Your SRE Background and Experience
Articulate your hands-on experience with systems administration, monitoring tools, automation scripts, and any incident response involvement. Be specific about technologies (e.g., Prometheus, Grafana, Kubernetes, Docker, Terraform) and concrete examples of what you've built or fixed.
Load Balancing and Horizontal Scaling
Covers principles and mechanisms for distributing traffic and scaling services horizontally. Includes load balancing algorithms such as round robin, least connections, and consistent hashing; health checks, connection draining, and sticky sessions; and session management strategies for stateless and stateful services. Explains when to scale horizontally versus vertically, capacity planning, and the trade-offs of each approach. Also includes infrastructure-level autoscaling concepts such as auto-scaling groups, launch templates, target-tracking and step-scaling policies, and how load balancers and autoscaling interact to absorb traffic spikes. Reviews different load balancer types and selection criteria, integration with service discovery, and operational concerns for maintaining availability and performance at scale.
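The consistent hashing algorithm mentioned above can be sketched with a toy hash ring. This is a minimal illustration under stated assumptions, not a production implementation; the node and key names are invented. The key property it demonstrates: removing one server only remaps the keys that server owned.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring with virtual nodes for smoother key spread."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = {}           # ring position -> node name
        self.sorted_keys = []    # sorted ring positions
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Place several virtual nodes per server so load spreads evenly.
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self.ring[pos] = node
            bisect.insort(self.sorted_keys, pos)

    def remove(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            del self.ring[pos]
            self.sorted_keys.remove(pos)

    def get(self, key):
        # A key maps to the first ring position at or after its hash,
        # wrapping around at the end of the ring.
        pos = self._hash(key)
        idx = bisect.bisect(self.sorted_keys, pos) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = HashRing(["app-1", "app-2", "app-3"])
before = {k: ring.get(k) for k in (f"session-{n}" for n in range(1000))}
ring.remove("app-2")                      # simulate one backend failing
after = {k: ring.get(k) for k in before}
moved = sum(1 for k in before if before[k] != after[k])
# Only keys that lived on app-2 move; every other key keeps its backend.
```

Contrast this with naive modulo hashing (`hash(key) % num_servers`), where removing one server remaps almost every key, which would invalidate nearly all sticky sessions or cache entries at once.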
Network Monitoring and Observability
Covers strategies and tooling for observing network health and performance. Topics include active health checks versus passive telemetry; what to measure at the interface and flow level; flow-based telemetry such as NetFlow and sFlow, and export formats such as Internet Protocol Flow Information Export (IPFIX); Simple Network Management Protocol (SNMP) based metrics; metrics hierarchy and granularity; retention and aggregation considerations; alerting strategy to manage signal-to-noise and avoid alert fatigue; dashboards and status pages; runbooks and incident playbooks; topology and capacity planning; and common observability platforms and integrations such as Prometheus, the Elastic stack, and Splunk, or cloud-native alternatives. Interviews evaluate the ability to design what to monitor, how to collect it, and how to turn telemetry into reliable operational signals.
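One common way to manage alerting signal-to-noise is a "sustained breach" rule, in the spirit of the `for:` duration in Prometheus alerting rules: fire only when a metric stays over threshold for several consecutive evaluation windows, so transient spikes do not page anyone. A minimal sketch (the sample values and thresholds are invented):

```python
from collections import deque

def sustained_alert(samples, threshold, for_windows):
    """Return, per sample, whether an alert fires: True only when the
    last `for_windows` samples were ALL above `threshold`."""
    window = deque(maxlen=for_windows)   # rolling view of recent breaches
    fired = []
    for value in samples:
        window.append(value > threshold)
        fired.append(len(window) == for_windows and all(window))
    return fired

# Error-rate samples per evaluation interval: one isolated spike at index 1
# is ignored; only the run of three consecutive breaches (indices 3-5) fires.
samples = [0.1, 0.9, 0.1, 0.8, 0.9, 0.95, 0.2]
flags = sustained_alert(samples, threshold=0.5, for_windows=3)
```

The trade-off is detection latency: requiring three windows delays the page by up to three evaluation intervals, which is usually acceptable in exchange for far fewer false alarms.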
Technical Vision and Infrastructure Roadmap
This topic assesses a candidate's ability to define a multi-year technical vision for infrastructure, platforms, and systems and to translate that vision into a practical execution roadmap. Core skills include evaluating technology choices and architecture evolution, planning migration and modernization paths, anticipating scalability and capacity needs, and balancing cost and performance with resilience and operational reliability. Candidates should demonstrate approaches to managing technical debt, sequencing investments across quarters and releases, estimating resources and timelines, establishing measurable infrastructure goals and key performance indicators, and implementing governance and standards. Discussion may also cover reliability and observability, security and compliance considerations, trade-offs between short-term stability and long-term rearchitecture, prioritization to enable business outcomes, and communicating technical trade-offs to both technical and non-technical stakeholders.
Transport Layer Protocols
Comprehensive understanding of transport layer protocols, primarily Transmission Control Protocol (TCP) and User Datagram Protocol (UDP), and related protocols used for diagnostics such as Internet Control Message Protocol (ICMP). Candidates should be able to explain TCP as a connection-oriented, reliable, ordered, and flow-controlled protocol, including the three-way handshake for connection establishment, the four-way connection teardown, retransmission and timeout behavior, and high-level congestion control and flow control mechanisms. Describe the TCP header structure and the key fields used for reliability and ordering. Explain UDP as a connectionless, best-effort, lower-latency protocol, its datagram model, simple header structure, and trade-offs for reliability and ordering. Give real-world use cases and justify protocol choice, for example reliable file transfer and web traffic versus low-latency streaming, real-time voice, and many DNS queries. Discuss port numbers and common service ports such as HTTP port 80, HTTPS port 443, DNS port 53, SSH port 22, and SMTP port 25, and how sockets and ports map to endpoints. Cover practical topics such as when applications fall back from UDP to TCP (for example, DNS responses too large for a single UDP datagram), how fragmentation and packet loss affect each protocol, and the role of ICMP in network diagnostics and error reporting.
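The connection-oriented versus connectionless distinction is easy to see with the standard socket API. Below is a minimal sketch of TCP and UDP echo exchanges over loopback; it is illustrative rather than production code (no timeouts or error handling), and the OS picks ephemeral ports.

```python
import socket
import threading

def tcp_echo_server(sock):
    conn, _ = sock.accept()           # completes the three-way handshake
    data = conn.recv(1024)            # reads from an ordered byte stream
    conn.sendall(data)
    conn.close()                      # starts the four-way teardown

def udp_echo_server(sock):
    data, addr = sock.recvfrom(1024)  # one datagram; no connection state
    sock.sendto(data, addr)

# TCP: a connection must exist before any data flows.
tsrv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tsrv.bind(("127.0.0.1", 0))
tsrv.listen(1)
t1 = threading.Thread(target=tcp_echo_server, args=(tsrv,))
t1.start()
tcli = socket.create_connection(tsrv.getsockname())
tcli.sendall(b"hello")
tcp_reply = tcli.recv(1024)
tcli.close()
t1.join()
tsrv.close()

# UDP: each datagram is sent independently, addressed per call.
usrv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
usrv.bind(("127.0.0.1", 0))
t2 = threading.Thread(target=udp_echo_server, args=(usrv,))
t2.start()
ucli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ucli.sendto(b"hello", usrv.getsockname())
udp_reply, _ = ucli.recvfrom(1024)
ucli.close()
t2.join()
usrv.close()
```

Note the asymmetry: the TCP client cannot send a byte until `create_connection` returns (handshake done), while the UDP client simply fires a datagram with no guarantee it arrives, is not duplicated, or arrives in order.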
Multi-Region and Multi-Cloud Resilience
Designing systems that work across multiple geographic regions or cloud providers. This addresses the highest reliability requirements and provides protection against provider-level failures. At the senior level, understand data replication across regions, latency implications, consistency trade-offs, and the cost of multi-region deployments. Design routing policies that direct traffic to healthy regions. Address compliance requirements that may mandate geographic distribution.
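A routing policy that prefers the lowest-latency healthy region, in the spirit of DNS latency-based routing with health-check failover, can be sketched as follows. The region names, latency figures, and health states are hypothetical:

```python
# Hypothetical region table, as a health-checking control plane might see it.
REGIONS = [
    {"name": "us-east-1", "latency_ms": 20, "healthy": True},
    {"name": "eu-west-1", "latency_ms": 85, "healthy": True},
    {"name": "ap-south-1", "latency_ms": 190, "healthy": True},
]

def route(regions):
    """Pick the lowest-latency healthy region for this client."""
    candidates = [r for r in regions if r["healthy"]]
    if not candidates:
        # Total outage across regions: fall back to a degraded/static response.
        raise RuntimeError("no healthy regions")
    return min(candidates, key=lambda r: r["latency_ms"])["name"]

primary = route(REGIONS)       # nearest healthy region wins
REGIONS[0]["healthy"] = False  # health checks mark the primary region down
failover = route(REGIONS)      # traffic shifts to the next-best region
```

In practice the health signal and latency measurements come from an external system (DNS health checks, global load balancer probes), and consistency of the data plane across regions is the harder problem than the routing decision itself.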
Large-Scale Infrastructure Challenges
Awareness of engineering and operational challenges at massive scale, including global network optimization, multi-region failover and redundancy, integration of cloud and on-premises systems, security and compliance at scale, performance and latency for a global user base, cost optimization across large fleets, and maintaining reliability without exponential growth in operational complexity. Candidates should demonstrate thinking about architecture patterns, trade-offs, monitoring and incident response at scale, and strategies for evolving platform capabilities as load and feature sets grow.
Capacity Planning and Resource Optimization
Covers forecasting, provisioning, and operating compute, memory, storage, and network resources efficiently to meet demand and service level objectives. Key skills include monitoring resource utilization metrics such as central processing unit (CPU) usage, memory consumption, storage input/output, and network throughput; analyzing historical trends and workload patterns to predict future demand; and planning capacity additions, safety margins, and buffer sizing. Candidates should understand vertical versus horizontal scaling, autoscaling policy design and cooldowns, right-sizing instances or containers, workload placement and isolation, load balancing algorithms, and the use of spot or preemptible capacity for interruptible workloads. Practical topics include storage planning and archival strategies, database memory tuning and buffer sizing, batching and off-peak processing, model compression and inference optimization for machine learning workloads, alerts and dashboards, stress and validation testing of planned changes, and methods to verify that capacity decisions meet both performance and cost objectives.
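The trend-analysis piece of capacity planning can be shown with a back-of-envelope forecast: fit a linear trend to historical peak utilization and estimate when the fleet will eat through its safety margin. This is a sketch under simplifying assumptions (linear growth, fixed capacity); all figures are hypothetical.

```python
def forecast_breach(history, capacity, headroom=0.2):
    """Given peak usage per period, return how many periods remain until
    usage crosses capacity * (1 - headroom), assuming the recent linear
    trend continues. Returns None if demand is flat or shrinking."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    # Ordinary least-squares slope of usage against period index.
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in enumerate(history))
    slope /= sum((x - mean_x) ** 2 for x in range(n))
    if slope <= 0:
        return None
    limit = capacity * (1 - headroom)   # keep 20% headroom by default
    periods = 0
    usage = history[-1]
    while usage < limit:
        usage += slope
        periods += 1
    return periods

# Peak usage growing ~50 units/week against 1000 units of capacity:
# 800 is the 80% headroom limit, so the fleet crosses it in 2 weeks.
weeks_left = forecast_breach([500, 550, 600, 650, 700], capacity=1000)
```

A real forecast would account for seasonality and growth inflections, and the answer feeds the provisioning lead time question: if new capacity takes four weeks to land and the breach is two weeks out, the order is already late.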