Covers strategies and tooling for observing network health and performance. Topics include active health checks versus passive telemetry; what to measure at the interface and flow level; flow-based telemetry such as NetFlow and sFlow, and export formats such as Internet Protocol Flow Information Export (IPFIX); Simple Network Management Protocol (SNMP) based metrics; metrics hierarchy and granularity; retention and aggregation considerations; alerting strategy to manage signal-to-noise ratio and avoid alert fatigue; dashboards and status pages; runbooks and incident playbooks; topology and capacity planning; and common observability platforms and integrations such as Prometheus, the Elastic Stack, and Splunk, or cloud-native alternatives. Interviews evaluate the ability to design what to monitor, how to collect it, and how to turn telemetry into reliable operational signals.
Medium | Technical
Prometheus is struggling with high cardinality caused by per-flow labels on your network metrics. Describe the concrete strategies you would apply: relabeling to drop labels, recording rules for aggregates, a separate high-cardinality store, or offloading flow-level telemetry to a different system. For each strategy, explain the pros and cons.
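As a starting point for the aggregation strategy, the sketch below shows in plain Python what dropping per-flow labels does to the series count. It mimics the effect of a PromQL `sum without (src_ip, dst_ip)` recording rule; the sample data, label names, and helper functions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical per-flow samples: (labels, bytes) pairs as a flow exporter might emit.
samples = [
    ({"device": "edge1", "interface": "eth0", "src_ip": "10.0.0.1", "dst_ip": "10.0.1.9"}, 1200),
    ({"device": "edge1", "interface": "eth0", "src_ip": "10.0.0.2", "dst_ip": "10.0.1.9"}, 800),
    ({"device": "edge1", "interface": "eth1", "src_ip": "10.0.0.1", "dst_ip": "10.0.2.4"}, 500),
    ({"device": "edge2", "interface": "eth0", "src_ip": "10.0.0.3", "dst_ip": "10.0.1.9"}, 300),
]

# Assumption: these are the labels whose value space is effectively unbounded.
HIGH_CARDINALITY_LABELS = {"src_ip", "dst_ip"}


def series_count(samples):
    """Number of distinct label sets, i.e. the number of time series."""
    return len({tuple(sorted(labels.items())) for labels, _ in samples})


def aggregate_without(samples, drop_labels):
    """Sum values after removing drop_labels from every label set."""
    totals = defaultdict(int)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[key] += value
    return dict(totals)


aggregated = aggregate_without(samples, HIGH_CARDINALITY_LABELS)
print("series before:", series_count(samples))
print("series after: ", len(aggregated))
for key, total in sorted(aggregated.items()):
    print(dict(key), total)
```

The trade-off is visible in the output: the aggregate keeps per-interface totals but can no longer answer which source address produced the traffic, which is why the per-flow detail usually has to live in a separate system.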
Hard | Technical
Create a detailed alerting policy for a critical network fabric SLO: 99.99% of packets delivered within 100 ms. Define detection windows and aggregation rules, alert severity levels tied to error budget burn, notification and escalation flows, automated mitigation steps, and how SREs should respond at each severity.
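One way to tie severity to error budget burn is a multi-window burn-rate check. The sketch below assumes a "bad" packet is one that is lost or delivered after 100 ms, and it borrows the commonly cited 14.4x / 6x / 1x thresholds with paired long and short windows. The tier names, window labels, and thresholds are illustrative assumptions, not values given in the question.

```python
SLO_TARGET = 0.9999
ERROR_BUDGET = 1 - SLO_TARGET  # allowed fraction of bad packets (about 0.0001)

# Assumed tiers: (severity, burn-rate threshold, long window, short window).
TIERS = [
    ("critical", 14.4, "1h", "5m"),
    ("high", 6.0, "6h", "30m"),
    ("ticket", 1.0, "3d", "6h"),
]


def burn_rate(bad_packets, total_packets):
    """Observed error ratio divided by the error budget; 1.0 means on-budget."""
    if total_packets == 0:
        return 0.0
    return (bad_packets / total_packets) / ERROR_BUDGET


def classify(rates):
    """Return the first severity whose long AND short windows both exceed the threshold.

    `rates` maps a window label such as "1h" to the burn rate measured over it.
    Requiring both windows keeps a short spike from paging and lets the alert
    clear soon after the problem stops.
    """
    for severity, threshold, long_window, short_window in TIERS:
        if rates.get(long_window, 0.0) >= threshold and rates.get(short_window, 0.0) >= threshold:
            return severity
    return "ok"


# Example: 20 bad packets per 10,000 over the last hour, 30 per 10,000 over 5 minutes.
rates = {"1h": burn_rate(20, 10_000), "5m": burn_rate(30, 10_000)}
print(rates, classify(rates))
```

With a 99.99% target the budget is so small that 20 bad packets in 10,000 already burns it at roughly 20 times the sustainable rate, which is why the aggregation window and packet volume matter as much as the thresholds.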
Medium | Technical
Describe how you would use eBPF to collect network telemetry from Linux hosts. Cover typical attach points (XDP, tc, kprobes), what per-packet or socket-level metrics you can obtain, how to aggregate in-kernel vs userspace, and limitations such as kernel compatibility and overhead.
Hard | System Design
Architect a global network observability platform for a cloud provider with these requirements: 10,000 devices, ingest of 50M flow records/sec, real-time alerting, and 1-second resolution for 7 days followed by 1-minute resolution retained for 1 year. Describe collectors, transport, stream processing, storage tiers, query patterns, HA, and cost controls.
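A back-of-envelope sizing pass helps frame the storage tiers before choosing technologies. The sketch below uses the stated requirements plus two assumptions that are not in the question: an average stored record size of 50 bytes, and a 1-minute tier that spans a full 365 days.

```python
FLOWS_PER_SEC = 50_000_000        # from the requirements
DEVICES = 10_000                  # from the requirements
SECONDS_PER_DAY = 86_400
ASSUMED_BYTES_PER_RECORD = 50     # hypothetical average stored size per flow record

# Ingest rate.
ingest_bytes_per_sec = FLOWS_PER_SEC * ASSUMED_BYTES_PER_RECORD
ingest_gbit_per_sec = ingest_bytes_per_sec * 8 / 1e9
avg_flows_per_device = FLOWS_PER_SEC // DEVICES

# Raw flow volume if every record were kept for the 7-day hot window.
records_per_day = FLOWS_PER_SEC * SECONDS_PER_DAY
raw_7d_bytes = records_per_day * 7 * ASSUMED_BYTES_PER_RECORD

# Points per aggregated time series in each retention tier.
points_hot = 7 * SECONDS_PER_DAY      # 1-second resolution for 7 days
points_cold = 365 * 24 * 60           # 1-minute resolution, assumed to span 365 days

print(f"ingest: {ingest_bytes_per_sec / 1e9:.1f} GB/s ({ingest_gbit_per_sec:.0f} Gbit/s)")
print(f"average flows/sec per device: {avg_flows_per_device}")
print(f"records per day: {records_per_day:.3e}")
print(f"raw 7-day volume: {raw_7d_bytes / 1e15:.3f} PB")
print(f"points per series: hot={points_hot}, cold={points_cold}")
```

Under these assumptions, one week of 1-second points per series is larger than a full year of 1-minute points, so the hot tier tends to dominate cost and is the natural target for sampling, pre-aggregation, and compression.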
Hard | Technical
Design an anomaly-detection alerting pipeline to reduce false positives in network telemetry. Cover feature engineering (bytes/sec, flows/sec, packet drops, percent change), model choice (unsupervised, such as isolation forest, versus supervised), training data requirements, deployment for online inference, the feedback loop for labels, and handling concept drift.
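Before reaching for a learned model, a simple statistical baseline is useful for comparison. The sketch below computes a percent-change feature and flags outliers with a median/MAD modified z-score. This is a deliberately simpler stand-in, not an isolation forest; the 0.6745 scaling constant and 3.5 cutoff follow the conventional modified z-score rule, and the sample series is made up.

```python
from statistics import median


def pct_change(series):
    """Relative change between consecutive samples; 0.0 when the previous value is 0."""
    return [(cur - prev) / prev if prev else 0.0 for prev, cur in zip(series, series[1:])]


def robust_z_scores(values):
    """Modified z-scores based on the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to measure against; treat every point as normal.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Indices whose modified z-score magnitude exceeds the threshold."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]


# Hypothetical bytes/sec samples for one interface, with a single spike.
bytes_per_sec = [100, 102, 98, 101, 99, 500, 100, 97]
print("pct change:", [round(p, 3) for p in pct_change(bytes_per_sec)])
print("anomalous indices:", flag_anomalies(bytes_per_sec))
```

Median-based statistics are used here because a single large spike would inflate a mean and standard deviation enough to hide itself. A production pipeline would compute these over a sliding window per series rather than over the whole history.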