InterviewStack.io

Real Time and Batch Ingestion Questions

Focuses on choosing between batch ingestion and real-time streaming for moving data from sources into storage and downstream systems. Topics include latency and throughput requirements, cost and operational complexity, consistency and delivery semantics (at-least-once and exactly-once), idempotency and deduplication strategies, schema evolution, connector and source considerations, backpressure and buffering, checkpointing and state management, and tooling choices for streaming and batch. Candidates should be able to design hybrid architectures that combine streaming for low-latency needs with batch pipelines for large backfills or heavy aggregations, and explain operational trade-offs such as monitoring, scaling, failure recovery, and debugging.

Hard · Technical
For a model that requires joining two high-cardinality streams in real time, propose join strategies, state sharding schemes, windowing choices, and skew mitigation techniques. Discuss trade-offs between correctness, latency, and resource usage.
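There is no single right answer here, but the core mechanics can be sketched. Below is a minimal, illustrative Python sketch of one common strategy: a symmetric hash join over tumbling windows, with state partitioned across shards by key hash. In a real engine such as Flink, sharding, checkpointing, and window expiry are handled by the runtime; the shard count, window size, and event shape here are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4      # assumed shard count; real engines derive this from parallelism
WINDOW_MS = 60_000  # tumbling 1-minute windows (assumption)

def shard_for(key: str) -> int:
    """Hash-partition keys so each shard owns a disjoint slice of join state."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

class ShardedWindowJoin:
    """Symmetric hash join: buffer each side per (shard, window, key),
    and probe the opposite side's buffer on every arrival."""

    def __init__(self):
        # state[shard][window] -> {"L": key -> events, "R": key -> events}
        self.state = [defaultdict(lambda: {"L": defaultdict(list),
                                           "R": defaultdict(list)})
                      for _ in range(NUM_SHARDS)]

    def on_event(self, side: str, key: str, ts_ms: int, value):
        window = ts_ms // WINDOW_MS
        bucket = self.state[shard_for(key)][window]
        other = "R" if side == "L" else "L"
        # Probe the opposite buffer first, then insert: each pair is emitted once.
        matches = [(value, v) if side == "L" else (v, value)
                   for v in bucket[other][key]]
        bucket[side][key].append(value)
        return matches

    def expire(self, watermark_ms: int):
        """Drop windows entirely below the watermark to bound state size."""
        cutoff = watermark_ms // WINDOW_MS
        for shard in self.state:
            for w in [w for w in shard if w < cutoff]:
                del shard[w]
```

The trade-offs the question asks about fall out of this shape: larger windows and later watermarks improve join completeness but grow state and delay output, and hot keys defeat uniform sharding, so skew is typically mitigated by salting a hot key across several sub-shards and merging results downstream.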
Medium · Technical
How do you ensure data lineage, governance, and reproducibility for pipelines that combine streaming ingestion and batch ETL? Describe metadata tracking, schema/versioning, access controls, and how replayability is supported for model retraining and audits.
Hard · Technical
Describe the operational runbooks and SLOs you would define for a mission-critical ingestion pipeline that feeds real-time ML serving. Include incident-response steps for consumer lag, corrupt events, broker failures, rollbacks, acceptable data-loss tolerances, and RCA procedures to prevent recurrence.
Easy · Technical
Implement a simple Python Kafka consumer that writes events into a PostgreSQL table idempotently. Assume events have fields 'event_id' (unique), 'user_id', and 'payload'. In pseudo-code show how you would ensure duplicates are not double-inserted when the consumer retries. Mention transaction boundaries and offset commit strategy (use psycopg2 and a Kafka consumer API).
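A reference sketch of the idempotent-write pattern this question is after. To keep it self-contained and runnable, it substitutes sqlite3 for PostgreSQL/psycopg2 and a plain list for the Kafka poll loop; the same shape applies in production with psycopg2's connection transaction and the consumer's manual offset commit. The table name and event fields come from the question; everything else is an assumption.

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    # The UNIQUE constraint on event_id is what makes the insert idempotent.
    db.execute("""CREATE TABLE events (
                      event_id TEXT PRIMARY KEY,
                      user_id  TEXT,
                      payload  TEXT)""")
    return db

def process_batch(db, events, commit_offset):
    """Insert one polled batch, then commit the Kafka offset.

    Order matters: the DB transaction commits BEFORE the offset commit, so a
    crash in between redelivers the batch, and ON CONFLICT DO NOTHING makes
    the redelivery a no-op (at-least-once delivery + idempotent sink)."""
    with db:  # one DB transaction per batch
        for ev in events:
            db.execute(
                "INSERT INTO events (event_id, user_id, payload) "
                "VALUES (?, ?, ?) ON CONFLICT(event_id) DO NOTHING",
                (ev["event_id"], ev["user_id"], ev["payload"]))
    commit_offset()  # e.g. consumer.commit() with auto-commit disabled
```

Replaying the same batch (a simulated consumer retry) leaves exactly one row per event_id. A stricter alternative is to store the consumed offset in the same database transaction as the rows, which gives effectively exactly-once processing at the cost of coupling offset management to the sink.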
Medium · Technical
Compare Kafka Streams, Apache Flink, and Spark Structured Streaming for computing ML features in production. Focus on latency, state management, exactly-once support, windowing semantics, and operational complexity for high-cardinality state and stream-stream joins.
