Big Data Technologies Stack Questions
Overview of big data tooling used for data ingestion, processing, and analytics at scale: Apache Spark, Hadoop ecosystem components (HDFS, MapReduce, YARN), data lake architectures, streaming and batch processing, and cloud-based data platforms. Covers data processing paradigms, distributed storage and compute, data quality, and best practices for building robust data pipelines and analytics infrastructure.
Easy · Technical
Explain the difference between batch and streaming data processing paradigms. For each, describe typical SLAs, latency and throughput trade-offs, complexity of state management, and an example use-case where batch is preferable and one where streaming is required.
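A minimal sketch of the core distinction, assuming a toy per-user event count (the names `batch_count` and `StreamingCounter` are illustrative, not from any framework): the batch job recomputes the aggregate from the full dataset each run, while the streaming operator maintains incremental state and updates it per event.

```python
from collections import defaultdict

# Batch: recompute the aggregate from the full dataset on each run.
# Simple and easy to reprocess, but results are only as fresh as the last run.
def batch_count(events):
    counts = defaultdict(int)
    for user_id, _payload in events:
        counts[user_id] += 1
    return dict(counts)

# Streaming: keep incremental state and update it one event at a time.
# Low latency, but the operator must manage (and checkpoint) its state.
class StreamingCounter:
    def __init__(self):
        self.state = defaultdict(int)

    def on_event(self, user_id, _payload):
        self.state[user_id] += 1

events = [("a", 1), ("b", 2), ("a", 3)]
sc = StreamingCounter()
for e in events:
    sc.on_event(*e)

# Both paradigms converge to the same answer; they differ in when it is available
# and in how much state management the pipeline must take on.
assert batch_count(events) == dict(sc.state) == {"a": 2, "b": 1}
```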
Medium · Technical
Technical coding (PySpark): Implement sessionization using the PySpark DataFrame API. Given the schema (user_id STRING, event_ts TIMESTAMP, event_type STRING), group events into sessions, starting a new session whenever a user is inactive for more than 30 minutes. Output: (user_id, session_id, session_start, session_end, event_count). Provide code, explain correctness, and discuss complexity and state requirements.
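The gap-based algorithm behind this question can be sketched in plain Python as a single-machine reference (the distributed PySpark answer would express the same logic with window functions: lag over event_ts per user, a session-boundary flag, and a running sum as the session id; the function name `sessionize` here is illustrative):

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)

def sessionize(events):
    """events: iterable of (user_id, event_ts). Returns rows of
    (user_id, session_id, session_start, session_end, event_count),
    where a gap longer than GAP starts a new session."""
    by_user = {}
    for uid, ts in sorted(events):  # sort by (user_id, event_ts)
        by_user.setdefault(uid, []).append(ts)

    rows = []
    for uid, stamps in by_user.items():
        sid, start, end, count = 0, stamps[0], stamps[0], 1
        for ts in stamps[1:]:
            if ts - end > GAP:
                # Close the current session and open a new one.
                rows.append((uid, f"{uid}-{sid}", start, end, count))
                sid += 1
                start, count = ts, 0
            end = ts
            count += 1
        rows.append((uid, f"{uid}-{sid}", start, end, count))
    return rows

t = lambda h, m: datetime(2024, 1, 1, h, m)
out = sessionize([("u", t(10, 0)), ("u", t(10, 10)), ("u", t(11, 0))])
assert out == [
    ("u", "u-0", t(10, 0), t(10, 10), 2),  # 10-minute gap: same session
    ("u", "u-1", t(11, 0), t(11, 0), 1),   # 50-minute gap: new session
]
```

The reference is O(n log n) in events per user due to sorting; the PySpark version carries the same per-user ordering cost as a shuffle plus sort within partitions.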
Medium · Technical
Explain schema evolution strategies for Avro and Parquet in a data lake. How do you handle added or removed fields, default values, backward/forward compatibility, and integration with a schema registry? Discuss the implications for existing consumers and compaction jobs.
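A toy illustration of the reader-side resolution that makes an added field backward compatible (this mimics what an Avro library does with declared defaults; the field list and the `resolve` helper are hypothetical, not a real Avro API):

```python
# New reader schema: "country" was added with a default, so records written
# under the old schema (without it) remain readable.
reader_fields = [
    {"name": "user_id", "type": "string"},
    {"name": "event_type", "type": "string"},
    {"name": "country", "type": ["null", "string"], "default": None},
]

def resolve(record, fields):
    """Fill in declared defaults for fields the writer did not emit;
    an added field with no default breaks backward compatibility."""
    missing = [f["name"] for f in fields
               if f["name"] not in record and "default" not in f]
    if missing:
        raise ValueError(f"no default for added field(s): {missing}")
    return {f["name"]: record.get(f["name"], f.get("default")) for f in fields}

old_record = {"user_id": "u1", "event_type": "click"}
assert resolve(old_record, reader_fields) == {
    "user_id": "u1", "event_type": "click", "country": None,
}
```

This is why schema registries enforce compatibility modes: an added field is only safe for existing data if it carries a default, and a removed field is only safe for old readers if they had a default for it.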
Hard · Technical
You observe a Parquet table with hundreds of thousands of small files, causing slow queries and metadata overhead. Propose a plan to fix the small-files problem: compaction strategies, tuning writers to produce optimal file sizes, scheduling compaction jobs, and mitigations to avoid impacting SLAs during compaction.
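One piece of such a plan can be sketched as a compaction planner: first-fit-decreasing bin packing that batches small files into rewrite groups near a target file size (the 128 MB target and the `plan_compaction` name are illustrative assumptions; real compaction in Spark would then rewrite each batch, e.g. via repartitioning, or be delegated to a table format's built-in compaction):

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024):
    """First-fit-decreasing bin packing: group small files into rewrite
    batches whose combined size stays at or under the target file size."""
    batches = []
    for size in sorted(file_sizes, reverse=True):
        for batch in batches:
            if sum(batch) + size <= target_bytes:
                batch.append(size)
                break
        else:
            batches.append([size])  # no batch has room: start a new one
    return batches

mb = 1024 * 1024
sizes = [100 * mb, 20 * mb, 20 * mb, 5 * mb, 3 * mb]
batches = plan_compaction(sizes)
assert len(batches) == 2
# Every batch fits under the target, and no file is lost or duplicated.
assert all(sum(b) <= 128 * mb for b in batches)
assert sorted(s for b in batches for s in b) == sorted(sizes)
```

Running the planner per partition keeps each rewrite small and interruptible, which helps schedule compaction in off-peak windows without violating query SLAs.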
Hard · Technical
Describe a safe schema migration strategy for production data warehouse tables that must support breaking changes (column removals, type changes) without downtime for downstream consumers. Include steps such as dual-writing, feature flags, consumer migrations, backfills, compatibility checks, and automated verification.
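The automated-verification step of such a migration can be sketched as a parity check between the legacy table and its dual-written replacement, applying the expected transform (e.g. a type change) to each legacy row before comparing (`verify_parity` and the amount-to-cents example are hypothetical, not from any tool):

```python
def verify_parity(old_rows, new_rows, key, migrate):
    """Compare the legacy table against its migrated replacement.
    `migrate` applies the intended transform to a legacy row; any
    missing, extra, or mismatched keys indicate an unsafe cutover."""
    old = {r[key]: migrate(r) for r in old_rows}
    new = {r[key]: r for r in new_rows}
    return {
        "missing": sorted(set(old) - set(new)),
        "extra": sorted(set(new) - set(old)),
        "mismatched": sorted(k for k in old.keys() & new.keys()
                             if old[k] != new[k]),
    }

# Example breaking change: "amount" migrates from a decimal string to integer cents.
legacy = [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "3.00"}]
migrated = [{"id": 1, "amount": 1050}, {"id": 2, "amount": 300}]
to_cents = lambda r: {"id": r["id"],
                      "amount": int(round(float(r["amount"]) * 100))}

report = verify_parity(legacy, migrated, "id", to_cents)
assert report == {"missing": [], "extra": [], "mismatched": []}
```

An empty report for a full backfill window is the signal that consumers can be flipped to the new table; any non-empty bucket blocks the cutover behind the feature flag.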