Big Data Technologies Stack Questions
Overview of big data tooling used for data ingestion, processing, and analytics at scale: Apache Spark, Hadoop ecosystem components (HDFS, MapReduce, YARN), data lake architectures, streaming and batch processing, and cloud-based data platforms. Covers data processing paradigms, distributed storage and compute, data quality, and best practices for building robust data pipelines and analytics infrastructure.
Easy · Technical
Explain the difference between batch and streaming data processing paradigms. For each, describe typical SLAs, latency and throughput trade-offs, complexity of state management, and an example use-case where batch is preferable and one where streaming is required.
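A minimal sketch of the core distinction, assuming a toy per-user event count (the names `batch_count` and `StreamingCounter` are illustrative, not from any framework): the batch job recomputes the aggregate from the full dataset each run, while the streaming operator maintains incremental state and updates it per event.

```python
from collections import defaultdict

# Batch: recompute the aggregate from the full dataset on each run.
# Simple and easy to reprocess, but results are only as fresh as the last run.
def batch_count(events):
    counts = defaultdict(int)
    for user_id, _payload in events:
        counts[user_id] += 1
    return dict(counts)

# Streaming: keep incremental state and update it one event at a time.
# Low latency, but the operator must manage (and checkpoint) its state.
class StreamingCounter:
    def __init__(self):
        self.state = defaultdict(int)

    def on_event(self, user_id, _payload):
        self.state[user_id] += 1

events = [("a", 1), ("b", 2), ("a", 3)]
sc = StreamingCounter()
for e in events:
    sc.on_event(*e)

# Both paradigms converge to the same answer; they differ in when it is available
# and in how much state management the pipeline must take on.
assert batch_count(events) == dict(sc.state) == {"a": 2, "b": 1}
```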
Medium · Technical
Technical coding (PySpark): Implement sessionization using the PySpark DataFrame API. Given the schema (user_id STRING, event_ts TIMESTAMP, event_type STRING), group events into sessions, starting a new session whenever a user is inactive for more than 30 minutes. Output: (user_id, session_id, session_start, session_end, event_count). Provide code, explain correctness, and discuss complexity and state requirements.
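The gap-based algorithm behind this question can be sketched in plain Python as a single-machine reference (the distributed PySpark answer would express the same logic with window functions: lag over event_ts per user, a session-boundary flag, and a running sum as the session id; the function name `sessionize` here is illustrative):

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)

def sessionize(events):
    """events: iterable of (user_id, event_ts). Returns rows of
    (user_id, session_id, session_start, session_end, event_count),
    where a gap longer than GAP starts a new session."""
    by_user = {}
    for uid, ts in sorted(events):  # sort by (user_id, event_ts)
        by_user.setdefault(uid, []).append(ts)

    rows = []
    for uid, stamps in by_user.items():
        sid, start, end, count = 0, stamps[0], stamps[0], 1
        for ts in stamps[1:]:
            if ts - end > GAP:
                # Close the current session and open a new one.
                rows.append((uid, f"{uid}-{sid}", start, end, count))
                sid += 1
                start, count = ts, 0
            end = ts
            count += 1
        rows.append((uid, f"{uid}-{sid}", start, end, count))
    return rows

t = lambda h, m: datetime(2024, 1, 1, h, m)
out = sessionize([("u", t(10, 0)), ("u", t(10, 10)), ("u", t(11, 0))])
assert out == [
    ("u", "u-0", t(10, 0), t(10, 10), 2),  # 10-minute gap: same session
    ("u", "u-1", t(11, 0), t(11, 0), 1),   # 50-minute gap: new session
]
```

The reference is O(n log n) in events per user due to sorting; the PySpark version carries the same per-user ordering cost as a shuffle plus sort within partitions.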
Medium · Technical
Explain schema evolution strategies for Avro and Parquet in a data lake. How do you handle added or removed fields, default values, backward/forward compatibility, and integration with a schema registry? Discuss the implications for existing consumers and compaction jobs.
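A toy illustration of the reader-side resolution that makes an added field backward compatible (this mimics what an Avro library does with declared defaults; the field list and the `resolve` helper are hypothetical, not a real Avro API):

```python
# New reader schema: "country" was added with a default, so records written
# under the old schema (without it) remain readable.
reader_fields = [
    {"name": "user_id", "type": "string"},
    {"name": "event_type", "type": "string"},
    {"name": "country", "type": ["null", "string"], "default": None},
]

def resolve(record, fields):
    """Fill in declared defaults for fields the writer did not emit;
    an added field with no default breaks backward compatibility."""
    missing = [f["name"] for f in fields
               if f["name"] not in record and "default" not in f]
    if missing:
        raise ValueError(f"no default for added field(s): {missing}")
    return {f["name"]: record.get(f["name"], f.get("default")) for f in fields}

old_record = {"user_id": "u1", "event_type": "click"}
assert resolve(old_record, reader_fields) == {
    "user_id": "u1", "event_type": "click", "country": None,
}
```

This is why schema registries enforce compatibility modes: an added field is only safe for existing data if it carries a default, and a removed field is only safe for old readers if they had a default for it.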
Hard · Technical
You observe a Parquet table with hundreds of thousands of small files, causing slow queries and metadata overhead. Propose a plan to fix the small-files problem: compaction strategies, tuning writers to produce optimal file sizes, scheduling compaction jobs, and mitigations to avoid impacting SLAs during compaction.
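One piece of such a plan can be sketched as a compaction planner: first-fit-decreasing bin packing that batches small files into rewrite groups near a target file size (the 128 MB target and the `plan_compaction` name are illustrative assumptions; real compaction in Spark would then rewrite each batch, e.g. via repartitioning, or be delegated to a table format's built-in compaction):

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024):
    """First-fit-decreasing bin packing: group small files into rewrite
    batches whose combined size stays at or under the target file size."""
    batches = []
    for size in sorted(file_sizes, reverse=True):
        for batch in batches:
            if sum(batch) + size <= target_bytes:
                batch.append(size)
                break
        else:
            batches.append([size])  # no batch has room: start a new one
    return batches

mb = 1024 * 1024
sizes = [100 * mb, 20 * mb, 20 * mb, 5 * mb, 3 * mb]
batches = plan_compaction(sizes)
assert len(batches) == 2
# Every batch fits under the target, and no file is lost or duplicated.
assert all(sum(b) <= 128 * mb for b in batches)
assert sorted(s for b in batches for s in b) == sorted(sizes)
```

Running the planner per partition keeps each rewrite small and interruptible, which helps schedule compaction in off-peak windows without violating query SLAs.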
Hard · Technical
Describe a safe schema migration strategy for production data warehouse tables that must support breaking changes (column removals, type changes) without downtime for downstream consumers. Include steps such as dual-writing, feature flags, consumer migrations, backfills, compatibility checks, and automated verification.
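The automated-verification step of such a migration can be sketched as a parity check between the legacy table and its dual-written replacement, applying the expected transform (e.g. a type change) to each legacy row before comparing (`verify_parity` and the amount-to-cents example are hypothetical, not from any tool):

```python
def verify_parity(old_rows, new_rows, key, migrate):
    """Compare the legacy table against its migrated replacement.
    `migrate` applies the intended transform to a legacy row; any
    missing, extra, or mismatched keys indicate an unsafe cutover."""
    old = {r[key]: migrate(r) for r in old_rows}
    new = {r[key]: r for r in new_rows}
    return {
        "missing": sorted(set(old) - set(new)),
        "extra": sorted(set(new) - set(old)),
        "mismatched": sorted(k for k in old.keys() & new.keys()
                             if old[k] != new[k]),
    }

# Example breaking change: "amount" migrates from a decimal string to integer cents.
legacy = [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "3.00"}]
migrated = [{"id": 1, "amount": 1050}, {"id": 2, "amount": 300}]
to_cents = lambda r: {"id": r["id"],
                      "amount": int(round(float(r["amount"]) * 100))}

report = verify_parity(legacy, migrated, "id", to_cents)
assert report == {"missing": [], "extra": [], "mismatched": []}
```

An empty report for a full backfill window is the signal that consumers can be flipped to the new table; any non-empty bucket blocks the cutover behind the feature flag.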