InterviewStack.io

Data Manipulation and Transformation Questions

Encompasses techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values and duplicates, resolve inconsistencies, normalize and denormalize data, cast between types, and apply validation checks. Expect discussion of writing robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect questions on designing scalable, debuggable, and maintainable data pipelines and transformation architectures, covering idempotency, schema evolution, batch-versus-streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection such as SQL, pandas, Spark, or dedicated ETL frameworks.

Hard · System Design
Explain how to implement incremental backfills for computed aggregated tables when upstream corrections can arrive. Cover idempotent operations, partial re-computation strategies, checkpointing, and minimizing recompute costs (e.g., using change data capture and windowed re-aggregation).
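A strong answer here hinges on two ideas: corrections must touch only the partitions they affect, and rewriting a partition must be idempotent (a full overwrite, so replays are safe). The sketch below illustrates both with hypothetical in-memory stand-ins for a fact table and a per-day aggregate table; in production these would be warehouse tables fed by change data capture.

```python
def recompute_partitions(events, aggregates, corrected_events):
    """Apply late corrections, then idempotently rebuild only the
    affected date partitions of a per-day revenue aggregate.

    `events`, `aggregates`, and `corrected_events` are hypothetical
    in-memory stand-ins: facts are dicts with event_id, day, amount;
    aggregates map day -> total.
    """
    # 1. Merge corrections into the fact store, keyed by event_id.
    facts = {e["event_id"]: e for e in events}
    touched_days = set()
    for e in corrected_events:
        old = facts.get(e["event_id"])
        if old:
            # If the correction moved the event to another day, the old
            # partition must be rebuilt too.
            touched_days.add(old["day"])
        facts[e["event_id"]] = e
        touched_days.add(e["day"])
    # 2. Recompute only the touched partitions from scratch. A full
    # overwrite of each partition is idempotent: re-running this step
    # with the same inputs yields the same aggregates.
    for day in touched_days:
        aggregates[day] = sum(
            f["amount"] for f in facts.values() if f["day"] == day
        )
    return facts, aggregates, touched_days
```

Bounding recompute cost this way depends on partitioning the aggregate by the same key (event date) that corrections are scoped to; CDC then tells you exactly which partitions were touched.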
Medium · Technical
You maintain a Python data transformation that normalizes column names, casts types, and drops PII. Write a plan and example pytest unit tests for this transformation: include fixtures for small input DataFrames, edge cases (empty DataFrame, missing columns, nulls), and assertions to verify behavior and error handling.
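One possible shape for such a test plan, assuming a hypothetical `transform` that lowercases column names, coerces `amount` to numeric, and drops an `email` PII column (the exact columns are illustrative, not from the question):

```python
import pandas as pd
import pytest

def transform(df, pii_columns=("email",)):
    """Normalize column names, cast `amount` to numeric, drop PII columns.
    Raises ValueError if the required `amount` column is missing."""
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    if "amount" not in out.columns:
        raise ValueError("required column 'amount' is missing")
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out.drop(columns=[c for c in pii_columns if c in out.columns])

@pytest.fixture
def small_df():
    # Small, readable fixture covering a null amount and a PII column.
    return pd.DataFrame(
        {"User ID": [1, 2], "Amount": ["3.5", None], "Email": ["a@b.c", "d@e.f"]}
    )

def test_normalizes_names_and_drops_pii(small_df):
    result = transform(small_df)
    assert list(result.columns) == ["user_id", "amount"]

def test_null_amounts_become_nan(small_df):
    result = transform(small_df)
    assert result["amount"].isna().iloc[1]

def test_empty_dataframe_passes_through():
    empty = pd.DataFrame(columns=["Amount", "Email"])
    result = transform(empty)
    assert result.empty and "email" not in result.columns

def test_missing_required_column_raises():
    with pytest.raises(ValueError):
        transform(pd.DataFrame({"Email": ["x@y.z"]}))
```

The fixture keeps inputs tiny and legible; each test asserts one behavior, so a failure pinpoints which contract (naming, casting, PII removal, error handling) broke.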
Medium · Technical
Given transactional data (user_id, amount, occurred_at), write a SQL or pandas transform to produce a per-user summary table with columns: total_spend, last_purchase_date, avg_purchase_interval_days. Show sample input and expected output and describe edge-case handling (single purchase, null dates).
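A pandas sketch of one workable answer. The key edge cases: rows with null `occurred_at` are dropped up front, and a user with a single purchase has no interval to average, so `avg_purchase_interval_days` is NaN for them.

```python
import pandas as pd

def user_summary(tx: pd.DataFrame) -> pd.DataFrame:
    """Summarize transactions (user_id, amount, occurred_at) per user."""
    tx = tx.dropna(subset=["occurred_at"]).copy()
    tx["occurred_at"] = pd.to_datetime(tx["occurred_at"])
    tx = tx.sort_values(["user_id", "occurred_at"])

    def per_user(g):
        # diff() yields NaT for the first row, which mean() skips;
        # a single-purchase user therefore gets NaN, not 0.
        intervals = g["occurred_at"].diff().dt.days
        return pd.Series({
            "total_spend": g["amount"].sum(),
            "last_purchase_date": g["occurred_at"].max(),
            "avg_purchase_interval_days": intervals.mean(),
        })

    return tx.groupby("user_id").apply(per_user).reset_index()
```

Sample input: user 1 buys on Jan 1, 3, and 5 (amounts 10, 20, 30); user 2 buys once on Feb 1 (amount 5). Expected output: user 1 → total_spend 60.0, last_purchase_date 2024-01-05, avg interval 2.0 days; user 2 → total_spend 5.0, avg interval NaN.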
Easy · Technical
In SQL (Postgres), given a table users(id integer primary key, name text, email text), write a query to find rows where email is NULL or an empty string or contains only whitespace. Then write an update statement to normalize empty or whitespace emails to NULL so downstream ETL treats them consistently. Mention any caveats for different SQL dialects.
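The SQL itself is short; the sketch below runs it against an in-memory SQLite database purely so the queries are executable here. One dialect caveat worth raising in an answer: SQLite's `trim()` strips only space characters by default, whereas Postgres's `trim()`/`btrim()` also handle an explicit character list, and neither strips tabs or newlines unless told to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "a", "a@x.io"), (2, "b", ""), (3, "c", "   "), (4, "d", None)],
)

# Find rows with NULL, empty, or whitespace-only emails.
# (In Postgres: WHERE email IS NULL OR btrim(email) = '')
bad_ids = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM users WHERE email IS NULL OR trim(email) = ''"
    )
]

# Normalize empty/whitespace emails to NULL so downstream ETL has a
# single sentinel for "missing". The IS NOT NULL guard avoids pointless
# writes to rows that are already NULL.
conn.execute(
    "UPDATE users SET email = NULL "
    "WHERE email IS NOT NULL AND trim(email) = ''"
)
conn.commit()
```

After the update, only genuinely present emails remain non-NULL, so downstream code can test `email IS NULL` instead of three separate conditions.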
Medium · System Design
Design a deduplication strategy for streaming events produced with at-least-once semantics. Describe how you'd implement deduplication both in a streaming engine (e.g., Flink or Spark Structured Streaming) and as an offline batch job. Include use of event IDs, windowing, watermarking, and state TTL to bound memory.
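The core of any answer is keyed state (seen event IDs) whose size is bounded by a TTL tied to event time, which is what Flink's state TTL and Spark Structured Streaming's `dropDuplicates` with a watermark provide. A minimal single-process sketch of that idea, assuming roughly time-ordered arrivals so expiry can scan from the oldest entry (a real engine uses per-key timers instead):

```python
from collections import OrderedDict

class Deduplicator:
    """Dedup for at-least-once delivery, keyed by event_id, with state
    bounded by a TTL measured against an event-time watermark."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = OrderedDict()  # event_id -> event_time, oldest first
        self.watermark = float("-inf")

    def process(self, event_id, event_time):
        """Return True if the event should be emitted, False if dropped."""
        # Advance the watermark, then expire state older than the TTL so
        # memory stays bounded no matter how many events flow through.
        self.watermark = max(self.watermark, event_time)
        while self.seen:
            oldest_id, oldest_t = next(iter(self.seen.items()))
            if oldest_t < self.watermark - self.ttl:
                del self.seen[oldest_id]
            else:
                break
        if event_id in self.seen:
            return False  # duplicate within the TTL window: drop
        self.seen[event_id] = event_time
        return True
```

The trade-off to surface in the interview: a duplicate arriving later than the TTL slips through, so the TTL must exceed the producer's realistic redelivery delay; the offline batch job (e.g. `ROW_NUMBER() OVER (PARTITION BY event_id)` keeping row 1) then catches anything the streaming layer missed.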
