Data Manipulation and Transformation Questions

Covers techniques and best practices for cleaning, transforming, and preparing data for analysis and production systems. Candidates should be able to handle missing values, duplicates, and inconsistencies, apply normalization and denormalization, manage data types and casting, and write validation checks. Expect discussion of robust code that handles edge cases such as empty datasets and null values, defensive data validation, unit and integration testing for transformations, and strategies for performance and memory efficiency. At more senior levels, expect questions on designing scalable, debuggable, and maintainable data pipelines and transformation architectures, including idempotency, schema evolution, batch versus streaming trade-offs, observability and monitoring, versioning and reproducibility, and tool selection such as SQL, pandas, Spark, or dedicated ETL frameworks.

Easy | Technical

Describe best practices for handling empty datasets gracefully in scheduled transformations and reports. Cover defensive coding patterns, idempotency, example SQL checks to short-circuit logic, how to surface 'no data' to stakeholders (explicit empty-state vs previous data), and how to include tests for empty input in CI.
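
One way to approach the short-circuit and empty-state parts of this question is sketched below. It uses Python's built-in sqlite3 module as a stand-in for the warehouse, and the `events` and `daily_report` tables, the `status` column, and the function names are all hypothetical. The existence check skips the aggregation on empty input, the delete-then-insert inside one transaction makes re-runs idempotent, and a `no_data` row makes the empty state explicit instead of leaving older numbers in place.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the hypothetical source and report tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_date TEXT, amount REAL)")
    conn.execute(
        "CREATE TABLE daily_report (report_date TEXT PRIMARY KEY, total REAL, status TEXT)"
    )
    return conn


def build_daily_report(conn: sqlite3.Connection, report_date: str) -> dict:
    """Rebuild one day's report row; safe to re-run for the same date."""
    # Cheap existence check so the aggregation is skipped on empty input.
    has_rows = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM events WHERE event_date = ?)",
        (report_date,),
    ).fetchone()[0]

    with conn:  # single transaction: delete-then-insert keeps re-runs idempotent
        conn.execute("DELETE FROM daily_report WHERE report_date = ?", (report_date,))
        if not has_rows:
            # Explicit empty state: record that the job ran and found nothing.
            conn.execute(
                "INSERT INTO daily_report (report_date, total, status) "
                "VALUES (?, 0, 'no_data')",
                (report_date,),
            )
            return {"report_date": report_date, "total": 0, "status": "no_data"}

        total = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM events WHERE event_date = ?",
            (report_date,),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO daily_report (report_date, total, status) VALUES (?, ?, 'ok')",
            (report_date, total),
        )
        return {"report_date": report_date, "total": total, "status": "ok"}
```

A CI test for the empty case can call `build_daily_report` twice against a freshly created database and assert that exactly one `no_data` row exists afterwards.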

Medium | Technical

Product stakeholders ask for near real-time analytics. Compare batch (including micro-batch) and true streaming approaches across latency, implementation complexity, cost, completeness vs accuracy, data guarantees (exactly-once vs at-least-once), and operational burden. Recommend an approach for dashboards that tolerate up to 1 minute latency and explain trade-offs.
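
For a 1 minute latency budget, a micro-batch answer often comes down to tumbling-window aggregation keyed on event time. The following is a minimal sketch of that idea in plain Python, not a real streaming engine. The 60 second window, the `(epoch_seconds, value)` event shape, and both function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

WINDOW_SECONDS = 60  # assumed latency budget for the dashboard


def window_start(event_ts: int, size: int = WINDOW_SECONDS) -> int:
    """Floor an epoch-second timestamp to the start of its tumbling window."""
    return event_ts - (event_ts % size)


def aggregate_micro_batch(
    events: Iterable[Tuple[int, float]],
    state: Optional[Dict[int, float]] = None,
) -> Dict[int, float]:
    """Fold one micro-batch of (epoch_seconds, value) events into per-window sums.

    `state` carries the sums from earlier batches, so a late event still
    updates the window it belongs to rather than the window it arrived in.
    """
    totals: Dict[int, float] = defaultdict(float, state or {})
    for ts, value in events:
        totals[window_start(ts)] += value
    return dict(totals)
```

Passing the previous result back in as `state` shows the completeness versus latency trade-off: a window's total can still change after it was first displayed. The sketch also assumes each event is delivered once; under at-least-once delivery the sums would double count unless events are deduplicated by an ID first.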

Medium | Technical

How would you design unit tests and integration tests for data transformations implemented as SQL or pandas functions? Provide concrete examples: (a) a SQL test that asserts row counts and expected values on a small fixture table, and (b) a pandas unit test that verifies behavior when inputs contain nulls. Also describe strategies for test data management and CI integration.
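
Part (b) could be answered along the lines of the sketch below, written as plain pytest-style test functions. The transformation `add_net_amount` and the `amount` and `discount` columns are hypothetical; the point is that each test pins down one null-handling decision (a null discount counts as zero, a null amount stays null) plus the empty-input case.

```python
import pandas as pd


def add_net_amount(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with net_amount = amount - discount; a null discount counts as 0."""
    out = df.copy()
    out["net_amount"] = out["amount"] - out["discount"].fillna(0)
    return out


def test_null_discount_treated_as_zero():
    df = pd.DataFrame({"amount": [10.0, 20.0], "discount": [None, 5.0]})
    result = add_net_amount(df)
    assert result["net_amount"].tolist() == [10.0, 15.0]


def test_null_amount_stays_null():
    # A missing amount must not be silently turned into a number.
    df = pd.DataFrame({"amount": [None, 20.0], "discount": [1.0, 2.0]})
    result = add_net_amount(df)
    assert pd.isna(result.loc[0, "net_amount"])
    assert result.loc[1, "net_amount"] == 18.0


def test_empty_input_keeps_schema():
    # Zero rows in, zero rows out, with the output column still present.
    df = pd.DataFrame(
        {"amount": pd.Series(dtype="float64"), "discount": pd.Series(dtype="float64")}
    )
    result = add_net_amount(df)
    assert list(result.columns) == ["amount", "discount", "net_amount"]
    assert result.empty
```

Small inline fixtures like these keep each test readable on its own; larger shared fixtures are usually better kept as versioned files alongside the tests.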

Easy | Technical

Given a table 'orders' with schema orders(order_id bigint, customer_id bigint, order_date timestamp, amount decimal(10,2)), write a PostgreSQL query that finds customers who have multiple orders with the same order_date (considered duplicates). The query should return customer_id, order_date, count_of_orders, and an array/list of order_ids. Explain how the query handles NULL order_date and empty tables.
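
The grouping logic behind this question can be tried locally with the sketch below. It runs against SQLite through Python's sqlite3 module as a stand-in, so it is not the PostgreSQL answer itself: SQLite's `group_concat` is used where PostgreSQL would use `array_agg`, and the helper names are hypothetical. The NULL and empty-table behavior shown is the same in both databases: `GROUP BY` places all NULL `order_date` values for a customer in one group, and an empty table yields zero rows.

```python
import sqlite3

DUPLICATE_ORDERS_SQL = """
SELECT customer_id,
       order_date,
       COUNT(*)               AS count_of_orders,
       group_concat(order_id) AS order_ids  -- PostgreSQL: array_agg(order_id ORDER BY order_id)
FROM orders
GROUP BY customer_id, order_date
HAVING COUNT(*) > 1
"""


def make_orders_db(rows):
    """Build an in-memory orders table from (order_id, customer_id, order_date, amount) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(order_id INTEGER, customer_id INTEGER, order_date TEXT, amount NUMERIC)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    return conn


def find_duplicate_orders(conn):
    """Return (customer_id, order_date, count_of_orders, sorted order_ids) per duplicate group."""
    rows = conn.execute(DUPLICATE_ORDERS_SQL).fetchall()
    # group_concat returns a comma-separated string in unspecified order,
    # so parse it and sort the IDs for a stable result.
    return [
        (customer_id, order_date, count, sorted(int(i) for i in ids.split(",")))
        for customer_id, order_date, count, ids in rows
    ]
```

If orders with a NULL `order_date` should not be reported as duplicates of each other, adding `WHERE order_date IS NOT NULL` before the `GROUP BY` excludes them.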

Medium | System Design

You maintain a Parquet-based data lake consumed by multiple teams. Describe a strategy to handle schema evolution when producers add, remove, or rename columns. Discuss backward/forward compatibility, nullable defaults, column renames and migration plans, use of schema registries (Avro/Protobuf), and consumer-side defensive parsing.
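
Consumer-side defensive parsing could look roughly like the following pandas sketch, applied after a Parquet file has been read into a DataFrame. The expected schema, the rename map, and the function name are hypothetical. The function maps known old column names to their current names, fills columns the producer has not written with typed nulls, and ignores columns the consumer does not know about.

```python
import pandas as pd

# Hypothetical consumer contract: column name -> pandas nullable dtype.
EXPECTED_SCHEMA = {"user_id": "Int64", "email": "string", "signup_source": "string"}
# Hypothetical rename history: old producer name -> current name.
RENAMED_COLUMNS = {"mail": "email"}


def conform_to_schema(df, schema=EXPECTED_SCHEMA, renames=RENAMED_COLUMNS):
    """Return a new DataFrame with exactly the expected columns and dtypes."""
    # Apply a rename only when the old name is present and the new one is not,
    # so files written before and after the rename are both accepted.
    applicable = {
        old: new
        for old, new in renames.items()
        if old in df.columns and new not in df.columns
    }
    out = df.rename(columns=applicable)  # rename returns a new frame; df is untouched

    # Columns added to the contract later are missing from older files:
    # fill them with typed nulls instead of failing.
    for name, dtype in schema.items():
        if name not in out.columns:
            out[name] = pd.array([pd.NA] * len(out), dtype=dtype)

    # Selecting only contract columns drops unknown producer columns.
    return out[list(schema)].astype(schema)
```

Silently dropping unknown columns is a design choice that favors forward compatibility; a stricter consumer might log or count them so that unexpected producer changes are still visible.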
