InterviewStack.io

Data Cleaning and Business Logic Edge Cases Questions

Covers handling data-centric edge cases and complex business-rule interactions in queries and data pipelines. Topics include cleaning and normalizing data, handling nulls and type mismatches, deduplication strategies, treating inconsistent or malformed records, validating results and detecting anomalies, using conditional logic for data transformation, understanding null semantics in SQL, and designing queries that correctly implement date boundaries and domain-specific business rules. The emphasis is on producing robust results in the presence of imperfect data and complex requirements.
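Two of the topics above, null semantics and date boundaries, are easy to get wrong in practice. The following is a minimal sketch using Python's built-in `sqlite3` module; the `orders` table, its columns, and the sample rows are hypothetical and exist only to illustrate the behavior.

```python
import sqlite3

# Hypothetical in-memory table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "shipped", "2024-01-15 09:00:00"),
        (2, None, "2024-01-31 15:30:00"),       # NULL status, late on the last day of January
        (3, "cancelled", "2024-02-01 00:00:00"),
    ],
)

def count(where: str) -> int:
    """Count rows matching a WHERE clause (trusted, hard-coded input only)."""
    return conn.execute(f"SELECT COUNT(*) FROM orders WHERE {where}").fetchone()[0]

# NULL <> 'cancelled' evaluates to NULL, not TRUE, so row 2 is silently dropped.
naive_not_cancelled = count("status <> 'cancelled'")
# Handling NULL explicitly keeps the row.
null_safe_not_cancelled = count("status IS NULL OR status <> 'cancelled'")

# BETWEEN with a date-only upper bound misses timestamps later on the last day.
between_january = count("created_at BETWEEN '2024-01-01' AND '2024-01-31'")
# A half-open range [start, next_start) covers the whole month.
half_open_january = count("created_at >= '2024-01-01' AND created_at < '2024-02-01'")
```

In this sketch the naive filter and the `BETWEEN` filter each return one fewer row than their null-safe and half-open counterparts, which is the kind of discrepancy these questions tend to probe.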

Easy · Technical
You have a country field with values such as 'US', 'usa', 'United States', 'U.S.' and 'United States of America'. As the data analyst responsible for data cleaning, outline how you would implement a mapping solution in the data warehouse to standardize countries. Include the creation of mapping tables, rules for fallback matches, and a schema for capturing unmapped values for review.
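One possible shape for an answer, sketched in Python rather than warehouse SQL: a lookup keyed on a normalized form of the raw value, with unmapped values collected for review. The `COUNTRY_MAP` contents, the function names, and the use of ISO alpha-2 codes as the canonical form are all assumptions made for illustration.

```python
import re

# Hypothetical mapping table: normalized variant -> canonical country code.
COUNTRY_MAP = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "united states of america": "US",
}

def normalize_key(raw: str) -> str:
    """Lowercase, drop everything except letters and spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z ]", "", raw.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def standardize_country(raw, unmapped: dict):
    """Return the canonical code, or None after recording the value for review."""
    if raw is None or not raw.strip():
        return None
    code = COUNTRY_MAP.get(normalize_key(raw))
    if code is None:
        # Stand-in for an "unmapped values" review table: raw value -> occurrence count.
        unmapped[raw] = unmapped.get(raw, 0) + 1
    return code
```

Normalizing before the lookup is the fallback rule here: 'U.S.' and 'usa' resolve without needing their own mapping rows, while a typo such as 'Untied States' lands in `unmapped` for a human to review.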
Hard · System Design
You must generate deterministic surrogate keys for canonicalized customer records across multiple ingest sources to avoid collisions and allow reversible auditing. Design an ID generation scheme that is deterministic, collision-resistant, reversible for authorized auditors, and performs at scale. Address namespace, hashing, and mapping storage considerations.
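A minimal sketch of one such scheme, assuming a SHA-256 hash over a namespaced, canonicalized natural key. Because a hash cannot be inverted, reversibility is modeled here with a separate mapping store that only auditors would query. The class and function names, the `|` delimiter, and the 32-character truncation are hypothetical choices, not a prescribed design.

```python
import hashlib

def canonicalize(source: str, natural_key: str) -> str:
    """Canonical form of a record identity; assumes '|' never appears in the inputs."""
    return f"{source.strip().lower()}|{natural_key.strip().lower()}"

def surrogate_key(namespace: str, source: str, natural_key: str) -> str:
    """Deterministic key: same namespace and canonical input always give the same output."""
    payload = f"{namespace}|{canonicalize(source, natural_key)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]  # 128 bits of the digest

class KeyRegistry:
    """In-memory stand-in for an access-controlled mapping table."""

    def __init__(self, namespace: str):
        self.namespace = namespace
        self._reverse = {}  # surrogate key -> canonical identity

    def assign(self, source: str, natural_key: str) -> str:
        canonical = canonicalize(source, natural_key)
        key = surrogate_key(self.namespace, source, natural_key)
        existing = self._reverse.setdefault(key, canonical)
        if existing != canonical:
            # Two different identities hashed to the same key.
            raise ValueError(f"collision on {key}")
        return key

    def audit_lookup(self, key: str):
        """Reverse lookup intended for authorized auditors only."""
        return self._reverse.get(key)
```

The namespace keeps keys from different entity types or schema versions apart, and storing the reverse mapping on first assignment gives both collision detection and an audit trail.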
Hard · Technical
Discuss the trade-offs between failing fast (rejecting malformed records) versus best-effort processing (ingest and clean later) in a high-throughput streaming ingestion pipeline. Then design a hybrid error-handling approach that minimizes data loss, preserves low latency, and allows later remediation for quarantined records.
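The hybrid approach can be sketched as a router that validates each record inline and diverts failures to a quarantine (dead-letter) store together with the raw payload and a reason, so the main path stays fast and nothing is lost. The field names in `REQUIRED_FIELDS` and the in-memory lists standing in for a sink and a dead-letter queue are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field

# Hypothetical required fields for an incoming event.
REQUIRED_FIELDS = ("event_id", "amount")

@dataclass
class Router:
    accepted: list = field(default_factory=list)     # stand-in for the main sink
    quarantined: list = field(default_factory=list)  # stand-in for a dead-letter queue

    def process(self, raw: str) -> bool:
        """Return True if the record was accepted, False if it was quarantined."""
        try:
            record = json.loads(raw)
            if not isinstance(record, dict):
                raise ValueError("payload is not a JSON object")
            missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            record["amount"] = float(record["amount"])  # cheap inline repair of string amounts
        except (ValueError, TypeError) as exc:
            # Keep the untouched payload so the record can be replayed after a fix.
            self.quarantined.append({"raw": raw, "reason": str(exc)})
            return False
        self.accepted.append(record)
        return True
```

Cheap, unambiguous repairs (such as casting `"12.5"` to a float) happen inline, while anything that cannot be repaired safely is set aside instead of blocking the stream or being dropped.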
Medium · Technical
Define a data-quality SLA for a nightly ETL load that produces the core sales fact table. Specify the key metrics you would track (e.g., row count, null-rate on critical columns, percent change vs baseline, freshness), threshold values for alerts, who gets alerted at each severity, and how you would escalate an incident when an SLA is missed.
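Two of the metrics named above, row count change versus baseline and null rate on critical columns, can be checked with a small function like the sketch below. The threshold defaults (1% null rate, 20% row count change) and the function name are hypothetical placeholders; real values would come from the SLA itself.

```python
def evaluate_load(rows, baseline_row_count, critical_columns,
                  max_null_rate=0.01, max_row_count_change=0.20):
    """Return a list of (metric, observed_value) tuples for every breached threshold.

    rows is a list of dicts; thresholds are illustrative defaults, not recommendations.
    """
    alerts = []
    row_count = len(rows)

    # Percent change in row count versus the baseline load.
    if baseline_row_count > 0:
        change = abs(row_count - baseline_row_count) / baseline_row_count
        if change > max_row_count_change:
            alerts.append(("row_count_change", change))

    # Null rate per critical column; an empty load counts as fully null.
    for col in critical_columns:
        nulls = sum(1 for r in rows if r.get(col) is None)
        rate = nulls / row_count if row_count else 1.0
        if rate > max_null_rate:
            alerts.append((f"null_rate:{col}", rate))

    return alerts
```

The returned tuples could then be mapped to severities and routed to the right on-call group, which is where the alerting and escalation parts of the question come in.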
Medium · Technical
You need to match product SKUs between two systems where SKUs have inconsistent formatting: dashes, leading zeros, and small typos. Explain a practical approach for matching including normalization, tokenization, fuzzy matching algorithms (Levenshtein, Jaro-Winkler), blocking strategies to reduce comparisons, and how you'd evaluate precision and recall for the matching.
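A rough sketch of the pipeline in standard-library Python: normalize, block on a short prefix, score candidates with Levenshtein distance, and evaluate against a labeled set. The two-character blocking key, the `max_distance` default, and all function names are assumptions; Jaro-Winkler and tokenization are omitted to keep the example short.

```python
import re
from collections import defaultdict

def normalize_sku(sku: str) -> str:
    """Strip separators, uppercase, and drop leading zeros."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", sku).upper()
    return cleaned.lstrip("0") or "0"

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def match_skus(left, right, max_distance=1):
    """Map each left SKU to its closest right SKU within the same block."""
    # Blocking: only compare SKUs sharing the first two normalized characters.
    blocks = defaultdict(list)
    for sku in right:
        norm = normalize_sku(sku)
        blocks[norm[:2]].append((sku, norm))

    matches = {}
    for sku in left:
        norm = normalize_sku(sku)
        best = None
        for cand, cand_norm in blocks.get(norm[:2], []):
            d = levenshtein(norm, cand_norm)
            if d <= max_distance and (best is None or d < best[1]):
                best = (cand, d)
        if best is not None:
            matches[sku] = best[0]
    return matches

def precision_recall(predicted: dict, truth: dict):
    """Precision and recall of predicted pairs against hand-labeled true pairs."""
    tp = sum(1 for k, v in predicted.items() if truth.get(k) == v)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall
```

Prefix blocking trades recall for speed: a typo in the first two characters puts a SKU in the wrong block, so in practice one might add a second blocking pass on a different key and measure the effect on recall.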
