InterviewStack.io

Data Quality and Anomaly Detection Questions

Focuses on identifying, diagnosing, and preventing data issues that produce misleading or incorrect metrics. Topics include spotting duplicates, missing values, schema drift, logical inconsistencies, extreme outliers caused by instrumentation bugs, data latency and pipeline failures, and reconciliation differences between sources. Covers validation strategies such as data tests, checksums, row counts, data contracts, invariants, and automated alerting for quality metrics like completeness, accuracy, and timeliness. Also addresses investigation workflows to determine whether anomalies are data problems versus true business signals, documenting remediation steps, and collaborating with engineering and product teams to fix upstream causes.
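
The quality metrics named above (row counts, completeness, duplicates, timeliness) can be illustrated with a small pandas sketch. The function names `quality_report` and `failed_checks`, the column arguments, and the thresholds are all hypothetical choices for illustration, not part of any particular tool.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, key: str, ts_col: str, as_of: str) -> dict:
    """Compute a few table-level quality metrics (illustrative sketch)."""
    ts = pd.to_datetime(df[ts_col])
    total_cells = df.shape[0] * df.shape[1]
    return {
        # Row count: compare against the source system or yesterday's load.
        "row_count": len(df),
        # Completeness: share of non-null cells across the whole table.
        "completeness": float(df.notna().to_numpy().sum() / total_cells) if total_cells else 0.0,
        # Duplicates: rows whose key repeats an earlier row's key.
        "duplicate_keys": int(df.duplicated(subset=key).sum()),
        # Timeliness: hours between the newest record and the check time.
        "staleness_hours": (pd.Timestamp(as_of) - ts.max()).total_seconds() / 3600,
    }


def failed_checks(report: dict, min_completeness: float = 0.99,
                  max_staleness_hours: float = 24.0) -> list:
    """Return the names of metrics that breach the (assumed) thresholds."""
    checks = {
        "completeness": report["completeness"] >= min_completeness,
        "duplicate_keys": report["duplicate_keys"] == 0,
        "staleness_hours": report["staleness_hours"] <= max_staleness_hours,
    }
    return [name for name, ok in checks.items() if not ok]
```

In practice the list returned by `failed_checks` would feed an alerting channel; the thresholds here are placeholders and would need tuning per table.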

Medium · Technical
Given a pandas DataFrame `sample` with columns `id`, `name`, `email`, `updated_at`, and `completeness_score`, write a Python function that groups potential duplicates by `email` and selects one canonical record per group: the row with the highest `completeness_score`, with ties broken by the most recent `updated_at`. Return the canonical rows and a table of merged ids. Provide code or clear pseudocode.
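
One way to approach the question above is sketched below. It is not a reference answer. It assumes emails should be compared case-insensitively after trimming whitespace, and that rows with a missing email are never merged with each other. The function name `select_canonical` and the output column names `canonical_id` and `merged_id` are hypothetical.

```python
import pandas as pd


def select_canonical(df: pd.DataFrame):
    """Return (canonical_rows, merged_ids) with duplicates grouped by email."""
    work = df.copy()
    work["updated_at"] = pd.to_datetime(work["updated_at"])

    # Assumption: normalize emails so "Ann@x.com " and "ann@x.com" match.
    work["email_key"] = work["email"].str.strip().str.lower()
    # Assumption: a missing email gets a per-row key, so such rows stay separate.
    work["email_key"] = work["email_key"].fillna("__missing__" + work["id"].astype(str))

    # Best candidate first within each group: highest score, then newest update.
    ranked = work.sort_values(
        ["email_key", "completeness_score", "updated_at"],
        ascending=[True, False, False],
        kind="mergesort",  # stable sort keeps input order for exact ties
    )
    ranked["canonical_id"] = ranked.groupby("email_key")["id"].transform("first")

    canonical = ranked.drop_duplicates("email_key", keep="first")[list(df.columns)]

    # Every non-canonical row is recorded as merged into its group's canonical id.
    merged = (
        ranked.loc[ranked["id"] != ranked["canonical_id"], ["canonical_id", "id"]]
        .rename(columns={"id": "merged_id"})
        .reset_index(drop=True)
    )
    return canonical.reset_index(drop=True), merged
```

Sorting once and taking the first row per group avoids a per-group Python loop. If exact ties on both score and timestamp matter, an explicit final tiebreaker (for example lowest `id`) would make the choice deterministic regardless of input order.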

Hard · Technical
Propose an automated remediation framework for common data-quality issues: missing partitions, negative amounts, and duplicate events. For each issue describe automatic fix logic (if safe), backfill strategy, risk assessment, audit logging, and criteria for when to require human review instead of auto-fixing.
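
A minimal sketch of how the three issues in the question above might be handled in code follows. It assumes a hypothetical events table with columns `event_id`, `event_date`, and `amount`. The policy shown is one possible choice, not a prescribed one: exact `event_id` repeats are dropped automatically only when they are rare, negative amounts are quarantined rather than altered, and missing daily partitions are only reported as backfill candidates.

```python
import pandas as pd


def remediate_events(events: pd.DataFrame, max_auto_fix_ratio: float = 0.01):
    """Return (clean, quarantined, audit) for a hypothetical events table."""
    audit = []
    n = len(events)
    if n == 0:
        return events, events, pd.DataFrame(audit)

    # Missing partitions: gaps between the first and last observed day.
    # Only internal gaps are visible here; they are reported, never auto-filled.
    dates = pd.to_datetime(events["event_date"]).dt.normalize()
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    missing = expected.difference(pd.DatetimeIndex(dates.unique()))
    audit.append({
        "issue": "missing_partition",
        "rows": 0,
        "detail": ",".join(d.date().isoformat() for d in missing),
        "action": "backfill_required" if len(missing) else "none",
    })

    # Duplicate events: auto-drop only when the share of duplicates is small;
    # a large share suggests an upstream bug that a person should inspect.
    dup_mask = events.duplicated(subset="event_id", keep="first")
    auto_dedup = dup_mask.sum() / n <= max_auto_fix_ratio
    if not dup_mask.any():
        dup_action = "none"
    else:
        dup_action = "auto_drop" if auto_dedup else "human_review"
    audit.append({"issue": "duplicate_event", "rows": int(dup_mask.sum()),
                  "detail": "", "action": dup_action})
    kept = events[~dup_mask] if auto_dedup else events

    # Negative amounts: could be refunds or a bug, so move them aside for review.
    neg_mask = kept["amount"] < 0
    audit.append({"issue": "negative_amount", "rows": int(neg_mask.sum()),
                  "detail": "",
                  "action": "quarantine_for_review" if neg_mask.any() else "none"})

    return kept[~neg_mask], kept[neg_mask], pd.DataFrame(audit)
```

The returned `audit` table is the piece that makes auto-fixing defensible: every run records what was found and what was done, so a reviewer can trace or reverse a change later.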

Medium · System Design
Design a data quality dashboard in Tableau or Power BI for monitoring table-level quality across the analytics ecosystem. Specify layout and components (KPIs, trend charts, top offending tables, drilldowns to sample rows), user personas (data engineer vs product manager), and how drilldowns should surface root cause information.

Hard · Technical
Leadership/incident-management: You're responsible for a data-quality incident where several executive dashboards show incorrect KPIs. Outline an incident response plan covering immediate triage, assigning owners, temporary mitigations (freeze or annotation), cross-team communication, remediation steps, and postmortem actions including documentation and preventive measures.

Hard · Technical
You manage a small analytics team using Snowflake + dbt with limited budget. Propose a pragmatic prioritized set of automated data-quality checks to deploy first. For each check indicate whether it runs at ingest (near real-time) or nightly, the expected compute cost implication, and why it's prioritized (impact vs cost).
