Exploratory Data Analysis is the systematic process of investigating and validating a dataset to understand its structure, content, and quality before modelling or reporting. Core activities include examining schema and data types, computing descriptive statistics such as counts, means, medians, standard deviations and quartiles, and measuring class balance and unique value counts. It covers distribution analysis, outlier detection, correlation and relationship exploration, and trend or seasonality checks for time series. Data validation and quality checks include identifying missing values, anomalies, inconsistent encodings, duplicates, and other data integrity issues. Practical techniques span SQL profiling and aggregation queries using GROUP BY, COUNT and DISTINCT; interactive data exploration with pandas and similar libraries; and visualization with histograms, box plots, scatter plots, heatmaps and time series charts to reveal patterns and issues. The process also includes feature summary and basic metric computation, sampling strategies, forming and documenting hypotheses, and recommending cleaning or transformation steps. Good Exploratory Data Analysis produces a clear record of findings, assumptions to validate, and next steps for cleaning, feature engineering, or modelling.
Medium · System Design
Design a set of automated data quality checks (think of them as unit tests) that would run nightly on a production ETL before allowing EDA and reporting to proceed. Include checks for schema drift, null rate thresholds, distributional changes, cardinality spikes, and referential integrity. For each check describe the metric, a detection method, and an alerting strategy to avoid alert fatigue.
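One possible starting point for an answer is sketched below. It is a minimal, standard-library-only illustration, not a production framework: the table shapes (`orders`, `customers`), the column names, the thresholds, and the `run_checks` helper are all hypothetical. It covers schema drift, null-rate thresholds, cardinality spikes, and referential integrity; distributional checks are left out of this sketch.

```python
def run_checks(orders, customers, expected_columns, max_null_rate,
               baseline_cardinality, spike_factor=2.0):
    """Run batch-level quality checks and return a dict of findings.

    All argument names and thresholds are illustrative assumptions.
    """
    n = len(orders)
    results = {}

    # Schema drift: columns that were added or dropped versus the contract.
    seen = set().union(*(row.keys() for row in orders))
    results["schema_drift"] = sorted(seen ^ set(expected_columns))

    # Null rate: share of missing values per column versus a per-column limit.
    breaches = {}
    for col in expected_columns:
        rate = sum(1 for r in orders if r.get(col) is None) / n if n else 0.0
        if rate > max_null_rate.get(col, 0.0):
            breaches[col] = rate
    results["null_rate_breaches"] = breaches

    # Cardinality spike: distinct count well above a stored baseline.
    spikes = {}
    for col, baseline in baseline_cardinality.items():
        distinct = len({r.get(col) for r in orders if r.get(col) is not None})
        if distinct > spike_factor * baseline:
            spikes[col] = distinct
    results["cardinality_spikes"] = spikes

    # Referential integrity: child keys with no matching parent row.
    parent_keys = {c["customer_id"] for c in customers}
    child_keys = {r["customer_id"] for r in orders
                  if r.get("customer_id") is not None}
    results["orphan_customer_ids"] = sorted(child_keys - parent_keys)

    # Gate: downstream EDA and reporting proceed only if nothing was found.
    results["passed"] = not any(results.values())
    return results


orders = [
    {"order_id": 1, "customer_id": 10, "status": "paid", "amount": 20.0},
    {"order_id": 2, "customer_id": 11, "status": "paid", "amount": None},
    {"order_id": 3, "customer_id": 99, "status": "refunded", "amount": 15.0},
    {"order_id": 4, "customer_id": 10, "status": "PAID", "amount": None},
]
customers = [{"customer_id": 10}, {"customer_id": 11}]

report = run_checks(
    orders,
    customers,
    expected_columns=["order_id", "customer_id", "status", "amount"],
    max_null_rate={"amount": 0.25},
    baseline_cardinality={"status": 1},
)
```

Returning a single report per run, rather than raising on the first failure, makes it easier to send one consolidated alert per night, which is one way to limit alert fatigue.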
Medium · Technical
A date column in your dataset contains inconsistent string formats such as '2024-01-05', 'Jan 5, 2024', and '05/01/2024'. Outline a robust, reproducible approach in Python or SQL to normalize this column to a canonical timestamp, detect unparseable rows, and create a validation report capturing parsing failures and their possible causes.
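An answer could start from something like the sketch below, which uses only the Python standard library. The format list, the `parse_date` and `normalize` helpers, and the choice to read '05/01/2024' as day/month/year are assumptions made for illustration; in practice the day/month order has to be confirmed with the data owner, which is why such rows are also reported as ambiguous.

```python
from datetime import datetime

# Candidate formats, tried in order. Reading slashed dates as
# day/month/year is an assumption, not something the data proves.
FORMATS = ["%Y-%m-%d", "%b %d, %Y", "%d/%m/%Y"]


def parse_date(raw):
    """Return (datetime, matched_format) or (None, failure_reason)."""
    if raw is None or not str(raw).strip():
        return None, "empty"
    text = str(raw).strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt), fmt
        except ValueError:
            continue
    return None, "no matching format"


def normalize(values):
    """Normalize to ISO dates and build a validation report."""
    parsed = []
    report = {"failures": [], "ambiguous": []}
    for i, raw in enumerate(values):
        dt, info = parse_date(raw)
        if dt is None:
            parsed.append(None)
            report["failures"].append({"row": i, "raw": raw, "reason": info})
            continue
        parsed.append(dt.date().isoformat())
        # A slashed date with day <= 12 could equally be month/day/year.
        if info == "%d/%m/%Y" and dt.day <= 12:
            report["ambiguous"].append({"row": i, "raw": raw})
    return parsed, report


sample = ["2024-01-05", "Jan 5, 2024", "05/01/2024", "2024-13-40", ""]
parsed, report = normalize(sample)
```

Keeping the raw value and a reason for every failed row makes the report reproducible and lets reviewers group failures by likely cause.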
Medium · System Design
Explain how you would detect data drift over time for a key metric such as average basket size and propose an alerting strategy that balances sensitivity and false alarms. Include statistical tests, rolling baselines, and how to handle seasonality and holidays in your approach.
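As one illustration of the rolling-baseline idea, the sketch below compares each day's value with the same weekday in previous weeks, so that a regular weekly pattern is not flagged as drift. It is a simplified z-score check, not a full statistical test; the `drift_flags` helper, the 7-day period, the 4-week history, the threshold of 3, and the synthetic series are all assumptions.

```python
from statistics import mean, stdev


def drift_flags(series, period=7, history=4, z_threshold=3.0, skip=frozenset()):
    """Flag points that deviate from a same-weekday rolling baseline.

    `skip` holds indices (for example holidays) that are neither
    evaluated nor used as baseline values.
    """
    flags = []
    for t in range(period * history, len(series)):
        if t in skip:
            continue
        lags = [t - period * k for k in range(1, history + 1)]
        baseline = [series[i] for i in lags if i not in skip]
        if len(baseline) < 2:
            continue  # not enough history for a standard deviation
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            continue  # a flat baseline gives no usable scale
        z = (series[t] - mu) / sigma
        if abs(z) > z_threshold:
            flags.append((t, round(z, 2)))
    return flags


# Synthetic average basket size: a weekly pattern with a small
# alternating wobble, followed by a jump on the final day.
weekly = [20, 21, 22, 21, 20, 30, 31]
series = [weekly[d] + 0.5 * (w % 2) for w in range(6) for d in range(7)]
series[-1] += 10

flags = drift_flags(series)
```

To reduce false alarms, a common refinement is to alert only after several consecutive flagged days, or to route single-day breaches to a dashboard instead of a pager.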
Easy · Technical
You receive a numeric column in a dataset with values including sentinel codes and obvious errors, for example: [100, 102, NaN, 105, -999, 108]. Describe step-by-step how you would detect sentinel values and outliers during EDA, and list at least three actions you might take for sentinel or error values versus legitimate outliers. Which visualizations and summary statistics would you use to support your decisions?
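The first detection steps might look like the sketch below, which separates missing values, sentinel codes, and statistical outliers on the example column. The list of candidate sentinel codes, the `profile` helper, and the modified z-score cutoff of 3.5 (based on the median and the median absolute deviation) are assumptions; real sentinel codes should come from the data dictionary or from value-frequency checks.

```python
import math
from statistics import median

# Hypothetical sentinel codes; confirm against the data dictionary.
SENTINEL_CANDIDATES = {-999, -99, -1, 9999}


def profile(values, threshold=3.5):
    """Split row indices into missing, sentinel, and outlier groups."""
    missing = {
        i for i, v in enumerate(values)
        if v is None or (isinstance(v, float) and math.isnan(v))
    }
    present = [(i, v) for i, v in enumerate(values) if i not in missing]
    sentinels = [i for i, v in present if v in SENTINEL_CANDIDATES]
    clean = [(i, v) for i, v in present if v not in SENTINEL_CANDIDATES]

    # Robust centre and spread, computed after removing sentinels so
    # that codes like -999 do not distort them.
    med = median(v for _, v in clean)
    mad = median(abs(v - med) for _, v in clean)

    outliers = []
    if mad > 0:
        outliers = [
            i for i, v in clean
            if 0.6745 * abs(v - med) / mad > threshold
        ]
    return {
        "missing": sorted(missing),
        "sentinels": sentinels,
        "outliers": outliers,
        "median": med,
        "mad": mad,
    }


values = [100, 102, float("nan"), 105, -999, 108]
summary = profile(values)
```

Handling sentinels before computing statistics matters here: with -999 left in, the mean and standard deviation would be dominated by a code that is not a measurement at all.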
Easy · Technical
Describe step-by-step how to create a pivot table in Excel to show monthly revenue per product category from columns: date, product_category, revenue. Explain how to add a computed field to calculate month-over-month percentage change and how to use conditional formatting to highlight decreases greater than 10%. Mention any limitations to be aware of when using Excel for repeated EDA.