Exploratory Data Analysis (EDA) is the systematic process of investigating and validating a dataset to understand its structure, content, and quality before modelling or reporting. Core activities include examining schema and data types, computing descriptive statistics such as counts, means, medians, standard deviations and quartiles, and measuring class balance and unique value counts. EDA also covers distribution analysis, outlier detection, correlation and relationship exploration, and trend or seasonality checks for time series. Data validation and quality checks include identifying missing values, anomalies, inconsistent encodings, duplicates, and other data integrity issues. Practical techniques span SQL profiling and aggregation queries using GROUP BY, COUNT and DISTINCT; interactive data exploration with pandas and similar libraries; and visualization with histograms, box plots, scatter plots, heatmaps and time series charts to reveal patterns and issues. The process also includes feature summaries and basic metric computation, sampling strategies, forming and documenting hypotheses, and recommending cleaning or transformation steps. Good EDA produces a clear record of findings, assumptions to validate, and next steps for cleaning, feature engineering, or modelling.
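As a rough illustration of the profiling activities described above, the following sketch uses pandas on a small made-up table. The column names (`user_id`, `age`, `plan`) and the `profile` helper are hypothetical and exist only for this example; a real dataset would need its own checks.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "age": [34.0, 29.0, 29.0, None, 51.0],
    "plan": ["free", "pro", "pro", "free", "Free"],
})


def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, missing count, and unique value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


summary = profile(df)

# Fully identical rows (every column matches an earlier row).
n_duplicate_rows = int(df.duplicated().sum())

# Category balance; "free" and "Free" show up as separate values,
# which hints at an inconsistent encoding worth documenting.
plan_counts = df["plan"].value_counts()

# Count, mean, std, min, quartiles, and max for a numeric column.
age_stats = df["age"].describe()

print(summary)
print(plan_counts)
print(age_stats)
```

The output of a sketch like this is a starting point for the written record of findings, not a replacement for it.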
Medium | Technical
Your model's performance on a held-out test set is unexpectedly high. During EDA, outline a checklist of checks and experiments to detect possible data leakage: compare feature distributions across train/val/test, look for duplicate rows across splits, verify time-based splits, run simple models on single features, and perform permutation tests. Explain how each check helps identify leakage.
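Two of the checks named in this question, duplicate rows across splits and single-feature screening, can be sketched with pandas as below. The data, the column names `f_noise` and `f_leaky`, and the helper `single_feature_auc` are all hypothetical. The rank-based AUC here is a stand-in for fitting a simple one-feature model; it is not the only way to run that check.

```python
import pandas as pd

# Hypothetical splits; f_leaky is constructed to mirror the label exactly.
train = pd.DataFrame({
    "f_noise": [0.1, 0.2, 0.5, 0.4, 0.7, 0.9],
    "f_leaky": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "label":   [0, 1, 0, 1, 0, 1],
})
test = pd.DataFrame({
    "f_noise": [0.5, 0.3, 0.8],
    "f_leaky": [0.0, 1.0, 1.0],
    "label":   [0, 1, 1],
})
feature_cols = ["f_noise", "f_leaky"]

# Check 1: test rows whose feature values also appear in train.
# A high overlap rate suggests the split did not separate examples.
overlap = test.merge(
    train[feature_cols].drop_duplicates(), on=feature_cols, how="inner"
)
overlap_rate = len(overlap) / len(test)


def single_feature_auc(x: pd.Series, y: pd.Series) -> float:
    """ROC AUC of one feature against a binary label, via rank sums."""
    ranks = x.rank(method="average")
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    rank_sum_pos = ranks[y == 1].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Check 2: a feature that separates the classes almost perfectly on its
# own (AUC near 1.0 or 0.0) deserves a closer look for leakage.
aucs = {col: single_feature_auc(train[col], train["label"]) for col in feature_cols}

print(f"overlap rate: {overlap_rate:.2f}")
print(aucs)
```

A suspiciously strong single feature is not proof of leakage; it is a prompt to ask how and when that feature is produced relative to the label.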
Medium | Technical
You are analyzing an object detection dataset with bounding boxes and class labels. What EDA steps would you perform to understand instance-level and image-level class balance, bounding box area and aspect-ratio distributions, occlusion/missing boxes, counts per image, and label overlap? Describe visualizations (violin/histogram/heatmap) and derived statistics, and how findings would drive augmentation or sampling strategies.
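The derived statistics mentioned in this question can be sketched with the Python standard library alone. The annotation layout below, `(image_id, class_name, x, y, width, height)` in pixels, is an assumption for illustration; real datasets (COCO, Pascal VOC, and others) each have their own formats.

```python
from collections import Counter
from statistics import median

# Hypothetical annotations: (image_id, class_name, x, y, width, height).
annotations = [
    ("img1", "car", 10, 20, 100, 50),
    ("img1", "car", 150, 30, 80, 40),
    ("img1", "person", 60, 10, 20, 60),
    ("img2", "person", 5, 5, 30, 90),
]
all_images = ["img1", "img2", "img3"]  # img3 has no boxes at all

# Instance-level balance: how many boxes each class has.
instance_counts = Counter(cls for _img, cls, _x, _y, _w, _h in annotations)

# Image-level balance: how many distinct images contain each class.
image_class_pairs = {(img, cls) for img, cls, *_rest in annotations}
image_counts = Counter(cls for _img, cls in image_class_pairs)

# Boxes per image, including images with zero boxes.
boxes_per_image = {img: 0 for img in all_images}
for img, *_rest in annotations:
    boxes_per_image[img] += 1
images_without_boxes = [img for img, n in boxes_per_image.items() if n == 0]

# Per-box geometry: area and width/height aspect ratio.
box_stats = [
    {"class": cls, "area": w * h, "aspect_ratio": w / h}
    for _img, cls, _x, _y, w, h in annotations
]
median_area = median(b["area"] for b in box_stats)

print(instance_counts, image_counts)
print(boxes_per_image, images_without_boxes)
print(median_area)
```

Images with zero boxes may be true negatives or missing annotations; the counts alone cannot tell these apart, so a manual sample review is usually needed.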
Hard | Technical
You inherit a named-entity recognition (NER) dataset annotated in BIO format but find overlapping spans, inconsistent entity types, and non-canonical whitespace. Outline an EDA plan to detect schema violations and quantify their frequency (e.g., percent of examples with overlapping spans), provide pandas/regex checks to detect common issues, and recommend remediation steps (automatic normalization, rule-based fixes, or re-annotation/adjudication).
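A minimal sketch of regex-based schema checks for BIO tags follows. It assumes each example is a pair of token and tag lists and that the allowed entity types are `PER`, `ORG`, and `LOC`; both the data layout and the type set are assumptions for this example. It covers orphan `I-` tags, unknown types, and whitespace problems, and does not attempt to detect overlapping spans, which depend on how the source annotations were stored.

```python
import re
from collections import Counter

ALLOWED_TYPES = {"PER", "ORG", "LOC"}  # assumed label set for this sketch
TAG_RE = re.compile(r"^(?:O|[BI]-([A-Z]+))$")
WHITESPACE_RE = re.compile(r"\s{2,}|[\t\u00a0]")


def check_sentence(tokens, tags):
    """Return a list of issue names found in one BIO-tagged sentence."""
    if len(tokens) != len(tags):
        return ["length_mismatch"]
    issues = []
    prev = "O"
    for token, tag in zip(tokens, tags):
        match = TAG_RE.match(tag)
        if match is None:
            issues.append("malformed_tag")
            prev = "O"
            continue
        entity_type = match.group(1)
        if entity_type is not None and entity_type not in ALLOWED_TYPES:
            issues.append("unknown_type")
        # An I- tag must continue a B-/I- tag of the same entity type.
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            issues.append("orphan_inside")
        if token != token.strip() or WHITESPACE_RE.search(token):
            issues.append("bad_whitespace")
        prev = tag
    return issues


# Hypothetical examples, three of which contain one violation each.
dataset = [
    (["Ada", "Lovelace", "visited", "London"], ["B-PER", "I-PER", "O", "B-LOC"]),
    (["Acme", "Corp"], ["I-ORG", "I-ORG"]),
    (["New ", "York"], ["B-LOC", "I-LOC"]),
    (["Paris"], ["B-GPE"]),
]

per_example = [check_sentence(tokens, tags) for tokens, tags in dataset]
issue_counts = Counter(issue for issues in per_example for issue in issues)
percent_flagged = 100 * sum(1 for issues in per_example if issues) / len(dataset)

print(issue_counts)
print(f"{percent_flagged:.1f}% of examples have at least one issue")
```

Per-issue frequencies like these help decide between automatic normalization (cheap, suited to whitespace) and re-annotation (costly, suited to ambiguous entity types).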
Easy | Technical
You discover that about 15% of rows in a tabular training dataset look duplicated. Describe the steps and SQL/pandas queries you would use to detect exact duplicates and near-duplicates, how you would confirm whether duplicates are legitimate repetitions or errors (use timestamps, IDs, business rules), and how you'd decide which duplicates to remove or keep for model training.
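One possible pandas approach to the detection step is sketched below, using a made-up orders table. The column names and the normalization rule (trim and lowercase the customer name) are assumptions; which columns define a "near-duplicate" is a business decision. In SQL, the exact-duplicate check is typically a `GROUP BY` over all columns with `HAVING COUNT(*) > 1`.

```python
import pandas as pd

# Hypothetical orders table; values are illustrative only.
df = pd.DataFrame({
    "order_id": [101, 101, 102, 103, 104],
    "customer": ["Ann Lee", "Ann Lee", "ann lee ", "Bo Chen", "Bo Chen"],
    "amount": [20.0, 20.0, 20.0, 35.0, 35.0],
    "created_at": pd.to_datetime([
        "2024-01-05 10:00", "2024-01-05 10:00", "2024-01-05 10:00",
        "2024-01-06 09:00", "2024-02-01 09:00",
    ]),
})

# Exact duplicates: every column matches another row.
exact_mask = df.duplicated(keep=False)

# Near-duplicates: same normalized customer, amount, and timestamp,
# regardless of order_id. The normalization rule is an assumption.
normalized = df.assign(customer_norm=df["customer"].str.strip().str.lower())
near_mask = normalized.duplicated(
    subset=["customer_norm", "amount", "created_at"], keep=False
)

# Rows that are near-duplicates but not exact ones need manual review.
review_rows = df[near_mask & ~exact_mask]

# Removing only exact duplicates is the conservative first step.
deduped = df.drop_duplicates()

print(review_rows)
print(len(df), "->", len(deduped))
```

In this toy data the two "Bo Chen" rows share a customer and amount but differ in timestamp, so they are not flagged; that pattern looks like a legitimate repeat purchase rather than an error.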
Easy | Technical
Explain the difference between Missing Completely at Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR) in the context of labels and sensor data for AI applications. Provide concrete examples (e.g., sensor dropout, biased survey responses), explain implications for imputation and modeling, and list diagnostic checks you would run during EDA to infer the missingness mechanism.
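One simple diagnostic for this question is to compare missingness rates across an observed variable, sketched here with pandas. The `device` and `temperature` columns are hypothetical. A large gap between groups is evidence against MCAR and is consistent with MAR, but observed data alone cannot rule out MNAR.

```python
import pandas as pd

# Hypothetical sensor readings; device B drops out far more often.
df = pd.DataFrame({
    "device": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "temperature": [21.0, 22.5, 20.9, 21.7, None, 23.1, None, None],
})

# Missingness indicator for the column under study.
df["temp_missing"] = df["temperature"].isna()

overall_rate = df["temp_missing"].mean()

# If missingness were MCAR, the rate should be similar in every group
# (up to sampling noise, which this tiny example ignores).
missing_rate_by_device = df.groupby("device")["temp_missing"].mean()

print(f"overall missing rate: {overall_rate:.3f}")
print(missing_rate_by_device)
```

With real data, a formal test or a model predicting the missingness indicator from the other features would give a firmer answer than eyeballing group rates.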