Exploratory Data Analysis Questions

Exploratory Data Analysis (EDA) is the systematic process of investigating and validating a dataset to understand its structure, content, and quality before modeling or reporting. Core activities include examining schema and data types; computing descriptive statistics such as counts, means, medians, standard deviations, and quartiles; and measuring class balance and unique value counts. EDA also covers distribution analysis, outlier detection, correlation and relationship exploration, and trend or seasonality checks for time series. Data validation and quality checks identify missing values, anomalies, inconsistent encodings, duplicates, and other data integrity issues. Practical techniques span SQL profiling and aggregation queries using GROUP BY, COUNT, and DISTINCT; interactive exploration with pandas and similar libraries; and visualization with histograms, box plots, scatter plots, heatmaps, and time series charts to reveal patterns and issues. The process also includes summarizing features and computing basic metrics, choosing sampling strategies, forming and documenting hypotheses, and recommending cleaning or transformation steps. Good EDA produces a clear record of findings, assumptions to validate, and next steps for cleaning, feature engineering, or modeling.
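As a rough illustration of the profiling activities described above, a first pass in pandas might look like the sketch below. The helper name `quick_profile` and the toy columns are hypothetical, chosen only for this example; a real profile would usually add plots and type-specific checks.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, share of missing values, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean() * 100,
        "n_unique": df.nunique(dropna=True),
    })


# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "country": ["US", "us", None, "DE"],
    "amount": [10.0, 250.0, 250.0, None],
})

profile = quick_profile(df)
print(profile)

# Descriptive statistics (count, mean, std, quartiles) for numeric columns.
print(df.describe())

# Number of rows that repeat an earlier row across all columns.
n_duplicate_rows = int(df.duplicated().sum())
print(n_duplicate_rows)

# Category frequencies, including missing values; this kind of listing is
# where inconsistent encodings such as "US" vs "us" tend to show up.
print(df["country"].value_counts(dropna=False))
```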

Medium · Technical

Design a reusable EDA template or notebook in Python that, given a pandas DataFrame, produces a standardized profiling report including: inferred schema/type hints, missingness matrix, per-column descriptive stats, histograms, top categories, pairwise correlations, and notable anomalies. Which libraries would you use, how would you organize outputs, and how would you ensure reproducibility and CI integration?

Medium · Technical

Given a table events(user_id, event_date DATE, revenue DECIMAL), write ANSI SQL using window functions to compute a 7-day rolling average revenue per user and flag days where the daily revenue is greater than rolling_mean + 2 * rolling_stddev for that user. Include partitioning by user and appropriate window framing.
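The question asks for SQL, but the same logic can be prototyped in pandas to sanity-check a query's output. The sketch below is not the SQL answer itself. It assumes one row per user per day, so a 7-row window stands in for the SQL frame `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`. The function name `flag_revenue_spikes` and the toy data are hypothetical.

```python
import pandas as pd


def flag_revenue_spikes(events: pd.DataFrame, window: int = 7, k: float = 2.0) -> pd.DataFrame:
    """Add per-user rolling mean/std and a spike flag.

    Assumption: exactly one row per user per day, so `window` rows == `window` days.
    """
    out = events.sort_values(["user_id", "event_date"]).reset_index(drop=True)
    by_user = out.groupby("user_id")["revenue"]

    # The window covers the current row and up to (window - 1) preceding rows.
    out["rolling_mean"] = by_user.transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    # pandas uses the sample standard deviation (ddof=1), so it is NaN
    # until at least two observations are in the window.
    out["rolling_std"] = by_user.transform(
        lambda s: s.rolling(window, min_periods=2).std()
    )
    # Comparisons against NaN evaluate to False, so early rows are never flagged.
    out["is_spike"] = out["revenue"] > out["rolling_mean"] + k * out["rolling_std"]
    return out


# Hypothetical toy data, for illustration only.
events = pd.DataFrame({
    "user_id": [1] * 7 + [2] * 4,
    "event_date": list(pd.date_range("2024-01-01", periods=7))
    + list(pd.date_range("2024-01-01", periods=4)),
    "revenue": [10, 11, 9, 10, 12, 10, 100, 5, 6, 5, 6],
})

flagged = flag_revenue_spikes(events)
print(flagged[flagged["is_spike"]])
```

Because the window includes the current day, a spike inflates its own mean and standard deviation, which damps the signal. Some analysts prefer a frame that ends at the previous row for that reason.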

Hard · Technical

Design and implement a reproducible EDA pipeline in Python that produces versioned profiling artifacts for each dataset snapshot. Describe directory structure, use of data versioning (DVC or Delta Lake), deterministic sampling seeds, logging, unit tests for distributional checks, and strategies to mask or redact PII while preserving useful aggregate statistics.
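Two pieces of such a pipeline, deterministic sampling and PII masking, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a full pipeline. The helper names `snapshot_sample` and `pseudonymize`, the salt value, and the toy columns are all hypothetical. Salted hashing is pseudonymization rather than anonymization, so it may not satisfy stricter privacy requirements.

```python
import hashlib

import pandas as pd


def snapshot_sample(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Draw the same sample on every run for a given snapshot and seed."""
    return df.sample(n=min(n, len(df)), random_state=seed)


def pseudonymize(values: pd.Series, salt: str) -> pd.Series:
    """Replace each value with a salted SHA-256 token, leaving missing values as-is.

    Equal inputs map to equal tokens, so distinct counts and group sizes
    are preserved while the raw values are hidden.
    """
    def _token(value):
        if pd.isna(value):
            return value
        digest = hashlib.sha256(f"{salt}{value}".encode("utf-8")).hexdigest()
        return digest[:16]

    return values.map(_token)


# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com", None],
    "revenue": [10.0, 20.0, 30.0, 40.0],
})

masked = df.assign(email=pseudonymize(df["email"], salt="hypothetical-salt"))
sample_a = snapshot_sample(masked, n=2, seed=7)
sample_b = snapshot_sample(masked, n=2, seed=7)

# Aggregates by the masked key match aggregates by the raw key.
print(masked.groupby("email")["revenue"].sum())
```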

Easy · Technical

You observe a numeric column 'purchase_amount' with a heavy right skew and a long tail. During EDA, list the steps you would take to visualize and quantify skewness, identify extreme outliers, and prepare the variable for modeling. Discuss transformations (log, Box-Cox), winsorization, binning, and when to prefer each approach.
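One way to start on the quantitative part of this question is sketched below, assuming a synthetic log-normal sample stands in for the real `purchase_amount` column. The percentile cutoffs and bin labels are illustrative choices, not recommendations. Box-Cox (for example `scipy.stats.boxcox`, which requires strictly positive values) is left out to keep the sketch limited to pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample; real data would come from the dataset under study.
rng = np.random.default_rng(0)
purchase_amount = pd.Series(
    rng.lognormal(mean=3.0, sigma=1.0, size=5_000), name="purchase_amount"
)

# Quantify skewness before and after a log transform.
# log1p is used so that zero amounts, if present, remain defined.
raw_skew = purchase_amount.skew()
log_skew = np.log1p(purchase_amount).skew()
print(f"skew raw={raw_skew:.2f}, log1p={log_skew:.2f}")

# Flag extreme values with the IQR rule (upper fence only, since the tail is on the right).
q25, q75 = purchase_amount.quantile([0.25, 0.75])
upper_fence = q75 + 1.5 * (q75 - q25)
n_extreme = int((purchase_amount > upper_fence).sum())
print(f"{n_extreme} values above the upper fence")

# Winsorize: cap values at the 1st and 99th percentiles instead of dropping rows.
p01, p99 = purchase_amount.quantile([0.01, 0.99])
winsorized = purchase_amount.clip(lower=p01, upper=p99)

# Quantile binning: equal-frequency buckets that are insensitive to the tail.
bins = pd.qcut(purchase_amount, q=4, labels=["low", "mid_low", "mid_high", "high"])
print(bins.value_counts().sort_index())
```

A histogram or box plot of the raw and log-transformed series (for instance via `Series.plot.hist`) would normally accompany these numbers.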

Easy · Technical

Write a Python function using pandas that, given a DataFrame and a list of key columns, returns: (a) the number of duplicate groups (by keys), (b) a sample of duplicate rows, and (c) a deduplicated DataFrame keeping the first occurrence. Provide function signature, edge-case handling, and a brief usage example.
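A possible shape for such a function is sketched below. The name `summarize_duplicates`, its parameters, and the toy data are hypothetical. The sample is taken deterministically by sorting on the keys, not by drawing at random; this is a design choice for this sketch, so that rows from the same duplicate group appear next to each other.

```python
from typing import List, Tuple

import pandas as pd


def summarize_duplicates(
    df: pd.DataFrame, keys: List[str], sample_size: int = 5
) -> Tuple[int, pd.DataFrame, pd.DataFrame]:
    """Return (number of duplicate groups, sample of duplicate rows, deduplicated frame)."""
    if not keys:
        raise ValueError("keys must contain at least one column name")
    missing = [k for k in keys if k not in df.columns]
    if missing:
        raise KeyError(f"key columns not found in DataFrame: {missing}")

    # keep=False marks every member of a duplicate group, not just the repeats.
    dup_rows = df[df.duplicated(subset=keys, keep=False)]

    # One row per distinct key combination among the duplicates.
    # Missing key values are treated as equal to each other here.
    n_groups = len(dup_rows.drop_duplicates(subset=keys))

    # Deterministic sample: sort by keys so group members are adjacent.
    sample = dup_rows.sort_values(keys, kind="stable").head(sample_size)

    deduped = df.drop_duplicates(subset=keys, keep="first")
    return n_groups, sample, deduped


# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-01",
                   "2024-01-02", "2024-01-02", "2024-01-03"],
    "revenue": [10, 10, 5, 7, 8, 9],
})

n_groups, sample, deduped = summarize_duplicates(df, ["user_id", "event_date"])
print(n_groups)
print(sample)
print(deduped)
```

An empty DataFrame passes through without error and yields zero groups, while an empty key list or an unknown key column raises early so the caller sees the problem immediately.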
