InterviewStack.io

Exploratory Data Analysis Questions

Exploratory Data Analysis (EDA) is the systematic process of investigating and validating a dataset to understand its structure, content, and quality before modelling or reporting. Core activities include examining the schema and data types; computing descriptive statistics such as counts, means, medians, standard deviations, and quartiles; and measuring class balance and unique value counts. It also covers distribution analysis, outlier detection, correlation and relationship exploration, and trend or seasonality checks for time series.

Data validation and quality checks identify missing values, anomalies, inconsistent encodings, duplicates, and other data integrity issues. Practical techniques span SQL profiling and aggregation queries using GROUP BY, COUNT, and DISTINCT; interactive exploration with pandas and similar libraries; and visualization with histograms, box plots, scatter plots, heatmaps, and time series charts to reveal patterns and issues. The process also includes feature summaries and basic metric computation, sampling strategies, forming and documenting hypotheses, and recommending cleaning or transformation steps.

Good Exploratory Data Analysis produces a clear record of findings, assumptions to validate, and next steps for cleaning, feature engineering, or modelling.

Medium · Technical
Implement a Python function that selects the top 5 numeric features by variance from a DataFrame df and produces a seaborn pairplot (scatter with regression lines and histograms) for those features. Ensure the function subsamples to at most 20k rows to keep plots readable and uses stratified sampling if a label column is provided. The function should save the plot to PNG.
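One way this could be sketched in pandas, with the seaborn import kept lazy so the selection and sampling steps run on their own. Function and parameter names (`select_top_variance`, `stratified_subsample`, `max_rows`, `out_path`) are illustrative choices, not a required interface:

```python
import pandas as pd


def select_top_variance(df, n_features=5, label_col=None):
    """Pick the n numeric columns with the largest variance (label excluded)."""
    numeric = df.select_dtypes(include="number")
    if label_col in numeric.columns:
        numeric = numeric.drop(columns=[label_col])
    return numeric.var().nlargest(n_features).index.tolist()


def stratified_subsample(df, max_rows=20_000, label_col=None, seed=0):
    """Cap the frame at max_rows, stratifying on label_col when provided."""
    if len(df) <= max_rows:
        return df
    if label_col is None:
        return df.sample(n=max_rows, random_state=seed)
    # Same sampling fraction per class keeps label proportions roughly intact;
    # the result may land a few rows above or below max_rows due to rounding.
    frac = max_rows / len(df)
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))


def top_variance_pairplot(df, label_col=None, out_path="pairplot.png"):
    """Save a pairplot (regression scatters off-diagonal, histograms on it)."""
    import seaborn as sns  # lazy import: the data steps above run without it
    cols = select_top_variance(df, label_col=label_col)
    sample = stratified_subsample(df, label_col=label_col)
    grid = sns.pairplot(sample[cols + ([label_col] if label_col else [])],
                        hue=label_col, kind="reg", diag_kind="hist")
    grid.savefig(out_path)
    return cols
```

Splitting selection and sampling out of the plotting function keeps the expensive or environment-dependent part (seaborn rendering) isolated and the data logic unit-testable.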
Medium · Technical
Write PostgreSQL SQL that computes a 7-day moving average and flags anomaly days in table daily_metrics(date DATE, metric_value DOUBLE PRECISION) where the metric deviates by more than 3 standard deviations from the trailing 30-day mean. Return columns: date, metric_value, moving_avg_7d, trailing_mean_30d, trailing_std_30d, z_score, is_anomaly. Use window functions and explain performance considerations on large tables.
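A sketch of one possible answer using window functions. Excluding the current row from the trailing 30-day window (so a spike cannot mask itself) is a design choice, not the only valid reading of the prompt:

```sql
-- Sketch against the stated schema daily_metrics(date, metric_value).
WITH stats AS (
    SELECT
        date,
        metric_value,
        AVG(metric_value) OVER (
            ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS moving_avg_7d,
        AVG(metric_value) OVER (
            ORDER BY date ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
        ) AS trailing_mean_30d,
        STDDEV_SAMP(metric_value) OVER (
            ORDER BY date ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
        ) AS trailing_std_30d
    FROM daily_metrics
)
SELECT
    date,
    metric_value,
    moving_avg_7d,
    trailing_mean_30d,
    trailing_std_30d,
    (metric_value - trailing_mean_30d)
        / NULLIF(trailing_std_30d, 0) AS z_score,
    ABS(metric_value - trailing_mean_30d)
        > 3 * NULLIF(trailing_std_30d, 0) AS is_anomaly
FROM stats
ORDER BY date;
```

On performance: all three window frames share the same ordering, so the planner needs a single sort by date, which a B-tree index on date can satisfy without an explicit sort. On very large tables the usual levers are partitioning by time range and pre-aggregating to one row per day before windowing.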
Hard · Technical
Provide PySpark pseudocode or high-level implementation to compute for a large Parquet dataset on S3: exact distinct counts per column, null counts, top-k frequent values per column, and approximate quantiles for numeric columns. Describe handling of skewed keys, memory tuning, and trade-offs between exact and approximate methods (e.g., HyperLogLog, t-digest).
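A pseudocode-level PySpark sketch of such a profiling pass; the `spark` session, the S3 path, and k = 10 are assumptions, and the per-column loop trades latency for simplicity:

```python
# High-level sketch: assumes a running SparkSession bound to `spark`.
from pyspark.sql import functions as F

df = spark.read.parquet("s3://bucket/path/")  # placeholder path

profile = {}
for c in df.columns:
    profile[c] = {
        # Exact distinct count shuffles once per column; for a cheaper pass,
        # F.approx_count_distinct (HyperLogLog) trades ~2% error for speed.
        "distinct": df.select(F.countDistinct(c)).first()[0],
        "nulls": df.filter(F.col(c).isNull()).count(),
        # Top-k frequent values: aggregate, sort, take k. If one key dominates
        # (skew), salt the key before the groupBy and re-aggregate.
        "top_k": (df.groupBy(c).count()
                    .orderBy(F.desc("count")).limit(10).collect()),
    }

numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("integer", "long",
                                             "float", "double")]
# approxQuantile uses a bounded-memory sketch with a tunable relative error;
# t-digest is a common alternative when merging sketches across partitions.
quantiles = {c: df.approxQuantile(c, [0.25, 0.5, 0.75], 0.01)
             for c in numeric_cols}
```

In practice one would batch the null counts and distinct counts into a single `agg` over all columns to avoid repeated scans; the loop above is kept for readability.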
Easy · Technical
You have a numeric feature 'price'. As part of EDA, list and justify at least four different visualizations or transformations you would use to understand its distribution and outliers (e.g., histogram, boxplot, log transform). For each, explain what it reveals, what parameters you'd choose (bin count, log scale), and when to use that approach.
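Alongside the plots themselves, a small numeric companion can justify the parameter choices the question asks about. This sketch (function and key names are illustrative) computes a Freedman-Diaconis bin width for the histogram, IQR fences for the boxplot, and skewness before and after a log transform:

```python
import numpy as np


def price_distribution_diagnostics(prices):
    """Illustrative diagnostics for a positive-skewed numeric feature."""
    x = np.asarray(prices, dtype=float)
    x = x[~np.isnan(x)]

    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Freedman-Diaconis rule: a robust default histogram bin width.
    bin_width = 2 * iqr / len(x) ** (1 / 3)

    # Boxplot whisker fences: points outside are flagged as outliers.
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((x < lo) | (x > hi)).sum())

    def skew(v):
        v = v - v.mean()
        return float((v ** 3).mean() / (v ** 2).mean() ** 1.5)

    # log1p compresses the right tail; only meaningful for non-negative prices.
    log_skew = skew(np.log1p(x)) if (x >= 0).all() else None

    return {"bin_width": bin_width, "fences": (lo, hi),
            "n_outliers": n_outliers, "skew": skew(x), "log_skew": log_skew}
```

Comparing `skew` to `log_skew` gives a concrete argument for plotting on a log scale: if the log transform drives skewness toward zero, the feature is better inspected (and possibly modelled) in log space.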
Easy · Technical
Describe practical strategies to detect and handle duplicate records in a dataset. Include examples of exact duplicates, near-duplicates (fuzzy), and strategies for deduplication such as fingerprinting, hashing, blocking, and manual review. Mention how you would log and validate deduplication decisions during EDA.
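A minimal pandas sketch of the exact and fingerprint-based checks mentioned above. The normalisation rules (lowercase, strip whitespace) and the `key_cols` parameter are illustrative choices one would tailor to the data:

```python
import hashlib

import pandas as pd


def dedup_report(df, key_cols=None):
    """Illustrative exact + fingerprint duplicate checks for an EDA pass."""
    report = {}

    # Exact duplicates: full-row comparison, first occurrence kept.
    exact_mask = df.duplicated(keep="first")
    report["exact_duplicates"] = int(exact_mask.sum())

    # Fingerprinting: hash a normalised view of the key columns so that
    # case and surrounding-whitespace differences collapse together.
    cols = key_cols or list(df.columns)

    def fingerprint(row):
        norm = "|".join(str(row[c]).strip().lower() for c in cols)
        return hashlib.sha1(norm.encode()).hexdigest()

    fp = df.apply(fingerprint, axis=1)
    report["near_duplicates"] = int(fp.duplicated(keep="first").sum())

    # Keep the survivors, but return them alongside the counts: during EDA
    # the decision should be logged and reviewable, not silently applied.
    report["kept"] = df.loc[~fp.duplicated(keep="first")]
    return report
```

For fuzzy matching beyond normalisation (typos, reordered tokens), the fingerprint would typically be replaced by a blocking key plus a string-similarity comparison within each block, which this sketch does not attempt.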
