python

pandas

eda

data-analysis

interview-prep

Exploratory Data Analysis (EDA) Workflow with pandas

By Shashikant·29 June 2026·4 min read

Exploratory Data Analysis (EDA) Workflow with pandas

You have loaded your data. Now what? Jumping straight into charts or a machine-learning model is how analysts get burned by silent data-quality problems. Exploratory Data Analysis, or EDA, is the disciplined step where you get to know your dataset first.

This guide gives you a repeatable pandas EDA workflow you can run on almost any dataset. We will use a fictional e-commerce orders table from an Indian D2C brand, with columns like order_id, city, category, amount, payment_mode and order_date.

Step 1: Understand the shape and structure

Always start with the big picture: how big is the data and what types are the columns.

import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

print(df.shape)        # (rows, columns)
print(df.info())       # dtypes, non-null counts, memory
print(df.head())       # eyeball the first rows
print(df.columns.tolist())

df.info() is the single most useful first command. It tells you the row count, every column's data type, and how many non-null values each column has — which is your first hint about missing data. If amount shows up as object instead of float, you probably have stray text or commas to clean.

Step 2: Summary statistics with describe

describe() gives you the statistical shape of your numeric columns in one line.

print(df.describe())

You get count, mean, standard deviation, min, the quartiles, and max. Read it carefully:

If min for amount is negative, you have refunds or data errors.
If max is wildly larger than the 75th percentile, you have outliers (more on that below).
If count is lower than your row count, that column has missing values.

For text columns, add include="object" to see counts, unique values, and the most frequent value.

print(df.describe(include="object"))

Step 3: Explore categories with value_counts

For every categorical column, value_counts() shows you the distribution. This catches dirty data instantly.

print(df["city"].value_counts())
print(df["payment_mode"].value_counts(normalize=True))  # as proportions

In real Indian datasets you will often see "Bengaluru", "Bangalore", and "bangalore " as three separate values — the same city written three ways. value_counts() exposes this so you can standardise. normalize=True is great for quick insights, like "62% of orders are UPI, 28% COD."

For high-cardinality columns, look at just the top values.

print(df["category"].value_counts().head(10))

Step 4: Find missing values

Missing data quietly distorts every metric. Quantify it before you do anything else.

missing = df.isnull().sum()
missing_pct = (df.isnull().mean() * 100).round(2)

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary[summary["missing"] > 0])

Now decide a strategy per column:

Drop rows if very few are missing and they are not critical: df.dropna(subset=["amount"]).
Fill with a sensible value: df["city"] = df["city"].fillna("Unknown").
Impute numerics with median (robust to outliers): df["amount"].fillna(df["amount"].median()).

Never silently drop large chunks of data without understanding why it is missing.

Step 5: Detect outliers

Outliers can be genuine (a bulk corporate order) or errors (a misplaced decimal). The IQR method is a simple, defensible approach.

q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1

low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr

outliers = df[(df["amount"] < low) | (df["amount"] > high)]
print(f"Found {len(outliers)} outlier orders")

Investigate before deleting. A ₹2,00,000 order might be a real wedding-season bulk purchase, not a typo.

Step 6: Examine relationships with correlations

For numeric columns, corr() shows how variables move together.

print(df.corr(numeric_only=True))

Values range from -1 to +1. A correlation near +1 means two columns rise together; near -1 means one rises as the other falls; near 0 means little linear relationship. For example, quantity and amount will usually correlate strongly. Remember the classic caution: correlation is not causation.

Step 7: Group and slice for insight

EDA is not only statistics — it is asking simple business questions.

# Average order value by city
print(df.groupby("city")["amount"].mean().sort_values(ascending=False))

# Monthly revenue trend
df["month"] = df["order_date"].dt.to_period("M")
print(df.groupby("month")["amount"].sum())

# Payment mode split by category
print(pd.crosstab(df["category"], df["payment_mode"]))

These three commands alone can surface where your revenue concentrates, whether sales are growing, and how customers pay across categories.

A reusable EDA checklist

Run through this on every new dataset:

shape, info, head — size and structure
describe — numeric and object summaries
value_counts — category distributions and dirty values
isnull().sum() — missing data and a strategy
IQR check — outliers, investigated not deleted
corr — numeric relationships
groupby / crosstab — answer real business questions

Interview pointers

Explain why EDA comes before modelling or dashboarding.
Know the difference between describe() for numeric vs object columns.
Be ready to describe at least one outlier-detection method (IQR).
Stress that you investigate anomalies before deleting them, and that correlation is not causation.

EDA is where good analysts earn their reputation. The numbers on a dashboard are only as trustworthy as the exploration behind them. Make this workflow a habit and you will catch problems before your stakeholders do.

Don't just read. Prove your skill on DevWithData.

Shashikant

· Founder, DevWithData

Data professional and Power BI instructor. Building DevWithData to help analysts prove their skills, not just collect certificates.

View Profile LinkedIn YouTube

Reading is not enough. Prove your skill.

DevWithData measures your actual ability with the Data Readiness Index. Stop reading — start practicing.

Continue Learning

pandas for Data Analysis: A Getting-Started Guide

pandas is the workhorse of data analysis in Python, and learning it well is the fastest way to move beyond Excel. This beginner guide covers Series and DataFrames, reading a CSV with read_csv, inspecting your data with head, info and describe, selecting rows and columns, filtering, and basic grouping. Hands-on examples use Indian sales data so you can follow along.

8 min read

matplotlib & seaborn Basics for Data Analysts

Charts turn analysis into insight, and matplotlib plus seaborn are the duo every Indian data analyst should know. This guide covers the core charts - line, bar, histogram, scatter, box - and exactly when to use each, plus how seaborn builds on matplotlib for cleaner statistical plots. Examples visualize Zomato monthly revenue and cab fares in rupees. Learn clean styling, labeling, and the figure/axes model that interviewers expect.

9 min read

The Data Analyst Resume Guide (with Examples)

Your resume has about six seconds to survive a recruiter's screen and an ATS filter before anyone reads it properly. This guide shows Indian data analysts how to build a resume that passes both: a tools-first skills section, quantified bullet points with real impact, a projects section that proves your skills, and ATS-friendly formatting. Includes before/after bullet examples and a clean section-by-section template you can copy.

9 min read