Exploratory Data Analysis (EDA) Workflow with pandas
Exploratory Data Analysis (EDA) Workflow with pandas
You have loaded your data. Now what? Jumping straight into charts or a machine-learning model is how analysts get burned by silent data-quality problems. Exploratory Data Analysis, or EDA, is the disciplined step where you get to know your dataset first.
This guide gives you a repeatable pandas EDA workflow you can run on almost any dataset. We will use a fictional e-commerce orders table from an Indian D2C brand, with columns like order_id, city, category, amount, payment_mode and order_date.
Step 1: Understand the shape and structure
Always start with the big picture: how big is the data and what types are the columns.
import pandas as pd
df = pd.read_csv("orders.csv", parse_dates=["order_date"])
print(df.shape) # (rows, columns)
print(df.info()) # dtypes, non-null counts, memory
print(df.head()) # eyeball the first rows
print(df.columns.tolist())
df.info() is the single most useful first command. It tells you the row count, every column's data type, and how many non-null values each column has — which is your first hint about missing data. If amount shows up as object instead of float, you probably have stray text or commas to clean.
Step 2: Summary statistics with describe
describe() gives you the statistical shape of your numeric columns in one line.
print(df.describe())
You get count, mean, standard deviation, min, the quartiles, and max. Read it carefully:
- If
minforamountis negative, you have refunds or data errors. - If
maxis wildly larger than the 75th percentile, you have outliers (more on that below). - If
countis lower than your row count, that column has missing values.
For text columns, add include="object" to see counts, unique values, and the most frequent value.
print(df.describe(include="object"))
Step 3: Explore categories with value_counts
For every categorical column, value_counts() shows you the distribution. This catches dirty data instantly.
print(df["city"].value_counts())
print(df["payment_mode"].value_counts(normalize=True)) # as proportions
In real Indian datasets you will often see "Bengaluru", "Bangalore", and "bangalore " as three separate values — the same city written three ways. value_counts() exposes this so you can standardise. normalize=True is great for quick insights, like "62% of orders are UPI, 28% COD."
For high-cardinality columns, look at just the top values.
print(df["category"].value_counts().head(10))
Step 4: Find missing values
Missing data quietly distorts every metric. Quantify it before you do anything else.
missing = df.isnull().sum()
missing_pct = (df.isnull().mean() * 100).round(2)
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary[summary["missing"] > 0])
Now decide a strategy per column:
- Drop rows if very few are missing and they are not critical:
df.dropna(subset=["amount"]). - Fill with a sensible value:
df["city"] = df["city"].fillna("Unknown"). - Impute numerics with median (robust to outliers):
df["amount"].fillna(df["amount"].median()).
Never silently drop large chunks of data without understanding why it is missing.
Step 5: Detect outliers
Outliers can be genuine (a bulk corporate order) or errors (a misplaced decimal). The IQR method is a simple, defensible approach.
q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1
low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr
outliers = df[(df["amount"] < low) | (df["amount"] > high)]
print(f"Found {len(outliers)} outlier orders")
Investigate before deleting. A ₹2,00,000 order might be a real wedding-season bulk purchase, not a typo.
Step 6: Examine relationships with correlations
For numeric columns, corr() shows how variables move together.
print(df.corr(numeric_only=True))
Values range from -1 to +1. A correlation near +1 means two columns rise together; near -1 means one rises as the other falls; near 0 means little linear relationship. For example, quantity and amount will usually correlate strongly. Remember the classic caution: correlation is not causation.
Step 7: Group and slice for insight
EDA is not only statistics — it is asking simple business questions.
# Average order value by city
print(df.groupby("city")["amount"].mean().sort_values(ascending=False))
# Monthly revenue trend
df["month"] = df["order_date"].dt.to_period("M")
print(df.groupby("month")["amount"].sum())
# Payment mode split by category
print(pd.crosstab(df["category"], df["payment_mode"]))
These three commands alone can surface where your revenue concentrates, whether sales are growing, and how customers pay across categories.
A reusable EDA checklist
Run through this on every new dataset:
shape,info,head— size and structuredescribe— numeric and object summariesvalue_counts— category distributions and dirty valuesisnull().sum()— missing data and a strategy- IQR check — outliers, investigated not deleted
corr— numeric relationshipsgroupby/crosstab— answer real business questions
Interview pointers
- Explain why EDA comes before modelling or dashboarding.
- Know the difference between
describe()for numeric vs object columns. - Be ready to describe at least one outlier-detection method (IQR).
- Stress that you investigate anomalies before deleting them, and that correlation is not causation.
EDA is where good analysts earn their reputation. The numbers on a dashboard are only as trustworthy as the exploration behind them. Make this workflow a habit and you will catch problems before your stakeholders do.
Related: matplotlib & seaborn Basics for Data Analysts · Practise Python
Don't just read. Prove your skill on DevWithData.
Shashikant
· Founder, DevWithDataData professional and Power BI instructor. Building DevWithData to help analysts prove their skills, not just collect certificates.
Reading is not enough. Prove your skill.
DevWithData measures your actual ability with the Data Readiness Index. Stop reading — start practicing.
Continue Learning
pandas for Data Analysis: A Getting-Started Guide
pandas is the workhorse of data analysis in Python, and learning it well is the fastest way to move beyond Excel. This beginner guide covers Series and DataFrames, reading a CSV with read_csv, inspecting your data with head, info and describe, selecting rows and columns, filtering, and basic grouping. Hands-on examples use Indian sales data so you can follow along.
8 min readmatplotlib & seaborn Basics for Data Analysts
Charts turn analysis into insight, and matplotlib plus seaborn are the duo every Indian data analyst should know. This guide covers the core charts - line, bar, histogram, scatter, box - and exactly when to use each, plus how seaborn builds on matplotlib for cleaner statistical plots. Examples visualize Zomato monthly revenue and cab fares in rupees. Learn clean styling, labeling, and the figure/axes model that interviewers expect.
9 min readThe Data Analyst Resume Guide (with Examples)
Your resume has about six seconds to survive a recruiter's screen and an ATS filter before anyone reads it properly. This guide shows Indian data analysts how to build a resume that passes both: a tools-first skills section, quantified bullet points with real impact, a projects section that proves your skills, and ATS-friendly formatting. Includes before/after bullet examples and a clean section-by-section template you can copy.
9 min read