Home|Blog|Exploratory Data Analysis (EDA) Workflow with pandas
Sign in →
python
pandas
eda
data-analysis
interview-prep

Exploratory Data Analysis (EDA) Workflow with pandas

By Shashikant·29 June 2026·4 min read

Exploratory Data Analysis (EDA) Workflow with pandas

You have loaded your data. Now what? Jumping straight into charts or a machine-learning model is how analysts get burned by silent data-quality problems. Exploratory Data Analysis, or EDA, is the disciplined step where you get to know your dataset first.

This guide gives you a repeatable pandas EDA workflow you can run on almost any dataset. We will use a fictional e-commerce orders table from an Indian D2C brand, with columns like order_id, city, category, amount, payment_mode and order_date.

Step 1: Understand the shape and structure

Always start with the big picture: how big is the data and what types are the columns.

import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

print(df.shape)        # (rows, columns)
print(df.info())       # dtypes, non-null counts, memory
print(df.head())       # eyeball the first rows
print(df.columns.tolist())

df.info() is the single most useful first command. It tells you the row count, every column's data type, and how many non-null values each column has — which is your first hint about missing data. If amount shows up as object instead of float, you probably have stray text or commas to clean.

Step 2: Summary statistics with describe

describe() gives you the statistical shape of your numeric columns in one line.

print(df.describe())

You get count, mean, standard deviation, min, the quartiles, and max. Read it carefully:

  • If min for amount is negative, you have refunds or data errors.
  • If max is wildly larger than the 75th percentile, you have outliers (more on that below).
  • If count is lower than your row count, that column has missing values.

For text columns, add include="object" to see counts, unique values, and the most frequent value.

print(df.describe(include="object"))

Step 3: Explore categories with value_counts

For every categorical column, value_counts() shows you the distribution. This catches dirty data instantly.

print(df["city"].value_counts())
print(df["payment_mode"].value_counts(normalize=True))  # as proportions

In real Indian datasets you will often see "Bengaluru", "Bangalore", and "bangalore " as three separate values — the same city written three ways. value_counts() exposes this so you can standardise. normalize=True is great for quick insights, like "62% of orders are UPI, 28% COD."

For high-cardinality columns, look at just the top values.

print(df["category"].value_counts().head(10))

Step 4: Find missing values

Missing data quietly distorts every metric. Quantify it before you do anything else.

missing = df.isnull().sum()
missing_pct = (df.isnull().mean() * 100).round(2)

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary[summary["missing"] > 0])

Now decide a strategy per column:

  • Drop rows if very few are missing and they are not critical: df.dropna(subset=["amount"]).
  • Fill with a sensible value: df["city"] = df["city"].fillna("Unknown").
  • Impute numerics with median (robust to outliers): df["amount"].fillna(df["amount"].median()).

Never silently drop large chunks of data without understanding why it is missing.

Step 5: Detect outliers

Outliers can be genuine (a bulk corporate order) or errors (a misplaced decimal). The IQR method is a simple, defensible approach.

q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1

low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr

outliers = df[(df["amount"] < low) | (df["amount"] > high)]
print(f"Found {len(outliers)} outlier orders")

Investigate before deleting. A ₹2,00,000 order might be a real wedding-season bulk purchase, not a typo.

Step 6: Examine relationships with correlations

For numeric columns, corr() shows how variables move together.

print(df.corr(numeric_only=True))

Values range from -1 to +1. A correlation near +1 means two columns rise together; near -1 means one rises as the other falls; near 0 means little linear relationship. For example, quantity and amount will usually correlate strongly. Remember the classic caution: correlation is not causation.

Step 7: Group and slice for insight

EDA is not only statistics — it is asking simple business questions.

# Average order value by city
print(df.groupby("city")["amount"].mean().sort_values(ascending=False))

# Monthly revenue trend
df["month"] = df["order_date"].dt.to_period("M")
print(df.groupby("month")["amount"].sum())

# Payment mode split by category
print(pd.crosstab(df["category"], df["payment_mode"]))

These three commands alone can surface where your revenue concentrates, whether sales are growing, and how customers pay across categories.

A reusable EDA checklist

Run through this on every new dataset:

  1. shape, info, head — size and structure
  2. describe — numeric and object summaries
  3. value_counts — category distributions and dirty values
  4. isnull().sum() — missing data and a strategy
  5. IQR check — outliers, investigated not deleted
  6. corr — numeric relationships
  7. groupby / crosstab — answer real business questions

Interview pointers

  • Explain why EDA comes before modelling or dashboarding.
  • Know the difference between describe() for numeric vs object columns.
  • Be ready to describe at least one outlier-detection method (IQR).
  • Stress that you investigate anomalies before deleting them, and that correlation is not causation.

EDA is where good analysts earn their reputation. The numbers on a dashboard are only as trustworthy as the exploration behind them. Make this workflow a habit and you will catch problems before your stakeholders do.

Related: matplotlib & seaborn Basics for Data Analysts · Practise Python

Don't just read. Prove your skill on DevWithData.

DevWithData

Shashikant

· Founder, DevWithData

Data professional and Power BI instructor. Building DevWithData to help analysts prove their skills, not just collect certificates.

Reading is not enough. Prove your skill.

DevWithData measures your actual ability with the Data Readiness Index. Stop reading — start practicing.

Continue Learning