pandas for Data Analysis: A Getting-Started Guide
pandas for Data Analysis: A Getting-Started Guide
If you are an early-career data analyst in India and you already know Excel and a bit of SQL, pandas is your next big leap. It is the Python library that powers most data analysis work — cleaning, transforming, aggregating, and exploring data far faster than spreadsheets, and on datasets too large for Excel to open. This guide gets you from zero to genuinely useful, using sales data of the kind you would handle at a Flipkart seller account or a kirana store going digital.
Installing and Importing
If you have Python installed, get pandas with:
pip install pandas
By universal convention, import it as pd:
import pandas as pd
Every pandas tutorial, Stack Overflow answer, and codebase uses pd, so stick with it. If you would rather not install anything yet, open a free Google Colab notebook — pandas is already installed there, and you can run every snippet below in your browser.
The Two Core Objects: Series and DataFrame
pandas has two data structures you must understand, and almost everything else builds on them.
A Series is a one-dimensional labelled array — think of a single column.
sales = pd.Series([1200, 3400, 850, 2100], name="amount")
print(sales)
# 0 1200
# 1 3400
# 2 850
# 3 2100
# Name: amount, dtype: int64
The numbers on the left (0, 1, 2, 3) are the index — labels for each value. The dtype (here int64) tells you what kind of data the Series holds, which matters a lot once you start filtering and aggregating.
A DataFrame is a two-dimensional table — multiple columns sharing one index. This is what you will use 95% of the time. You can build one from a dictionary:
df = pd.DataFrame({
"order_id": [101, 102, 103, 104],
"city": ["Mumbai", "Pune", "Mumbai", "Delhi"],
"amount": [1200, 3400, 850, 2100],
"payment": ["UPI", "Card", "COD", "UPI"]
})
Each column is really a Series; a DataFrame is a collection of Series that line up on the same index. When you pull out one column you get a Series back, and when you keep several columns you get a smaller DataFrame. Holding that picture in your head makes the rest of pandas click.
Reading Real Data with read_csv
In practice your data lives in files, usually a CSV exported from a billing app, a marketplace dashboard, or your own SQL query. read_csv is the function you will run most:
df = pd.read_csv("flipkart_orders.csv")
Real CSVs are messy, so a few options earn their keep on almost every project:
df = pd.read_csv(
"flipkart_orders.csv",
parse_dates=["order_date"], # treat this column as real dates
dtype={"pincode": str}, # keep pincodes as text, not numbers
thousands=",", # parse "1,200" as 1200
na_values=["NA", "-", ""] # treat these strings as missing
)
Parsing dates and keeping identifiers like pincodes as strings up front saves hours of cleanup later — a pincode read as a number loses any leading zero and cannot be joined cleanly. pandas also reads Excel (read_excel), JSON (read_json), and SQL tables (read_sql) with the same pattern, so the skill transfers directly to whatever format your data arrives in.
Inspecting Your Data
Never start analysis blind. Four commands tell you almost everything about a new dataset.
head() shows the first rows (default 5), and tail() the last:
df.head()
df.head(10) # first 10
df.tail() # last rows
info() reveals column names, non-null counts, and data types — your first stop for spotting missing values and wrong types:
df.info()
# RangeIndex: 50000 entries
# Column Non-Null Count Dtype
# order_id 50000 non-null int64
# city 49980 non-null object
# amount 50000 non-null float64
# order_date 50000 non-null datetime64[ns]
Here city has 49,980 non-null values out of 50,000 — twenty are missing. object usually means text.
describe() gives summary statistics for numeric columns — count, mean, standard deviation, min, the quartiles, and max:
df.describe()
And dtypes is a quick way to confirm every column was read as the type you expected:
df.dtypes
# order_id int64
# city object
# amount float64
# order_date datetime64[ns]
Also handy: df.shape (rows, columns) and df.columns (column names). Run these every single time you open a new file — it is the cheapest insurance against silly mistakes downstream.
Selecting Columns and Rows
Pick one column (returns a Series):
df["amount"]
Pick multiple columns (returns a DataFrame — note the double brackets):
df[["city", "amount"]]
Select rows by label with .loc and by position with .iloc:
df.loc[0] # the row with index label 0
df.loc[0, "amount"] # one cell
df.iloc[0] # the first row by position
df.iloc[0:5] # first five rows
df.loc[0:5, ["city", "amount"]] # rows and chosen columns
The .loc vs .iloc distinction confuses beginners: .loc uses your index labels, .iloc uses integer positions. With a default 0, 1, 2 index they look the same, but they diverge the moment you filter rows or set a custom index, so it is worth learning both now.
Filtering Rows with Boolean Masks
This is where pandas shines. You build a boolean condition — a Series of True/False values called a mask — and put it inside the brackets:
# Orders above Rs 2000
big_orders = df[df["amount"] > 2000]
# UPI orders in Mumbai (combine with & and wrap each condition in parentheses)
mumbai_upi = df[(df["city"] == "Mumbai") & (df["payment"] == "UPI")]
# Orders from a list of cities
metros = df[df["city"].isin(["Mumbai", "Delhi", "Bengaluru"])]
Two beginner traps: you must use &, |, ~ (not and, or, not), and each condition needs its own parentheses because of operator precedence. Get these two right and filtering becomes second nature.
Creating and Transforming Columns
Add a derived column with simple assignment:
df["amount_with_gst"] = df["amount"] * 1.18 # add 18% GST
df["is_high_value"] = df["amount"] > 2000 # boolean flag
For text or conditional transforms, use vectorised string methods or apply:
df["city_upper"] = df["city"].str.upper()
df["tier"] = df["amount"].apply(lambda x: "High" if x > 2000 else "Low")
Prefer vectorised operations (df["amount"] * 1.18) over apply with a Python function when you can — they run far faster on big data because the work happens in optimised C under the hood.
Handling Missing Values
Real datasets always have gaps — a blank city, a missing amount, a customer who never entered a pincode. pandas marks these as NaN. First, find them:
df.isna().sum() # count missing values per column
Then decide what to do. Drop rows that are missing critical fields, or fill gaps with a sensible default:
df = df.dropna(subset=["amount"]) # drop rows with no amount
df["city"] = df["city"].fillna("Unknown") # fill missing city labels
df["discount"] = df["discount"].fillna(0) # assume no discount
There is no single right answer — dropping is safer when the missing value would distort a total, while filling keeps rows you still need. The key habit is to look at the gaps deliberately rather than letting them silently break a later calculation.
Grouping and Aggregating
groupby is the pandas equivalent of SQL's GROUP BY, and it is the single most useful tool for turning raw orders into business answers. Total revenue per city:
df.groupby("city")["amount"].sum()
# city
# Delhi 2100
# Mumbai 2050
# Pune 3400
# Name: amount, dtype: int64
Several aggregates at once — total, average, and order count per city:
df.groupby("city")["amount"].agg(["sum", "mean", "count"])
Group by two columns, say city and payment mode, to see how UPI adoption varies across metros:
df.groupby(["city", "payment"])["amount"].sum()
This single line answers questions that would take many pivot-table clicks in Excel, and it scales to millions of rows without slowing down.
Sorting and Counting
Once you have a result, you usually want it ordered. sort_values ranks rows by a column:
# Biggest orders first
df.sort_values("amount", ascending=False).head(10)
# Top cities by revenue
city_revenue = df.groupby("city")["amount"].sum()
city_revenue.sort_values(ascending=False)
And value_counts is the fastest way to see how often each category appears — perfect for a kirana store checking which payment mode customers prefer:
df["payment"].value_counts()
# UPI 2
# Card 1
# COD 1
# Name: count, dtype: int64
df["payment"].value_counts(normalize=True) # as a share, e.g. 0.5 = 50%
That normalize=True turns raw counts into proportions, so you can report "half of all orders were paid by UPI" instead of just the count.
A pandas vs Excel / SQL Mental Model
If you come from Excel or SQL, you already know the concepts — pandas just renames them:
- A DataFrame is your worksheet or your SQL table. A column is a Series.
- Filtering rows (
df[df["amount"] > 2000]) is AutoFilter in Excel, orWHERE amount > 2000in SQL. - A derived column (
df["gst"] = df["amount"] * 0.18) is a formula dragged down a column, or a computed field inSELECT. groupby(...).sum()is a pivot table, orGROUP BY ... SUM(...)in SQL.merge(not covered here) isVLOOKUP, or a SQLJOIN.
The big differences are scale and repeatability. Excel struggles past a few hundred thousand rows and breaks the moment you reopen the file with new data; a pandas script runs the same steps on a fresh CSV in seconds, every time. Compared with SQL, pandas keeps the whole pipeline — load, clean, transform, chart — in one place, so you are not bouncing between a database and a spreadsheet. Think of pandas as Excel's power plus SQL's logic, written as code you can re-run and trust.
Best Practices for Beginners
- Always run
head(),info(), anddtypesbefore anything else on a new file. - Parse dates and fix dtypes at
read_csvtime, not later. - Check missing values with
df.isna().sum()early, and handle them on purpose. - Use vectorised operations over
apply/loops for speed. - Wrap each filter condition in parentheses and use
&/|, notand/or. - Avoid chained assignment like
df[df.x > 1]["y"] = 0; it triggers warnings and silent bugs. Use.locinstead.
FAQ
Is pandas hard to learn?
Not especially. If you can already build formulas and pivot tables in Excel, the ideas in pandas will feel familiar — you are mostly learning new names and a code-based way to express them. The trickiest early hurdles are .loc vs .iloc and boolean filtering syntax, and both click within a week of daily practice. Most learners are productive in two to three weeks of consistent effort.
Do data analysts need Python?
For many entry-level roles in India, strong Excel and SQL are enough to get hired. But Python with pandas is increasingly expected and is the clearest way to stand out, automate repetitive work, and handle datasets that crash a spreadsheet. If you want to grow toward senior analyst, analytics-engineer, or data-science roles, learning pandas now is one of the highest-return things you can do.
pandas vs Excel — which should I use?
Use whichever fits the job. Excel is unbeatable for quick ad-hoc looks, small datasets, and sharing a sheet with non-technical colleagues. Reach for pandas when the data is large, when you will repeat the same cleaning steps regularly, or when you need an auditable, re-runnable pipeline. In practice analysts use both — explore in Excel, then move serious or recurring work into pandas.
Is pandas the same as Python?
No. Python is the programming language; pandas is a library you install on top of it for working with tabular data. You write ordinary Python and call pandas functions through the pd alias. You need only basic Python — variables, lists, and functions — to be effective with pandas.
How long does it take to learn pandas?
With focused daily practice on real datasets, most beginners reach a comfortable working level in three to four weeks: loading data, inspecting it, filtering, deriving columns, and grouping. Mastery of the trickier corners — multi-index frames, time series, performance tuning — takes longer, but you do not need any of that to start delivering useful analysis.
That is the core toolkit. With Series, DataFrames, read_csv, the inspection commands, selection, filtering, missing-value handling, groupby, sort_values, and value_counts, you can already replace most of what you did in Excel — and tackle data ten times the size.
Related: pandas groupby Explained (with Real Examples) · Practise Python
Don't just read. Prove your skill on DevWithData.
Shashikant
· Founder, DevWithDataData professional and Power BI instructor. Building DevWithData to help analysts prove their skills, not just collect certificates.
Reading is not enough. Prove your skill.
DevWithData measures your actual ability with the Data Readiness Index. Stop reading — start practicing.
Continue Learning
Data Analyst Portfolio Projects Using Indian Datasets
Recruiters in India see hundreds of resumes listing the same skills. A portfolio of real projects on relatable Indian datasets is what makes you memorable. This guide gives you three end-to-end project ideas built around kirana sales, UPI transactions and cab demand, each with the business question, the data, the SQL/Power BI/Python steps, and the insights to highlight, so you can build a portfolio that actually gets you shortlisted.
9 min read12 SQL Interview Puzzles with Solutions
Twelve SQL puzzles that show up again and again in Indian data analyst interviews, each with a clean, worked solution and the reasoning behind it. Covers duplicates, second-highest values, running totals, gaps and islands, self-joins, top-N per group, and more. Built on relatable e-commerce and HR tables so you can practice the patterns, not just memorise answers.
8 min readThe Second-Highest Salary SQL Problem (5 Ways to Solve It)
The second-highest salary question is asked in almost every entry-level SQL interview in India. This post solves it five different ways: subquery with MAX, LIMIT/OFFSET, DENSE_RANK, a correlated subquery, and a CTE. Each approach is explained with its trade-offs around ties, NULLs, and what happens when there is no second salary, so you can pick the right one under pressure.
8 min read