Home|Blog|pandas for Data Analysis: A Getting-Started Guide
Sign in →
python
pandas
data-analysis
beginner
dataframe

pandas for Data Analysis: A Getting-Started Guide

By Shashikant·27 June 2026·9 min read

pandas for Data Analysis: A Getting-Started Guide

If you are an early-career data analyst in India and you already know Excel and a bit of SQL, pandas is your next big leap. It is the Python library that powers most data analysis work — cleaning, transforming, aggregating, and exploring data far faster than spreadsheets, and on datasets too large for Excel to open. This guide gets you from zero to genuinely useful, using sales data of the kind you would handle at a Flipkart seller account or a kirana store going digital.

Installing and Importing

If you have Python installed, get pandas with:

pip install pandas

By universal convention, import it as pd:

import pandas as pd

Every pandas tutorial, Stack Overflow answer, and codebase uses pd, so stick with it. If you would rather not install anything yet, open a free Google Colab notebook — pandas is already installed there, and you can run every snippet below in your browser.

The Two Core Objects: Series and DataFrame

pandas has two data structures you must understand, and almost everything else builds on them.

A Series is a one-dimensional labelled array — think of a single column.

sales = pd.Series([1200, 3400, 850, 2100], name="amount")
print(sales)
# 0    1200
# 1    3400
# 2     850
# 3    2100
# Name: amount, dtype: int64

The numbers on the left (0, 1, 2, 3) are the index — labels for each value. The dtype (here int64) tells you what kind of data the Series holds, which matters a lot once you start filtering and aggregating.

A DataFrame is a two-dimensional table — multiple columns sharing one index. This is what you will use 95% of the time. You can build one from a dictionary:

df = pd.DataFrame({
    "order_id":  [101, 102, 103, 104],
    "city":      ["Mumbai", "Pune", "Mumbai", "Delhi"],
    "amount":    [1200, 3400, 850, 2100],
    "payment":   ["UPI", "Card", "COD", "UPI"]
})

Each column is really a Series; a DataFrame is a collection of Series that line up on the same index. When you pull out one column you get a Series back, and when you keep several columns you get a smaller DataFrame. Holding that picture in your head makes the rest of pandas click.

Reading Real Data with read_csv

In practice your data lives in files, usually a CSV exported from a billing app, a marketplace dashboard, or your own SQL query. read_csv is the function you will run most:

df = pd.read_csv("flipkart_orders.csv")

Real CSVs are messy, so a few options earn their keep on almost every project:

df = pd.read_csv(
    "flipkart_orders.csv",
    parse_dates=["order_date"],   # treat this column as real dates
    dtype={"pincode": str},       # keep pincodes as text, not numbers
    thousands=",",                # parse "1,200" as 1200
    na_values=["NA", "-", ""]     # treat these strings as missing
)

Parsing dates and keeping identifiers like pincodes as strings up front saves hours of cleanup later — a pincode read as a number loses any leading zero and cannot be joined cleanly. pandas also reads Excel (read_excel), JSON (read_json), and SQL tables (read_sql) with the same pattern, so the skill transfers directly to whatever format your data arrives in.

Inspecting Your Data

Never start analysis blind. Four commands tell you almost everything about a new dataset.

head() shows the first rows (default 5), and tail() the last:

df.head()
df.head(10)   # first 10
df.tail()     # last rows

info() reveals column names, non-null counts, and data types — your first stop for spotting missing values and wrong types:

df.info()
# RangeIndex: 50000 entries
# Column      Non-Null Count   Dtype
# order_id    50000 non-null   int64
# city        49980 non-null   object
# amount      50000 non-null   float64
# order_date  50000 non-null   datetime64[ns]

Here city has 49,980 non-null values out of 50,000 — twenty are missing. object usually means text.

describe() gives summary statistics for numeric columns — count, mean, standard deviation, min, the quartiles, and max:

df.describe()

And dtypes is a quick way to confirm every column was read as the type you expected:

df.dtypes
# order_id              int64
# city                 object
# amount              float64
# order_date   datetime64[ns]

Also handy: df.shape (rows, columns) and df.columns (column names). Run these every single time you open a new file — it is the cheapest insurance against silly mistakes downstream.

Selecting Columns and Rows

Pick one column (returns a Series):

df["amount"]

Pick multiple columns (returns a DataFrame — note the double brackets):

df[["city", "amount"]]

Select rows by label with .loc and by position with .iloc:

df.loc[0]                        # the row with index label 0
df.loc[0, "amount"]              # one cell
df.iloc[0]                       # the first row by position
df.iloc[0:5]                     # first five rows
df.loc[0:5, ["city", "amount"]]  # rows and chosen columns

The .loc vs .iloc distinction confuses beginners: .loc uses your index labels, .iloc uses integer positions. With a default 0, 1, 2 index they look the same, but they diverge the moment you filter rows or set a custom index, so it is worth learning both now.

Filtering Rows with Boolean Masks

This is where pandas shines. You build a boolean condition — a Series of True/False values called a mask — and put it inside the brackets:

# Orders above Rs 2000
big_orders = df[df["amount"] > 2000]

# UPI orders in Mumbai (combine with & and wrap each condition in parentheses)
mumbai_upi = df[(df["city"] == "Mumbai") & (df["payment"] == "UPI")]

# Orders from a list of cities
metros = df[df["city"].isin(["Mumbai", "Delhi", "Bengaluru"])]

Two beginner traps: you must use &, |, ~ (not and, or, not), and each condition needs its own parentheses because of operator precedence. Get these two right and filtering becomes second nature.

Creating and Transforming Columns

Add a derived column with simple assignment:

df["amount_with_gst"] = df["amount"] * 1.18      # add 18% GST
df["is_high_value"] = df["amount"] > 2000        # boolean flag

For text or conditional transforms, use vectorised string methods or apply:

df["city_upper"] = df["city"].str.upper()
df["tier"] = df["amount"].apply(lambda x: "High" if x > 2000 else "Low")

Prefer vectorised operations (df["amount"] * 1.18) over apply with a Python function when you can — they run far faster on big data because the work happens in optimised C under the hood.

Handling Missing Values

Real datasets always have gaps — a blank city, a missing amount, a customer who never entered a pincode. pandas marks these as NaN. First, find them:

df.isna().sum()    # count missing values per column

Then decide what to do. Drop rows that are missing critical fields, or fill gaps with a sensible default:

df = df.dropna(subset=["amount"])         # drop rows with no amount
df["city"] = df["city"].fillna("Unknown") # fill missing city labels
df["discount"] = df["discount"].fillna(0) # assume no discount

There is no single right answer — dropping is safer when the missing value would distort a total, while filling keeps rows you still need. The key habit is to look at the gaps deliberately rather than letting them silently break a later calculation.

Grouping and Aggregating

groupby is the pandas equivalent of SQL's GROUP BY, and it is the single most useful tool for turning raw orders into business answers. Total revenue per city:

df.groupby("city")["amount"].sum()
# city
# Delhi      2100
# Mumbai     2050
# Pune       3400
# Name: amount, dtype: int64

Several aggregates at once — total, average, and order count per city:

df.groupby("city")["amount"].agg(["sum", "mean", "count"])

Group by two columns, say city and payment mode, to see how UPI adoption varies across metros:

df.groupby(["city", "payment"])["amount"].sum()

This single line answers questions that would take many pivot-table clicks in Excel, and it scales to millions of rows without slowing down.

Sorting and Counting

Once you have a result, you usually want it ordered. sort_values ranks rows by a column:

# Biggest orders first
df.sort_values("amount", ascending=False).head(10)

# Top cities by revenue
city_revenue = df.groupby("city")["amount"].sum()
city_revenue.sort_values(ascending=False)

And value_counts is the fastest way to see how often each category appears — perfect for a kirana store checking which payment mode customers prefer:

df["payment"].value_counts()
# UPI     2
# Card    1
# COD     1
# Name: count, dtype: int64

df["payment"].value_counts(normalize=True)   # as a share, e.g. 0.5 = 50%

That normalize=True turns raw counts into proportions, so you can report "half of all orders were paid by UPI" instead of just the count.

A pandas vs Excel / SQL Mental Model

If you come from Excel or SQL, you already know the concepts — pandas just renames them:

  • A DataFrame is your worksheet or your SQL table. A column is a Series.
  • Filtering rows (df[df["amount"] > 2000]) is AutoFilter in Excel, or WHERE amount > 2000 in SQL.
  • A derived column (df["gst"] = df["amount"] * 0.18) is a formula dragged down a column, or a computed field in SELECT.
  • groupby(...).sum() is a pivot table, or GROUP BY ... SUM(...) in SQL.
  • merge (not covered here) is VLOOKUP, or a SQL JOIN.

The big differences are scale and repeatability. Excel struggles past a few hundred thousand rows and breaks the moment you reopen the file with new data; a pandas script runs the same steps on a fresh CSV in seconds, every time. Compared with SQL, pandas keeps the whole pipeline — load, clean, transform, chart — in one place, so you are not bouncing between a database and a spreadsheet. Think of pandas as Excel's power plus SQL's logic, written as code you can re-run and trust.

Best Practices for Beginners

  • Always run head(), info(), and dtypes before anything else on a new file.
  • Parse dates and fix dtypes at read_csv time, not later.
  • Check missing values with df.isna().sum() early, and handle them on purpose.
  • Use vectorised operations over apply/loops for speed.
  • Wrap each filter condition in parentheses and use &/|, not and/or.
  • Avoid chained assignment like df[df.x > 1]["y"] = 0; it triggers warnings and silent bugs. Use .loc instead.

FAQ

Is pandas hard to learn?
Not especially. If you can already build formulas and pivot tables in Excel, the ideas in pandas will feel familiar — you are mostly learning new names and a code-based way to express them. The trickiest early hurdles are .loc vs .iloc and boolean filtering syntax, and both click within a week of daily practice. Most learners are productive in two to three weeks of consistent effort.

Do data analysts need Python?
For many entry-level roles in India, strong Excel and SQL are enough to get hired. But Python with pandas is increasingly expected and is the clearest way to stand out, automate repetitive work, and handle datasets that crash a spreadsheet. If you want to grow toward senior analyst, analytics-engineer, or data-science roles, learning pandas now is one of the highest-return things you can do.

pandas vs Excel — which should I use?
Use whichever fits the job. Excel is unbeatable for quick ad-hoc looks, small datasets, and sharing a sheet with non-technical colleagues. Reach for pandas when the data is large, when you will repeat the same cleaning steps regularly, or when you need an auditable, re-runnable pipeline. In practice analysts use both — explore in Excel, then move serious or recurring work into pandas.

Is pandas the same as Python?
No. Python is the programming language; pandas is a library you install on top of it for working with tabular data. You write ordinary Python and call pandas functions through the pd alias. You need only basic Python — variables, lists, and functions — to be effective with pandas.

How long does it take to learn pandas?
With focused daily practice on real datasets, most beginners reach a comfortable working level in three to four weeks: loading data, inspecting it, filtering, deriving columns, and grouping. Mastery of the trickier corners — multi-index frames, time series, performance tuning — takes longer, but you do not need any of that to start delivering useful analysis.

That is the core toolkit. With Series, DataFrames, read_csv, the inspection commands, selection, filtering, missing-value handling, groupby, sort_values, and value_counts, you can already replace most of what you did in Excel — and tackle data ten times the size.

Related: pandas groupby Explained (with Real Examples) · Practise Python

Don't just read. Prove your skill on DevWithData.

DevWithData

Shashikant

· Founder, DevWithData

Data professional and Power BI instructor. Building DevWithData to help analysts prove their skills, not just collect certificates.

Reading is not enough. Prove your skill.

DevWithData measures your actual ability with the Data Readiness Index. Stop reading — start practicing.

Continue Learning