Pandas remains the go-to Python library for data manipulation and analysis in 2026 — and for good reason. It turns messy, real-world data into clean, structured DataFrames and Series that are easy to explore, clean, transform, and visualize.
Whether you're cleaning CSV exports, joining datasets, aggregating sales figures, or preparing data for machine learning, Pandas makes these tasks fast and intuitive. Here's a practical, hands-on guide to the core data manipulation techniques you need to know.
1. Installation & Import
pip install pandas
# Optional (for speed on large data): pip install polars
import pandas as pd
# Optional: import polars as pl # faster alternative for big data
2. Loading Data
Pandas reads from almost any source — CSV, Excel, SQL, JSON, Parquet, clipboard, etc.
# CSV (most common)
df = pd.read_csv("sales_data.csv")
# Excel
df = pd.read_excel("report.xlsx", sheet_name="Q4")
# SQL database (con is a SQLAlchemy engine or connection, e.g. from sqlalchemy.create_engine)
df = pd.read_sql("SELECT * FROM orders WHERE year=2026", con=engine)
# Clipboard (great for quick copy-paste)
df = pd.read_clipboard()
Tip 2026: For large files (>1 GB), pass chunksize=10_000 to read_csv, which returns an iterator of DataFrames you can process piece by piece, or switch to Polars: pl.read_csv("large.csv") is often 5–20× faster.
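The chunked-reading tip can be sketched like this; a small in-memory CSV stands in for a large file on disk (in practice you'd pass a path such as "large.csv" instead of the buffer):

```python
import io
import pandas as pd

# In-memory stand-in for a big CSV file (assumption: real code reads a path)
csv_buffer = io.StringIO("order_id,order_total\n1,100\n2,250\n3,400\n4,50\n")

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading everything at once; aggregate each piece as it streams in
total = 0.0
for chunk in pd.read_csv(csv_buffer, chunksize=2):
    total += chunk["order_total"].sum()

print(total)  # 800.0
```

The same pattern works for any reduction (counts, group partials) that can be accumulated chunk by chunk.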
3. Selecting & Filtering Data
Access columns, rows, or subsets efficiently.
# Single column
names = df["customer_name"]
# Multiple columns
subset = df[["customer_name", "order_total", "date"]]
# Rows by condition
high_value = df[df["order_total"] > 1000]
# Multiple conditions
vip_customers = df[(df["order_total"] > 500) & (df["country"] == "USA")]
# Query method (very readable)
recent_vip = df.query("order_date >= '2026-01-01' and order_total > 1000")
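Alongside boolean masks and query(), label- and position-based selection with .loc and .iloc is worth knowing. A minimal sketch on a toy frame (the column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_name": ["Ana", "Bob", "Cal"],
    "order_total": [1200, 300, 800],
})

# .loc selects by label: rows matching a mask, plus specific columns
big = df.loc[df["order_total"] > 500, ["customer_name", "order_total"]]

# .iloc selects by integer position: first two rows, first column
first_two_names = df.iloc[:2, 0]

print(list(big["customer_name"]))  # ['Ana', 'Cal']
print(list(first_two_names))       # ['Ana', 'Bob']
```

.loc is also the idiomatic way to assign into a filtered subset without the SettingWithCopy pitfalls of chained indexing.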
4. Adding & Removing Columns
Create new columns or clean existing ones.
# Add new column
df["tax"] = df["order_total"] * 0.08
# Calculated column
df["full_name"] = df["first_name"] + " " + df["last_name"]
# Drop columns
df = df.drop(columns=["temp_column", "old_field"])
# Rename columns
df = df.rename(columns={"old_name": "new_name"})
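Two common extensions of the column-creation pattern above, sketched on toy data: a conditional column via NumPy's vectorized if/else, and assign(), which builds columns without mutating the original frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"order_total": [1200, 300, 800]})

# Conditional column: np.where is a vectorized if/else over the whole column
df["tier"] = np.where(df["order_total"] > 500, "high", "standard")

# assign() returns a new frame with the extra column; the original is untouched
df2 = df.assign(tax=df["order_total"] * 0.08)

print(list(df["tier"]))  # ['high', 'standard', 'high']
print(list(df2["tax"]))  # [96.0, 24.0, 64.0]
```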
5. Handling Missing Data
Missing values (NaN) are common — Pandas makes them easy to detect and fix.
# Check for missing values
df.isna().sum()
# Drop rows with any missing values
df_clean = df.dropna()
# Fill missing values
df["price"] = df["price"].fillna(df["price"].median())
# Forward fill (carry last valid value)
df["status"] = df["status"].ffill()
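Putting the missing-data steps together on a small synthetic frame (the values are made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, np.nan],
    "status": ["open", None, None, "closed"],
})

# Count missing values across the whole frame
missing = int(df.isna().sum().sum())  # 4

# Fill numeric gaps with the median (of the non-missing values: 10 and 30)
df["price"] = df["price"].fillna(df["price"].median())

# Forward-fill categorical gaps with the last valid value
df["status"] = df["status"].ffill()

print(list(df["price"]))   # [10.0, 20.0, 30.0, 20.0]
print(list(df["status"]))  # ['open', 'open', 'open', 'closed']
```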
6. Grouping & Aggregation
Group by category and compute statistics — core to data analysis.
# Total sales by customer
sales_by_customer = df.groupby("customer_id")["order_total"].sum()
# Multiple aggregations
summary = df.groupby("category").agg({
    "order_total": ["sum", "mean", "count"],
    "quantity": "sum"
})
# Pivot table (Excel-like)
pivot = df.pivot_table(
    values="order_total",
    index="category",
    columns="country",
    aggfunc="sum"
)
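The dict-of-lists style above produces MultiIndex columns. Named aggregation gives flat, readable column names instead; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "order_total": [100, 200, 50],
    "quantity": [1, 2, 5],
})

# Named aggregation: output_name=(input_column, function)
summary = df.groupby("category").agg(
    total_sales=("order_total", "sum"),
    avg_order=("order_total", "mean"),
    units=("quantity", "sum"),
)

print(summary.loc["a", "total_sales"])  # 300
print(summary.loc["a", "avg_order"])    # 150.0
```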
7. Merging & Joining DataFrames
Combine datasets like SQL joins.
# Inner join
merged = customers.merge(orders, on="customer_id", how="inner")
# Left join
all_customers = customers.merge(orders, on="customer_id", how="left")
# Concatenate vertically
combined = pd.concat([df1, df2], ignore_index=True)
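One useful extra when joining: indicator=True adds a _merge column recording where each row came from, which makes it easy to audit a join. A minimal sketch with made-up customers and orders:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Bob", "Cal"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "order_total": [100, 250, 80]})

# Left join keeps every customer; _merge flags unmatched rows as "left_only"
audit = customers.merge(orders, on="customer_id", how="left", indicator=True)

# Customers with no orders at all
no_orders = audit.loc[audit["_merge"] == "left_only", "name"]
print(list(no_orders))  # ['Bob']
```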
Conclusion
Pandas is the Swiss Army knife of data manipulation in Python, giving you powerful, expressive tools to load, clean, transform, aggregate, and visualize data with minimal code. Master these core techniques first; then, for the huge datasets and real-time workloads common in 2026, reach for Polars in performance-critical work.
Once you get comfortable with Pandas, data wrangling becomes fast, intuitive, and even fun — and that's when the real insights start flowing.