Pandas remains the cornerstone library for data manipulation and analysis in Python in 2026. Its DataFrame and Series objects turn messy, real-world data into clean, powerful structures that are easy to explore, clean, transform, aggregate, and visualize.
Whether you’re cleaning CSV exports, joining datasets, summarizing sales, preparing features for machine learning, or building dashboards, Pandas makes these tasks fast and expressive. Here’s a practical, hands-on guide to the core data manipulation techniques every Python data scientist should master.
1. Installation & Import
pip install pandas
# For speed on large data: pip install polars (often much faster on big files)
import pandas as pd
# Optional fast alternative
# import polars as pl
2. Loading Data (Most Common Sources)
Pandas reads from almost any source — CSV, Excel, JSON, Parquet, SQL, clipboard, etc.
# CSV (most common)
df = pd.read_csv("sales_2026.csv")
# Excel (specify sheet)
df = pd.read_excel("report.xlsx", sheet_name="Q4")
# SQL database
from sqlalchemy import create_engine
engine = create_engine("mysql+pymysql://user:pass@localhost/db")
df = pd.read_sql("SELECT * FROM orders WHERE year=2026", engine)
# JSON API response
df = pd.read_json("https://api.example.com/data")
# Clipboard (great for quick copy-paste from Excel)
df = pd.read_clipboard()
Tip 2026: For files >1 GB or speed-critical work, use Polars: pl.read_csv("large.csv") — it’s dramatically faster and more memory-efficient.
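The loaders above assume a file or database on disk. For a quick, self-contained sketch of what read_csv actually gives you, you can feed it an in-memory string via io.StringIO (the column names here are hypothetical toy data):

```python
import io
import pandas as pd

# A toy CSV standing in for a real export (hypothetical columns)
csv_text = """customer_name,order_total,country
Alice,1200,USA
Bob,300,Canada
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (2, 3)
print(list(df.columns))  # ['customer_name', 'order_total', 'country']
```

The same DataFrame methods work no matter which loader produced the frame.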
3. Selecting & Filtering Data
Access columns, rows, or subsets with clean, readable syntax.
# Single column
names = df["customer_name"]
# Multiple columns
subset = df[["customer_name", "order_total", "date"]]
# Rows by condition
high_value = df[df["order_total"] > 1000]
# Multiple conditions
vip_us = df[(df["order_total"] > 500) & (df["country"] == "USA")]
# Query method (very readable SQL-like syntax)
recent_vip = df.query("order_date >= '2026-01-01' and order_total > 1000")
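A boolean mask and an equivalent .query() string select exactly the same rows. A minimal sketch with toy data (the column names are hypothetical) shows the two styles side by side:

```python
import pandas as pd

# Toy orders table (hypothetical data)
df = pd.DataFrame({
    "customer_name": ["Alice", "Bob", "Cara"],
    "order_total": [1200, 300, 800],
    "country": ["USA", "Canada", "USA"],
})

# Boolean mask: combine conditions with & and parentheses
mask_result = df[(df["order_total"] > 500) & (df["country"] == "USA")]

# Same filter expressed as a query string
query_result = df.query("order_total > 500 and country == 'USA'")

print(mask_result["customer_name"].tolist())  # ['Alice', 'Cara']
```

Use whichever reads better for the filter at hand; .query() tends to win once conditions stack up.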
4. Adding, Modifying & Removing Columns
Create new columns or clean existing ones with simple assignment.
# New calculated column
df["tax_amount"] = df["order_total"] * 0.08
# String operation
df["full_name"] = df["first_name"] + " " + df["last_name"]
# Drop columns
df = df.drop(columns=["temp_column", "old_field"])
# Rename columns
df = df.rename(columns={"old_name": "new_name"})
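Putting the four operations above together on a tiny, self-contained frame (toy data, hypothetical column names):

```python
import pandas as pd

# Hypothetical customer table to walk through add / drop / rename
df = pd.DataFrame({
    "first_name": ["Ada", "Linus"],
    "last_name": ["Lovelace", "Torvalds"],
    "order_total": [100.0, 50.0],
    "temp_column": [0, 0],
})

df["tax_amount"] = df["order_total"] * 0.08                 # new calculated column
df["full_name"] = df["first_name"] + " " + df["last_name"]  # string concatenation
df = df.drop(columns=["temp_column"])                       # remove a column
df = df.rename(columns={"order_total": "total"})            # rename a column

print(df.loc[0, "full_name"])  # Ada Lovelace
```

Note that drop and rename return new DataFrames by default, hence the reassignment.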
5. Handling Missing Data (NaN)
Missing values are inevitable — Pandas makes them easy to find and fix.
# Count missing per column
df.isna().sum()
# Drop rows with any missing
df_clean = df.dropna()
# Fill with column median
df["price"] = df["price"].fillna(df["price"].median())
# Forward fill (carry last valid value forward)
df["status"] = df["status"].ffill()
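A small end-to-end sketch of the fixes above, on a toy frame with deliberate gaps (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in both columns
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, np.nan],
    "status": ["open", None, None, "closed"],
})

print(df.isna().sum())  # gaps per column

# Median of the non-missing prices (10 and 30) is 20
df["price"] = df["price"].fillna(df["price"].median())

# Carry the last valid status forward into the gaps
df["status"] = df["status"].ffill()

print(df["price"].tolist())   # [10.0, 20.0, 30.0, 20.0]
print(df["status"].tolist())  # ['open', 'open', 'open', 'closed']
```

Which strategy is right depends on the column: medians for skewed numerics, forward fill for slowly changing state.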
6. Grouping & Aggregation (Core to Analysis)
Group by category and compute summaries — this is where Pandas shines.
# Total sales by customer
sales_by_customer = df.groupby("customer_id")["order_total"].sum()
# Multiple aggregations
summary = df.groupby("category").agg({
    "order_total": ["sum", "mean", "count"],
    "quantity": "sum"
})
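One thing worth knowing: multi-aggregation like this produces MultiIndex (two-level) column labels, which can be awkward downstream. A common follow-up is to flatten them. A sketch with toy data (hypothetical values):

```python
import pandas as pd

# Toy sales data
df = pd.DataFrame({
    "category": ["books", "books", "toys"],
    "order_total": [10.0, 20.0, 5.0],
    "quantity": [1, 2, 1],
})

summary = df.groupby("category").agg({
    "order_total": ["sum", "mean"],
    "quantity": "sum",
})

# Columns are now tuples like ("order_total", "sum"); join them into flat names
summary.columns = ["_".join(col) for col in summary.columns]

print(summary.loc["books", "order_total_sum"])  # 30.0
```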
# Pivot table (Excel-like)
pivot = df.pivot_table(
    values="order_total",
    index="category",
    columns="country",
    aggfunc="sum"
)
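When a category/country combination has no rows, the pivot table leaves NaN in that cell; the fill_value parameter replaces those holes. A minimal sketch with toy data (hypothetical values, and no "toys" sales in Canada):

```python
import pandas as pd

# Toy orders: the toys/Canada cell has no data
df = pd.DataFrame({
    "category": ["books", "books", "toys"],
    "country": ["USA", "Canada", "USA"],
    "order_total": [10.0, 20.0, 5.0],
})

pivot = df.pivot_table(
    values="order_total",
    index="category",
    columns="country",
    aggfunc="sum",
    fill_value=0,  # missing combinations become 0 instead of NaN
)

print(pivot.loc["toys", "Canada"])  # the filled-in cell
```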
7. Merging & Joining DataFrames
Combine datasets like SQL joins.
# Inner join
merged = customers.merge(orders, on="customer_id", how="inner")
# Left join (keep all customers)
all_customers = customers.merge(orders, on="customer_id", how="left")
# Vertical concatenation
combined = pd.concat([df_q1, df_q2], ignore_index=True)
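The difference between the inner and left joins above is easiest to see on tiny tables. In this toy sketch (hypothetical data), Bob has no orders, so the inner join drops him while the left join keeps him with NaN:

```python
import pandas as pd

# Toy customers and orders tables
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Alice", "Bob", "Cara"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "order_total": [100.0, 50.0, 75.0]})

inner = customers.merge(orders, on="customer_id", how="inner")
left = customers.merge(orders, on="customer_id", how="left")

print(len(inner), len(left))  # 3 4
print(left[left["order_total"].isna()]["name"].tolist())  # ['Bob']
```

Checking row counts before and after a merge is a cheap sanity check against accidental row duplication from one-to-many keys.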
Conclusion
Pandas is the Swiss Army knife of data manipulation in Python — it gives you expressive, powerful tools to load, clean, filter, transform, aggregate, and visualize data with minimal code. In 2026, master these core techniques, and for large-scale or performance-critical work, pair Pandas with Polars.
Once you’re comfortable with Pandas, data wrangling becomes fast, intuitive, and even enjoyable — and that’s when the real insights start flowing.