DataFrame manipulation

Pandas DataFrame manipulation is the backbone of data wrangling in Python — from loading raw data to cleaning, transforming, aggregating, and preparing it for modeling or reporting. In 2026, pandas remains the standard, but Polars offers dramatic speed gains for large datasets. Here’s a complete, practical guide to the most essential operations, with modern code patterns and best practices.

1. Creating a DataFrame (Most Common Sources)


import pandas as pd
import polars as pl

# From list of dicts (by row) — very common from APIs/JSON
data_rows = [
    {'name': 'John', 'age': 30, 'gender': 'M'},
    {'name': 'Jane', 'age': 25, 'gender': 'F'},
    {'name': 'Mike', 'age': 35, 'gender': 'M'},
    {'name': 'Susan', 'age': 40, 'gender': 'F'}
]
df = pd.DataFrame(data_rows)

# From dict of lists (by column) — great for columnar data
data_cols = {
    'name': ['John', 'Jane', 'Mike', 'Susan'],
    'age': [30, 25, 35, 40],
    'gender': ['M', 'F', 'M', 'F']
}
df_cols = pd.DataFrame(data_cols)

# From CSV (real-world staple)
# df_csv = pd.read_csv('data.csv', parse_dates=['date_column'])

# Polars equivalents (faster for big data)
df_pl_rows = pl.DataFrame(data_rows)
df_pl_cols = pl.DataFrame(data_cols)
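
Because the CSV line above is commented out, here is a minimal, self-contained sketch of the same call. The CSV contents, the column names (date_column, value), and the in-memory io.StringIO buffer standing in for data.csv are assumptions made for illustration, not part of any real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for 'data.csv' (illustrative only)
csv_text = """date_column,value
2026-01-01,10
2026-01-02,20
"""

# parse_dates converts the listed columns to datetime while loading
df_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=['date_column'])

# Without parse_dates, the same column is loaded as plain strings
df_raw = pd.read_csv(io.StringIO(csv_text))

print(df_csv.dtypes)
print(df_raw.dtypes)
```

Comparing the two dtype listings shows why parsing dates at load time saves a separate conversion step later.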

2. Viewing & Inspecting DataFrames


# First/last rows
print(df.head(3))
print(df.tail(2))

# Quick info (types, non-null counts, memory)
# info() prints its report directly and returns None, so no print() is needed
df.info()

# Descriptive stats for numeric columns
print(df.describe())

# Polars style (uses the Polars frame created in section 1)
print(df_pl_cols.describe())
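
A few more quick checks often go alongside head() and describe(). The sketch below rebuilds the sample frame from section 1 so that it runs on its own; which checks to run is a matter of taste, not a fixed recipe.

```python
import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Mike', 'Susan'],
    'age': [30, 25, 35, 40],
    'gender': ['M', 'F', 'M', 'F'],
})

n_rows, n_cols = df.shape                     # (rows, columns)
missing = df.isna().sum()                     # missing values per column
gender_counts = df['gender'].value_counts()   # frequency of each category

print(n_rows, n_cols)
print(df.dtypes)
print(missing)
print(gender_counts)
```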

3. Filtering Rows & Selecting Columns


# Filter rows (boolean indexing)
df_adult = df[df['age'] > 30]

# Filter + select columns
df_adult_names = df.loc[df['age'] > 30, ['name', 'age']]

# Multiple conditions
df_young_female = df[(df['age'] < 30) & (df['gender'] == 'F')]

# Polars (expression-based filter on the frame from section 1)
df_pl_adult = df_pl_cols.filter(pl.col('age') > 30)
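
The same filters can also be written with query() and isin(). This is a small sketch on the sample data; the variable min_age and the result names are hypothetical, chosen only for this example.

```python
import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Mike', 'Susan'],
    'age': [30, 25, 35, 40],
    'gender': ['M', 'F', 'M', 'F'],
})

min_age = 30  # hypothetical threshold, referenced inside query() with @

# query() takes the condition as a string; @name reads a Python variable
df_adult_q = df.query('age > @min_age')

# Multiple conditions read like plain boolean logic
df_young_female_q = df.query("age < 30 and gender == 'F'")

# isin() keeps rows whose value appears in a list
df_selected = df[df['name'].isin(['John', 'Susan'])]

print(df_adult_q)
print(df_young_female_q)
print(df_selected)
```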

4. Adding & Removing Columns


# Add new column
df['salary'] = [50000, 60000, 70000, 80000]

# Add computed column
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 40, 100], labels=['Young', 'Mid', 'Senior'])

# Remove column
df = df.drop(columns=['salary'])

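# Polars (each call returns a new frame instead of mutating in place)
df_pl = df_pl_cols.with_columns(
    (pl.col('age') * 12).alias('age_months'),
    pl.Series('salary', [50000, 60000, 70000, 80000]),  # add salary so it can be dropped below
)
df_pl = df_pl.drop('salary')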
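
In pandas, assign() offers a similar non-mutating style: it returns a new frame and leaves the original untouched. A minimal sketch on the sample data follows; the names df_with_salary and df_without_salary are hypothetical.

```python
import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Mike', 'Susan'],
    'age': [30, 25, 35, 40],
    'gender': ['M', 'F', 'M', 'F'],
})

# assign() returns a new DataFrame; a callable receives the frame being built
df_with_salary = df.assign(
    salary=[50000, 60000, 70000, 80000],
    age_months=lambda d: d['age'] * 12,
)

# drop() also returns a new frame by default
df_without_salary = df_with_salary.drop(columns=['salary'])

print(df.columns.tolist())                 # original frame is unchanged
print(df_without_salary.columns.tolist())
```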

5. Grouping & Aggregation


# Group by gender, mean age
print(df.groupby('gender')['age'].mean())

# Multiple aggregations
print(df.groupby('gender').agg({
    'age': ['mean', 'min', 'max'],
    'name': 'count'
}))

# Polars (expressive & fast)
print(df_pl_cols.group_by('gender').agg(
    mean_age=pl.col('age').mean(),
    count=pl.col('name').count()
))
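
The dict form of agg() above produces two-level column names in pandas. Named aggregation is one way to get flat, readable column names instead, and as_index=False keeps gender as a regular column. The output column names here (mean_age, min_age, max_age, n) are illustrative choices.

```python
import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Mike', 'Susan'],
    'age': [30, 25, 35, 40],
    'gender': ['M', 'F', 'M', 'F'],
})

# Named aggregation: new_column=(source_column, aggregation)
summary = df.groupby('gender', as_index=False).agg(
    mean_age=('age', 'mean'),
    min_age=('age', 'min'),
    max_age=('age', 'max'),
    n=('name', 'count'),
)

print(summary)
```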

6. Merging / Joining DataFrames


df1 = pd.DataFrame({'name': ['John', 'Jane'], 'age': [30, 25]})
df2 = pd.DataFrame({'name': ['John', 'Mike'], 'salary': [50000, 70000]})

# Inner merge
df_merged = pd.merge(df1, df2, on='name', how='inner')

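# Polars join (same data built as Polars frames)
df_pl1 = pl.DataFrame({'name': ['John', 'Jane'], 'age': [30, 25]})
df_pl2 = pl.DataFrame({'name': ['John', 'Mike'], 'salary': [50000, 70000]})
df_pl_merged = df_pl1.join(df_pl2, on='name', how='inner')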
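
An inner merge keeps only names present in both frames (here, just John). The sketch below shows two other common variants on the same data: a left merge, and an outer merge with indicator=True to see where each row came from. The validate argument is optional and is included only to show how to assert that the keys are unique.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['John', 'Jane'], 'age': [30, 25]})
df2 = pd.DataFrame({'name': ['John', 'Mike'], 'salary': [50000, 70000]})

# Left merge: keep every row of df1; unmatched salary becomes NaN
df_left = pd.merge(df1, df2, on='name', how='left', validate='one_to_one')

# Outer merge: keep rows from both sides; _merge records each row's origin
df_outer = pd.merge(df1, df2, on='name', how='outer', indicator=True)

print(df_left)
print(df_outer)
```

Checking the _merge column after an outer merge is a quick way to spot keys that failed to match.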

Best Practices & Common Pitfalls (2026 Edition)

Use parse_dates in read_csv — saves time later
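Avoid chained indexing (df[df['age']>30]['name']), especially when assigning; use loc or query() instead
Pandas gotcha: groupby moves the group keys into the index; call .reset_index() or pass as_index=False to keep them as columns
Large data? Consider Polars: it is often several times faster for groupby, joins, and filtering, though the gain depends on the data and the workload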
Always check .shape and .info() after operations — catch silent errors
Production: use method chaining (.assign(), .pipe()) for readable pipelines
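
As a rough illustration of the chaining and loc advice above, here is a small pipeline on the sample data. The helper add_age_group and the column over_30 are hypothetical names introduced for this sketch; the bins match those used in section 4.

```python
import pandas as pd


def add_age_group(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: bucket ages using the bins from section 4."""
    return frame.assign(
        age_group=pd.cut(
            frame['age'], bins=[0, 30, 40, 100], labels=['Young', 'Mid', 'Senior']
        )
    )


# Same sample data as in section 1
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Mike', 'Susan'],
    'age': [30, 25, 35, 40],
    'gender': ['M', 'F', 'M', 'F'],
})

# One readable chain: filter and select with a single .loc, then transform
result = (
    df.loc[df['age'] > 25, ['name', 'age', 'gender']]
      .assign(age_months=lambda d: d['age'] * 12)
      .pipe(add_age_group)
      .reset_index(drop=True)
)

# Assign through a single .loc call rather than df[mask]['col'] = value
df['over_30'] = False
df.loc[df['age'] > 30, 'over_30'] = True

print(result)
print(df)
```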

Conclusion

Pandas (and Polars) DataFrame manipulation is where raw data becomes insight. In 2026, master creation from dicts/lists/CSV, filtering with loc or query, adding/removing columns, groupby aggregations, and merging, then choose pandas for flexibility or Polars for speed on big data. These operations cover most day-to-day work, from cleaning to feature engineering to reporting. Practice them, inspect the outputs, and you’ll turn messy data into clean, actionable tables quickly.

Next time you load data — start manipulating it immediately. The faster you get comfortable with these patterns, the faster you’ll deliver value.