Filtering chunks when reading large CSV files with pd.read_csv(chunksize=...) is a key technique for memory efficiency — apply filters, transformations, or aggregations to each chunk immediately, discard unnecessary rows early, and only keep (or append) the processed results. This drastically reduces peak memory usage, speeds up processing, and avoids loading irrelevant data into the final DataFrame. In 2026, chunk filtering is standard practice for big data ETL, cleaning, feature engineering, or subsetting — especially when files are gigabytes in size, RAM is limited, or you're only interested in specific rows (e.g., recent dates, high-value records, certain categories). It pairs perfectly with pandas query, boolean indexing, and Polars lazy filtering for even greater efficiency.
Here’s a complete, practical guide to filtering chunks with pd.read_csv(chunksize=...): basic filtering, multiple conditions, real-world patterns (date ranges, categorical filters, value thresholds), collecting results, and modern best practices with type hints, memory optimization, error handling, and Polars lazy equivalents.
Basic chunk filtering — use boolean indexing or query to keep only matching rows per chunk.
import pandas as pd

file_path = 'large_file.csv'
chunksize = 100_000

filtered_chunks = []
for chunk in pd.read_csv(file_path, chunksize=chunksize):
    # Filter example: keep only rows where 'category' == 'A' and 'value' > 100
    filtered = chunk[(chunk['category'] == 'A') & (chunk['value'] > 100)]

    # Alternative with query (often more readable)
    # filtered = chunk.query("category == 'A' and value > 100")

    if not filtered.empty:
        filtered_chunks.append(filtered)

# Final DataFrame from filtered chunks
df_filtered = pd.concat(filtered_chunks, ignore_index=True)
print(f"Total filtered rows: {len(df_filtered)}")
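As a self-contained illustration of the loop above, here is the same pattern run end-to-end on a tiny in-memory CSV (io.StringIO stands in for a real large_file.csv; the data is made up for the demo):

```python
import io

import pandas as pd

# Synthetic CSV standing in for a large file on disk (demo assumption)
csv_data = io.StringIO(
    "category,value\n"
    "A,150\n"
    "B,200\n"
    "A,50\n"
    "A,300\n"
    "C,500\n"
)

filtered_chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):  # tiny chunks for the demo
    # Keep only rows where category == 'A' and value > 100
    filtered = chunk[(chunk["category"] == "A") & (chunk["value"] > 100)]
    if not filtered.empty:
        filtered_chunks.append(filtered)

df_filtered = pd.concat(filtered_chunks, ignore_index=True)
print(f"Total filtered rows: {len(df_filtered)}")  # only A/150 and A/300 survive
```

Only two of the five rows match the filter, so peak memory is driven by the surviving rows plus one chunk at a time — never the full file.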
Filtering with multiple conditions — combine with & (and), | (or), ~ (not), and parentheses for complex logic.
filtered_chunks = []
for chunk in pd.read_csv(file_path, chunksize=chunksize):
    # Complex filter: (category in ['A', 'B'] and value between 50 and 500) or status == 'active'
    # Note: & binds tighter than |, so explicit parentheses make the grouping unambiguous
    mask = (
        (chunk['category'].isin(['A', 'B']) & chunk['value'].between(50, 500))
        | (chunk['status'] == 'active')
    )
    filtered = chunk[mask]
    if not filtered.empty:
        filtered_chunks.append(filtered)
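Because & binds tighter than | in Python, a mask like `A & B | C` evaluates as `(A & B) | C`. A small sketch on a made-up three-row chunk (column names and values are demo assumptions) makes the grouping explicit:

```python
import pandas as pd

# Tiny synthetic chunk (demo assumption)
chunk = pd.DataFrame({
    "category": ["A", "C", "B"],
    "value": [100, 10, 1000],
    "status": ["inactive", "active", "inactive"],
})

# (category in {A, B} AND 50 <= value <= 500) OR status == 'active'
mask = (
    (chunk["category"].isin(["A", "B"]) & chunk["value"].between(50, 500))
    | (chunk["status"] == "active")
)
# Row 0 matches the AND branch, row 1 matches the OR branch, row 2 matches neither
print(chunk[mask])
```

Row 2 (category 'B', value 1000) is excluded: it passes the isin test but fails between(50, 500), and its status is not 'active'.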
Real-world pattern: date-range filtering in time-series CSV — parse dates in chunks and filter recent records.
filtered_chunks = []
for chunk in pd.read_csv(file_path, chunksize=100_000, parse_dates=['date']):
    # Filter rows from 2024 onwards
    recent = chunk[chunk['date'] >= '2024-01-01']
    if not recent.empty:
        filtered_chunks.append(recent)

df_recent = pd.concat(filtered_chunks, ignore_index=True)
print(df_recent['date'].min(), df_recent['date'].max())
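The same date-range pattern, runnable on a small synthetic time series (the in-memory CSV and its dates are assumptions for the demo; parse_dates makes the string comparison against '2024-01-01' work on a proper datetime column):

```python
import io

import pandas as pd

# Synthetic time-series CSV (demo assumption)
csv_data = io.StringIO(
    "date,value\n"
    "2023-06-01,1\n"
    "2024-01-01,2\n"
    "2024-07-15,3\n"
    "2025-03-10,4\n"
)

filtered_chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2, parse_dates=["date"]):
    # Keep rows from 2024 onwards; comparing datetime64 against an ISO string works
    recent = chunk[chunk["date"] >= "2024-01-01"]
    if not recent.empty:
        filtered_chunks.append(recent)

df_recent = pd.concat(filtered_chunks, ignore_index=True)
print(df_recent["date"].min(), df_recent["date"].max())
```

The 2023 row is dropped inside the loop, so it never reaches the final DataFrame.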
Best practices make chunk filtering efficient, readable, and scalable:

- Filter early: apply conditions right after reading each chunk so irrelevant rows are discarded immediately.
- Modern tip: prefer Polars scan_csv().filter(...). Lazy filtering is faster and lighter on memory, with no need to collect chunks manually.
- Use boolean indexing or query: query is often more readable for complex conditions.
- Combine usecols with filtering: read only the needed columns to minimize memory per chunk.
- Pass dtype: downcast early (int32/float32) to reduce memory.
- Handle empty chunks: check if not filtered.empty before appending.
- Write filtered chunks to disk: filtered.to_parquet(f'chunk_{i}.parquet', index=False) avoids holding everything in memory.
- Monitor memory: psutil.Process().memory_info().rss before/after chunks.
- Add type hints: def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame.
- Use iterator=True with get_chunk() when you need explicit control over iteration.
- Use low_memory=False to avoid mixed-type warnings within chunks.
- Combine with Dask for distributed chunk filtering if pandas is too slow.
- Use pd.read_csv(..., engine='pyarrow') for faster parsing — but note the pyarrow engine does not support chunksize, so reserve it for reads that fit in memory.
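Several of these tips (filter early, usecols, dtype downcasting, type-hinted helpers) combine naturally into one per-chunk function. A minimal sketch on synthetic data — the column names, threshold, and in-memory CSV are assumptions for the demo:

```python
import io

import pandas as pd


def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Filter early, then downcast to shrink memory per chunk."""
    out = chunk[chunk["value"] > 100].copy()
    out["value"] = out["value"].astype("float32")  # downcast after filtering
    return out


# Synthetic CSV with an extra column we never need (demo assumption)
csv_data = io.StringIO("id,value,unused\n1,50,x\n2,150,y\n3,999,z\n")

chunks = pd.read_csv(
    csv_data,
    chunksize=2,
    usecols=["id", "value"],   # read only the needed columns
    dtype={"id": "int32"},     # downcast at parse time
)
df = pd.concat((process_chunk(c) for c in chunks), ignore_index=True)
print(len(df), dict(df.dtypes.astype(str)))
```

Dropping the unused column at read time and downcasting before concatenation keeps both per-chunk and final memory small; the type hint documents the contract of the helper.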
Filtering chunks with pd.read_csv(chunksize=...) discards irrelevant rows early — reduce memory, speed up processing, and focus on what matters. In 2026, filter with boolean indexing or query, prefer Polars lazy scan_csv().filter() for superior performance, use usecols/dtype, and write to disk per chunk if needed. Master chunk filtering, and you’ll process massive CSVs efficiently, reliably, and with minimal memory footprint.
Next time you read a large CSV — filter chunks. It’s Python’s cleanest way to say: “Process only what I need — discard the rest immediately.”