Chunking and filtering together is one of the most powerful memory-efficient techniques for processing large CSV files in pandas: read the file in small chunks with chunksize, immediately filter each chunk to keep only the relevant rows, perform transformations or aggregations on the filtered data, and either append the results incrementally or write them to disk or a database. This approach avoids loading the entire file into memory, drastically reduces peak RAM usage, speeds up processing by discarding unneeded data early, and enables scalable ETL, cleaning, subsetting, and feature engineering on gigabyte-scale files. In 2026, chunking + filtering is a must-know pattern, especially when combined with pd.concat, to_parquet, or Polars lazy filtering for even better performance and a lower memory footprint.
Here’s a complete, practical guide to chunking and filtering together with pd.read_csv(chunksize=...): basic pattern, multiple conditions, aggregation per chunk, collecting/writing results, real-world patterns, and modern best practices with type hints, memory optimization, error handling, and Polars lazy equivalents.
Basic chunking + filtering — read chunk, filter immediately, append only matching rows.
import pandas as pd

file_path = 'large_file.csv'
chunksize = 100_000

filtered_chunks = []

for chunk in pd.read_csv(file_path, chunksize=chunksize):
    # Filter: keep only rows where category == 'A' and value > 100.
    # .copy() makes the slice an independent frame, avoiding SettingWithCopyWarning below.
    filtered = chunk[(chunk['category'] == 'A') & (chunk['value'] > 100)].copy()

    # Optional: further process the filtered chunk
    filtered['value_doubled'] = filtered['value'] * 2

    if not filtered.empty:
        filtered_chunks.append(filtered)

# Final DataFrame from filtered chunks
if filtered_chunks:
    df_filtered = pd.concat(filtered_chunks, ignore_index=True)
    print(f"Total filtered rows: {len(df_filtered)}")
else:
    df_filtered = pd.DataFrame()
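The same pattern gets noticeably leaner when you also trim columns and dtypes at read time, as the best practices below recommend. Here is a hedged, self-contained sketch: the inline StringIO data and tiny chunksize stand in for a real file path and a realistic chunk size.

```python
import io
import pandas as pd

# Stand-in for a real CSV file on disk (hypothetical demo data)
csv_data = io.StringIO(
    "category,value,notes\n"
    "A,150,x\n"
    "B,90,y\n"
    "A,250,z\n"
    "A,80,w\n"
)

chunks = []
for chunk in pd.read_csv(
    csv_data,
    chunksize=2,                                         # tiny only for the demo
    usecols=["category", "value"],                       # read only needed columns
    dtype={"category": "category", "value": "float32"},  # downcast early
):
    filtered = chunk[(chunk["category"] == "A") & (chunk["value"] > 100)]
    if not filtered.empty:
        chunks.append(filtered)

df_filtered = pd.concat(chunks, ignore_index=True)
print(len(df_filtered))  # 2 matching rows (values 150 and 250)
```

Dropping the unused `notes` column and storing `value` as float32 roughly halves per-chunk memory before the filter even runs.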
Filtering with complex conditions — use boolean indexing, query, or isin for readability and performance.
for chunk in pd.read_csv(file_path, chunksize=chunksize):
    # Complex filter: category in ['A', 'B'], value between 50 and 500, status active
    filtered = chunk[
        chunk['category'].isin(['A', 'B']) &
        chunk['value'].between(50, 500) &
        (chunk['status'] == 'active')
    ]
    # Alternative with query (often more readable):
    # filtered = chunk.query("category in ['A', 'B'] and 50 <= value <= 500 and status == 'active'")
    if not filtered.empty:
        filtered_chunks.append(filtered)
Per-chunk aggregation — compute stats or groups within each chunk, accumulate results.
running_total = 0
running_count = 0

for chunk in pd.read_csv(file_path, chunksize=100_000):
    filtered = chunk[chunk['value'] > 100]
    if not filtered.empty:
        running_total += filtered['value'].sum()
        running_count += len(filtered)

average = running_total / running_count if running_count > 0 else 0
print(f"Average value (filtered): {average:.2f}")
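The running-total approach above generalizes to grouped statistics: accumulate per-group partial sums and counts from each chunk, then combine them at the end. A minimal sketch, using a hypothetical list of small DataFrames in place of the pd.read_csv chunk iterator:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv(..., chunksize=...): two small chunks
chunks = [
    pd.DataFrame({"category": ["A", "B", "A"], "value": [120, 80, 300]}),
    pd.DataFrame({"category": ["B", "A"], "value": [150, 110]}),
]

# Accumulate per-group partial sums and counts across chunks
partials = []
for chunk in chunks:
    filtered = chunk[chunk["value"] > 100]
    if not filtered.empty:
        partials.append(filtered.groupby("category")["value"].agg(["sum", "count"]))

# Combine the partials: sum the sums and counts, then derive the mean
totals = pd.concat(partials).groupby(level=0).sum()
totals["mean"] = totals["sum"] / totals["count"]
print(totals)  # per-category sum, count, and mean over all chunks
```

Sums and counts are safe to accumulate chunk by chunk; means and variances are not, which is why the mean is derived only after all chunks are combined.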
Real-world pattern: date-range filtering + incremental Parquet write — process time-series CSV without full load.
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=100_000, parse_dates=['date'])):
    # Filter recent data (last two years)
    recent = chunk[chunk['date'] >= '2024-01-01']
    if not recent.empty:
        # Optional: clean/transform
        recent = recent.dropna(subset=['value'])
        recent.to_parquet(f'filtered_chunk_{i}.parquet', index=False)
        print(f"Wrote chunk {i}: {len(recent)} rows")
Best practices make chunking + filtering efficient, readable, and scalable:
- Filter immediately after reading each chunk: discard unneeded rows early to minimize memory.
- Modern tip: prefer Polars scan_csv().filter(...).collect() or .sink_parquet(); lazy filtering is faster, uses less memory, and needs no manual chunk concatenation.
- Use usecols to read only the needed columns: pd.read_csv(..., usecols=['date', 'value']).
- Use dtype to downcast early: dtype={'value': 'float32'}. Explicit dtypes also avoid mixed-type inference warnings, more reliably than low_memory=False.
- Use query for readable complex conditions.
- Handle empty chunks: check if not filtered.empty before appending or writing.
- Write filtered chunks to disk: one Parquet file per chunk gives resumability and scalability.
- Monitor memory: psutil.Process().memory_info().rss before/after chunks.
- Add type hints: def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame.
- Note that engine='pyarrow' speeds up parsing but does not support chunksize or iterator; use it for whole-file reads and the default C engine when chunking.
- Use converters for custom per-column parsing if needed.
- Combine with Dask for distributed chunk processing if pandas is too slow.
- Use pd.read_csv(..., iterator=True) with get_chunk() for explicit chunk control.
Chunking + filtering with pd.read_csv(chunksize=...) processes large CSVs efficiently: read a batch, filter it immediately, and append or write only the relevant rows. In 2026, filter early, prefer Polars lazy scan_csv().filter(), use usecols/dtype, write per chunk to Parquet, and monitor memory. Master this pattern, and you'll handle massive CSVs scalably, reliably, and with minimal memory footprint.
Next time you read a large CSV — chunk and filter. It’s Python’s cleanest way to say: “Process only what matters — discard the rest right away.”