Using pd.read_csv() with chunksize is the standard, memory-efficient way to read large CSV files in pandas. Instead of loading the entire file into memory at once, chunksize reads the file in smaller batches (e.g., 10,000 rows at a time), letting you process each chunk incrementally, filter/aggregate/transform on the fly, and avoid out-of-memory (OOM) errors. In 2026, this technique is essential for big-data workflows: files that are gigabytes in size, servers with limited RAM, or streaming data through ETL pipelines, data cleaning, feature engineering, and exploratory analysis. It combines well with pandas concat, Polars lazy scanning, or custom chunk processors for scalable, low-memory CSV handling.
Here’s a complete, practical guide to using pd.read_csv() with chunksize in Python: basic chunked reading, processing each chunk, aggregating results, real-world patterns, and modern best practices with type hints, memory optimization, error handling, and Polars comparison.
Basic chunked reading — chunksize returns an iterator of DataFrame chunks; process or collect as needed.
import pandas as pd

file_path = 'large_file.csv'
chunksize = 100_000  # adjust based on available RAM

# Read in chunks
for chunk in pd.read_csv(file_path, chunksize=chunksize):
    # Process chunk: filter, transform, etc.
    chunk = chunk[chunk['value'] > 100]  # example filter
    # ... more processing ...
    print(f"Processed {len(chunk)} rows")
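The path above is a placeholder for a real file on disk. A self-contained sketch of the same loop, building a tiny in-memory CSV with io.StringIO so it runs anywhere, might look like:

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a large file on disk
csv_data = io.StringIO("value\n" + "\n".join(str(v) for v in range(0, 500, 50)))

matched = 0
for chunk in pd.read_csv(csv_data, chunksize=3):
    chunk = chunk[chunk['value'] > 100]  # same example filter as above
    matched += len(chunk)

print(matched)  # count of rows with value > 100
```

read_csv accepts any file-like object, so the chunked iteration behaves exactly as it would with a path string.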
Collecting results incrementally — append processed chunks to a list, then concat at the end (or write to disk per chunk).
chunks_processed = []
for chunk in pd.read_csv(file_path, chunksize=chunksize):
    # Example: filter and add a computed column
    # (.copy() avoids a SettingWithCopyWarning on the filtered slice)
    chunk = chunk[chunk['category'] == 'A'].copy()
    chunk['value_doubled'] = chunk['value'] * 2
    chunks_processed.append(chunk)

# Final DataFrame
df = pd.concat(chunks_processed, ignore_index=True)
print(df.shape)
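When even the concatenated result would not fit in memory, the "write to disk per chunk" alternative mentioned above can be sketched as follows. The filename filtered_output.csv is a hypothetical choice, and plain CSV append is used so the example needs no Parquet dependency:

```python
import io
import pandas as pd

# Tiny sample standing in for a large file on disk
src = io.StringIO("category,value\nA,1\nB,2\nA,3\nA,4\nB,5\nA,6\n")

out_path = 'filtered_output.csv'  # hypothetical output file
first = True
for chunk in pd.read_csv(src, chunksize=2):
    chunk = chunk[chunk['category'] == 'A'].copy()
    chunk['value_doubled'] = chunk['value'] * 2
    # Append each processed chunk to disk; write the header only once
    chunk.to_csv(out_path, mode='w' if first else 'a', header=first, index=False)
    first = False
```

Each chunk is discarded after it is written, so peak memory stays at roughly one chunk regardless of file size.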
Real-world pattern: memory-efficient aggregation in pandas/Polars — compute running totals, group-by, or filter without full load.
# Pandas: running sum across chunks
total_sum = 0
for chunk in pd.read_csv(file_path, chunksize=100_000):
    total_sum += chunk['value'].sum()
print(f"Total value sum: {total_sum}")
# Polars equivalent (lazy, even better)
import polars as pl
total_pl = pl.scan_csv(file_path).select(pl.col('value').sum()).collect().item()
print(f"Polars total: {total_pl}")
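The group-by case mentioned above also works chunk-by-chunk in pandas: compute a partial aggregate per chunk, then reduce the partials. A minimal sketch, assuming hypothetical category/value columns:

```python
import io
import pandas as pd

# Tiny sample standing in for a large file on disk
src = io.StringIO("category,value\nA,10\nB,20\nA,30\nB,40\nA,50\n")

partials = []
for chunk in pd.read_csv(src, chunksize=2):
    # Partial per-category sums; combining them is valid because sum is associative
    partials.append(chunk.groupby('category')['value'].sum())

# Reduce the partial results into final per-category totals
totals = pd.concat(partials).groupby(level=0).sum()
print(totals.to_dict())  # {'A': 90, 'B': 60}
```

The same partial-then-reduce pattern works for count and min/max; mean needs a little more care (carry sum and count separately, divide at the end).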
Best practices make chunked reading safe, efficient, and scalable:
- Choose chunksize wisely: 10k–100k rows balances memory vs I/O; too small means many tiny reads, too large risks OOM.
- Prefer Polars scan_csv() for a lazy, columnar, faster, lower-memory alternative; call .collect() or .sink_* only when needed.
- Use usecols to read only the columns you need: pd.read_csv(..., usecols=['id', 'value']).
- Use dtype to downcast types, e.g. dtype={'value': 'float32'}; explicit dtypes also avoid the mixed-dtype guessing that low_memory warnings flag.
- Process and discard chunks rather than keeping them all in memory; write to Parquet or a database per chunk.
- Use converters for custom parsing per chunk.
- Handle errors by wrapping each chunk's processing in try/except.
- Use iterator=True with get_chunk() for explicit control over when each chunk is read.
- Monitor memory with psutil.Process().memory_info().rss before and after chunks.
- Use Polars lazy streaming: pl.scan_csv().filter(...).collect() is often 2–10× faster and lower-memory than pandas chunks.
- Add type hints: def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame.
- Call gc.collect() after each chunk to force cleanup if needed.
- Export per chunk: chunk.to_parquet(f'chunk_{i}.parquet', index=False).
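Several of these practices (usecols, dtype downcasting, a type-hinted processor, per-chunk error handling) combine naturally in one loop. A hedged sketch with made-up sample data; the process_chunk body is a placeholder for workload-specific logic:

```python
import io
import pandas as pd

# Tiny sample standing in for a large file on disk
src = io.StringIO("id,value,notes\n1,1.5,x\n2,2.5,y\n3,3.5,z\n")

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Placeholder transform; real logic is workload-specific
    return chunk[chunk['value'] > 2.0]

rows = 0
reader = pd.read_csv(
    src,
    usecols=['id', 'value'],                    # read only needed columns
    dtype={'id': 'int32', 'value': 'float32'},  # downcast to save memory
    chunksize=2,
)
for chunk in reader:
    try:
        rows += len(process_chunk(chunk))
    except (ValueError, KeyError) as exc:
        # Skip a bad chunk rather than aborting the whole run
        print(f"Skipping chunk: {exc}")

print(rows)  # number of rows with value > 2.0
```

Because usecols drops the notes column at parse time, it never occupies memory at all, which compounds with the float32/int32 downcasts.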
Using pd.read_csv(chunksize=...) reads large CSVs in batches: process incrementally, avoid OOM, aggregate on the fly. In 2026, prefer Polars scan_csv() as a lazy, faster, lower-memory alternative; use usecols, dtype, per-chunk processing, and memory monitoring. Master chunked reading, and you'll handle gigabyte-scale CSVs efficiently and reliably.
Next time you face a large CSV — use chunksize. It’s Python’s cleanest way to say: “Read this file piece by piece — no memory explosion.”