Using the pandas read_csv iterator for streaming data is one of the most effective ways to process large CSV files without loading the entire dataset into memory at once. By setting the chunksize parameter in pd.read_csv(), pandas returns a TextFileReader iterator that yields DataFrame chunks of up to a specified number of rows — perfect for gigabyte-scale files, memory-constrained environments, or streaming ETL pipelines. In 2026, this pattern remains essential for data engineering, analysis, and production workflows — especially when combined with Polars streaming for even better performance.
Here’s a complete, practical guide to using the read_csv iterator for streaming large CSVs: basic chunked reading, real-world processing patterns, aggregation across chunks, error handling, and modern best practices with type hints, Polars alternatives, and scalability tips.
The core idea is simple: pass chunksize to pd.read_csv() — it returns an iterator you loop over with for chunk in reader:. Each chunk is a DataFrame with up to chunksize rows — process it, aggregate results, write to database, or discard — memory stays constant regardless of file size.
import pandas as pd
file_path = "large_sales.csv"
chunk_size = 100_000 # Adjust based on available RAM — 100k rows is usually safe
# Create the chunk iterator
csv_iterator = pd.read_csv(file_path, chunksize=chunk_size, encoding="utf-8")
total_sales = 0.0
total_rows = 0
for i, chunk in enumerate(csv_iterator):
    # Clean and process the chunk
    chunk["amount"] = pd.to_numeric(chunk["amount"], errors="coerce")  # Convert to float; bad values become NaN
    chunk = chunk.dropna(subset=["amount"])  # Drop missing and unparseable amounts
    # Aggregate
    chunk_total = chunk["amount"].sum()
    total_sales += chunk_total
    total_rows += len(chunk)
    print(f"Chunk {i+1}: {len(chunk):,} rows, partial sum = ${chunk_total:,.2f}")
print(f"\nFinal: ${total_sales:,.2f} over {total_rows:,} rows")
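The same loop can be made a little more robust and memory-friendly: the TextFileReader also works as a context manager (pandas 1.2+), which closes the underlying file even if processing fails, and passing dtype and parse_dates up front avoids repeated type inference per chunk. A minimal sketch using a small in-memory sample — the io.StringIO data and the order_id/order_date/amount columns are illustrative, not from the original file:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for large_sales.csv
csv_data = io.StringIO(
    "order_id,order_date,amount\n"
    "1,2026-01-02,19.99\n"
    "2,2026-01-03,5.00\n"
    "3,2026-01-03,12.50\n"
)

total = 0.0
# TextFileReader is a context manager (pandas >= 1.2): the file handle
# is closed even if a chunk raises mid-loop.
with pd.read_csv(
    csv_data,
    chunksize=2,                    # tiny chunks just for the demo
    dtype={"order_id": "int64", "amount": "float64"},
    parse_dates=["order_date"],     # parse dates once, at read time
) as reader:
    for chunk in reader:
        total += chunk["amount"].sum()

print(f"Total: ${total:,.2f}")      # Total: $37.49
```

With explicit dtypes, each chunk arrives already typed, so no per-chunk conversion or inference pass is needed.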
Real-world pattern: streaming aggregation and filtering — accumulate running totals, counts, or stats across chunks without ever holding the full dataset.
# Running stats: min/max/mean price per chunk
running_min = float("inf")
running_max = float("-inf")
total_price = 0.0
total_count = 0
for chunk in pd.read_csv("products.csv", chunksize=50_000):
    chunk = chunk[chunk["price"] > 0]  # Filter invalid prices
    if chunk.empty:
        continue
    chunk_min = chunk["price"].min()
    chunk_max = chunk["price"].max()
    chunk_sum = chunk["price"].sum()
    chunk_count = len(chunk)
    running_min = min(running_min, chunk_min)
    running_max = max(running_max, chunk_max)
    total_price += chunk_sum
    total_count += chunk_count
    print(f"Chunk min/max: ${chunk_min:,.2f} / ${chunk_max:,.2f}")
print(f"Overall min/max/avg: ${running_min:,.2f} / ${running_max:,.2f} / ${total_price/total_count:,.2f}")
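A closely related streaming pattern is filtering a huge CSV down to a clean output file without ever materializing it: write each filtered chunk with to_csv in append mode, emitting the header only once. A sketch with hypothetical sample data — src, out_path, and the sku/price columns are illustrative:

```python
import io

import pandas as pd

# Hypothetical input standing in for products.csv
src = io.StringIO(
    "sku,price\n"
    "A,10.0\n"
    "B,-1.0\n"
    "C,25.5\n"
    "D,0.0\n"
    "E,3.25\n"
)

out_path = "valid_products.csv"     # assumed output location
first_chunk = True
for chunk in pd.read_csv(src, chunksize=2):
    chunk = chunk[chunk["price"] > 0]   # keep only valid prices
    if chunk.empty:
        continue
    # Write the header only for the first chunk, then append;
    # each chunk is flushed to disk and freed, so memory stays flat.
    chunk.to_csv(
        out_path,
        mode="w" if first_chunk else "a",
        header=first_chunk,
        index=False,
    )
    first_chunk = False
```

This reads and writes in lockstep: at no point does more than one chunk of either file live in memory.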
Best practices make chunked reading safe, fast, and maintainable:

- Choose chunk size wisely: 10k–100k rows works well for pandas, balancing memory against I/O overhead. Too small means high per-chunk overhead; too large risks memory pressure.
- Pass dtype and parse_dates to read_csv(): explicit types reduce memory use and speed up parsing.
- Handle per-chunk errors: wrap processing in try/except and skip or log bad chunks instead of crashing the whole run.
- Modern tip: switch to Polars for very large files. pl.scan_csv("huge.csv").filter(...).collect(streaming=True) offers true streaming and lazy evaluation, and is often dramatically faster than pandas.
- Set low_memory=False in pandas if needed, to avoid mixed-dtype inference warnings.
- In production, add progress tracking (e.g. tqdm over chunks) and logging: monitor processed rows, errors, and estimated time remaining.
- Reach for pd.concat() only if you truly need the full DataFrame in memory; prefer per-chunk aggregation or writing to a database or files incrementally.
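The per-chunk error handling advice can be sketched concretely: convert with errors="raise" so a corrupt value fails loudly for that chunk only, then catch, count, and continue. The sample feed below is hypothetical (id/amount columns, the "oops" value), standing in for a real stream with occasional bad records:

```python
import io

import pandas as pd

# Hypothetical feed in which one chunk contains a corrupt value
src = io.StringIO(
    "id,amount\n"
    "1,10.0\n"
    "2,oops\n"   # bad value: fails strict numeric conversion
    "3,2.5\n"
    "4,7.5\n"
)

total = 0.0
skipped = 0
for i, chunk in enumerate(pd.read_csv(src, chunksize=2)):
    try:
        # errors="raise" makes a bad value raise instead of silently
        # becoming NaN, so the whole chunk can be quarantined.
        chunk["amount"] = pd.to_numeric(chunk["amount"], errors="raise")
        total += chunk["amount"].sum()
    except (ValueError, TypeError) as exc:
        skipped += 1
        print(f"Skipping chunk {i}: {exc}")

print(f"Processed total = {total}, skipped chunks = {skipped}")
```

Whether to skip the whole chunk (as here) or coerce bad rows to NaN and drop them (as in the earlier example) depends on whether partial data from a corrupt chunk is trustworthy; logging which chunks failed lets you reprocess them later.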
The read_csv iterator turns huge CSVs into streamable data — process gigabytes or terabytes with constant memory. In 2026, use chunking for pandas, switch to Polars streaming for scale, and handle errors per chunk. Master this pattern, and you’ll load and process large data safely, efficiently, and Pythonically — no more OOM crashes.
Next time you face a large CSV — don’t load it all. Use chunksize, iterate, and process one piece at a time. It’s the cleanest way in Python to handle data bigger than memory.