Loading data in chunks is a critical technique when working with large datasets — files too big to fit in memory (CSVs, logs, JSONL, Parquet, etc.) can crash your program or slow it to a crawl if loaded all at once. Instead, process data in smaller pieces (chunks) using iterators or streaming readers — memory usage stays constant, and you can handle gigabytes or terabytes on modest hardware. In 2026, this pattern is essential for data pipelines, ETL jobs, big data analysis, and production scripts.
Here’s a complete, practical guide to loading and processing data in chunks: pandas chunking, manual iteration, generators, real-world CSV/JSONL examples, and modern best practices with Polars and memory efficiency.
The easiest way in pandas is read_csv(chunksize=...) — it returns an iterator of DataFrames, each with a fixed number of rows. You process one chunk at a time — aggregate, filter, write to database, etc. — without ever loading the full file.
import pandas as pd
chunk_size = 100_000 # Adjust based on memory — 100k rows is usually safe
# Iterator over chunks
csv_chunks = pd.read_csv("huge_dataset.csv", chunksize=chunk_size)
total_rows = 0
running_total = 0.0
for i, chunk in enumerate(csv_chunks):
    # Process chunk: clean, filter, compute
    chunk = chunk.dropna(subset=["amount"])  # Example cleaning
    chunk_total = chunk["amount"].sum()
    running_total += chunk_total
    total_rows += len(chunk)
    print(f"Chunk {i+1}: {len(chunk)} rows, partial sum = ${chunk_total:,.2f}")
print(f"Final total: ${running_total:,.2f} over {total_rows:,} rows")
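The same pattern works for transformation, not just aggregation. A minimal sketch, assuming an illustrative `amount` column and hypothetical file names: filter each chunk and append the matches to a new CSV, writing the header only once, so neither file is ever fully in memory.

```python
import pandas as pd

def filter_large_csv(src: str, dst: str, chunk_size: int = 100_000) -> int:
    """Stream rows with amount > 500 from src to dst; return rows written."""
    written = 0
    for i, chunk in enumerate(pd.read_csv(src, chunksize=chunk_size)):
        matches = chunk[chunk["amount"] > 500]
        # Overwrite and write the header on the first chunk, then append
        matches.to_csv(dst, mode="w" if i == 0 else "a",
                       header=(i == 0), index=False)
        written += len(matches)
    return written
```

The `mode`/`header` toggle on the first chunk is the key detail: without it, appended chunks would repeat the header row in the middle of the output file.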
Manual chunking with file.read(size) works for binary files or when you need custom chunk logic — yield fixed-size byte blocks.
def read_in_chunks(file_path: str, chunk_size: int = 1024 * 1024):  # 1MB chunks
    """Yield chunks from a large binary file."""
    with open(file_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
# Process a 50GB file safely
for i, chunk in enumerate(read_in_chunks("massive.bin"), start=1):
    # Example: hash chunk, compress, upload
    print(f"Processed chunk {i} ({len(chunk):,} bytes)")
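Hashing is a natural fit for this pattern, because `hashlib` digests are incremental by design. A self-contained sketch (path and block size are illustrative) that computes a SHA-256 of an arbitrarily large file in constant memory:

```python
import hashlib

def sha256_of_file(path: str, block_size: int = 1024 * 1024) -> str:
    """Hash a file of any size using fixed-size blocks, not the whole file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(block_size):
            digest.update(block)  # Feed each block into the running hash
    return digest.hexdigest()
```

The result is identical to hashing the whole file at once; only the peak memory differs.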
Real-world pattern: processing large JSONL (JSON Lines) files — one JSON object per line — ideal for logs, event streams, or big datasets. Use a generator to yield parsed records.
import json
from itertools import islice
def valid_jsonl_records(file_path: str, min_value: float = 0.0):
    """Yield only valid, high-value JSONL records."""
    with open(file_path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            try:
                record = json.loads(line)
                if record.get("value", 0) >= min_value:
                    yield record
            except json.JSONDecodeError as e:
                print(f"Invalid JSON on line {line_num}: {e}")
# Take first 100 valid high-value records
for record in islice(valid_jsonl_records("events.jsonl", min_value=1000), 100):
    print(record["event_type"], record["value"])
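Generators like this compose well with batching. A small sketch (batch size is illustrative) that groups any record stream into fixed-size lists, e.g. for bulk database inserts, without materializing the whole stream:

```python
from itertools import islice

def batched(iterable, batch_size: int):
    """Yield lists of up to batch_size items from any iterator."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch
```

Feeding `valid_jsonl_records(...)` through `batched(..., 500)` would give you insert-ready groups of 500 records while still reading the file line by line.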
Best practices make chunked loading safe, fast, and maintainable:
- Always use with open(...): files are closed automatically, even on exceptions.
- Specify encoding="utf-8" to avoid UnicodeDecodeError on real-world files.
- Choose chunk size wisely: 10k–100k rows for pandas, 1–10MB for binary; balance memory against I/O overhead.
- Handle per-chunk errors: wrap processing in try/except and skip bad chunks instead of crashing.
- Modern tip: prefer Polars, e.g. pl.scan_csv("huge.csv").filter(...).collect(streaming=True), which can be 10–100× faster than pandas chunking thanks to lazy evaluation and streaming.
- Use generators (yield) for custom chunking: memory stays low and the pieces compose.
- In production, add progress tracking (tqdm) and logging to monitor processed rows, errors, and estimated time remaining.
- Combine with itertools.islice() to limit chunks, or itertools.takewhile() for conditional stopping.
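The islice/takewhile combination deserves a quick sketch. Using a hypothetical endless chunk generator, takewhile stops at the first chunk that fails a condition, while islice caps the total number of chunks taken (the thresholds here are illustrative):

```python
from itertools import takewhile, islice

def chunk_numbers():
    """Hypothetical endless stream of 3-number chunks: [0,1,2], [3,4,5], ..."""
    n = 0
    while True:
        yield list(range(n, n + 3))
        n += 3

# Stop at the first chunk containing a value >= 10, and never take more than 5
limited = islice(takewhile(lambda c: max(c) < 10, chunk_numbers()), 5)
chunks = list(limited)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

Because both wrappers are lazy, the underlying generator is only advanced as far as needed, so this works unchanged on a chunked file reader.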
Loading data in chunks is how Python scales to massive files — one piece at a time, using constant memory. In 2026, master pandas chunking, manual generators, Polars streaming, and per-chunk error handling. You’ll process gigabytes or terabytes on laptops, avoid OOM crashes, and write clean, production-grade data pipelines.
Next time you face a large file — don’t load it all. Open it, chunk it, iterate, and let Python do the heavy lifting safely.