Generators for the large data limit

Generators are one of Python’s most useful tools for handling datasets that are too big to fit in memory: files, streams, logs, CSVs, JSONL, or API responses. Instead of loading everything at once (which risks out-of-memory crashes), a generator yields data one item, one line, or one chunk at a time, so memory use stays roughly constant and processing happens lazily, on demand. In 2026, generators (written as functions that use yield, or as generator expressions) are essential for scalable data pipelines, ETL jobs, streaming analysis, and production code that works with gigabytes or terabytes on modest hardware.

Here’s a complete, practical guide to using generators for large data: basic yield, chunked file reading, real-world CSV/JSONL patterns, memory benefits, and modern best practices with Polars, error handling, and streaming efficiency.

The core idea: a generator function uses yield to produce values one at a time. When called, it returns a generator object (an iterator) that you can loop over with for or next() — execution pauses at each yield and resumes on the next request. This keeps memory low even for huge sequences.


def count_up_to(n: int):
    """Yield numbers from 0 to n-1 lazily."""
    i = 0
    while i < n:
        yield i
        i += 1

# Process huge range without storing list
for num in count_up_to(1_000_000_000):
    if num % 1_000_000 == 0:
        print(f"Processed {num:,}")
    # ... do something useful
    # Never loads all 1 billion numbers into RAM
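
To make the pause-and-resume behaviour concrete, here is a minimal sketch that drives the same kind of generator by hand with next() and then feeds it into a generator expression. The variable names (first, second, rest, exhausted, total) are only for illustration.

```python
from typing import Iterator


def count_up_to(n: int) -> Iterator[int]:
    """Yield numbers from 0 to n-1 lazily (same shape as the function above)."""
    i = 0
    while i < n:
        yield i
        i += 1


gen = count_up_to(3)         # nothing runs yet; this only creates the generator
first = next(gen)            # runs the body up to the first yield
second = next(gen)           # resumes after that yield and runs to the next one
rest = list(gen)             # drains whatever values are left
exhausted = next(gen, None)  # a finished generator has nothing more to give

# A generator expression is the inline form: lazy, with no def required.
squares = (x * x for x in count_up_to(5))
total = sum(squares)         # consumes the squares one at a time
```

Passing a default to next() avoids the StopIteration that a bare next(gen) would raise once the generator is exhausted.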

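Classic pattern: reading large files line by line. A file object is already a lazy iterator over its lines (not technically a generator, but it behaves the same way in a for loop), so you can loop over it directly for constant-memory processing.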


def process_large_log(file_path: str):
    """Yield cleaned lines from a huge log file."""
    with open(file_path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            clean = line.strip()
            if clean:
                yield (line_num, clean)

# Example usage: find errors without loading file
for num, line in process_large_log("10gb_log.txt"):
    if "ERROR" in line.upper():
        print(f"Error on line {num}: {line}")

Chunked reading for non-line-based files or when you need fixed-size blocks — yield byte chunks or row batches.


def read_in_chunks(file_path: str, chunk_size: int = 1024 * 1024):  # 1MB chunks
    """Yield chunks from a large binary file."""
    with open(file_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Process a 50GB file safely
for i, chunk in enumerate(read_in_chunks("massive.bin"), start=1):
    # Example: hash chunk, compress, upload
    print(f"Processed chunk {i} ({len(chunk):,} bytes)")
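
As one possible use of the "hash chunk" idea in the comment above, the following sketch computes a SHA-256 digest of a file without reading it into memory all at once. The helper name sha256_of_file is an assumption made for this example; hashlib.sha256 and its update() and hexdigest() methods are standard library calls.

```python
import hashlib
from typing import Iterator


def read_in_chunks(file_path: str, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield fixed-size byte chunks from a file (the last one may be shorter)."""
    with open(file_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def sha256_of_file(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, feeding it one chunk at a time."""
    digest = hashlib.sha256()
    for chunk in read_in_chunks(file_path, chunk_size):
        digest.update(chunk)  # only one chunk is held in memory at once
    return digest.hexdigest()
```

Calling sha256_of_file("massive.bin") would give the same digest as hashing the whole file in one go, while peak memory stays close to chunk_size.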

Real-world pattern: chunked CSV processing — yield batches of rows for aggregation, filtering, or database inserts without loading the full file.


import csv

def csv_chunks(file_path: str, chunk_size: int = 100_000):
    """Yield CSV row batches from a large file."""
    # newline="" lets the csv module handle line endings inside quoted fields
    with open(file_path, "r", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

# Aggregate totals across chunks
total_sales = 0.0
for batch in csv_chunks("sales_1tb.csv"):
    for row in batch:
        try:
            # float() accepts decimals and negatives, which str.isdigit() rejects
            total_sales += float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric amount
    print(f"Processed batch, running total: ${total_sales:,.2f}")
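
JSONL files (one JSON object per line) fit the same line-by-line pattern. The sketch below is one way to do it, assuming that blank lines, malformed lines, and non-object records should simply be skipped; the function name read_jsonl is hypothetical, and a real pipeline might log or count the rejected lines instead.

```python
import json
from typing import Any, Iterator


def read_jsonl(file_path: str) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per line, skipping lines that cannot be used."""
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # assumption: bad lines are skipped, not fatal
            if isinstance(record, dict):
                yield record  # only JSON objects are passed downstream
```

Because the try/except wraps a single line, one corrupt record does not stop the rest of the file from being processed.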

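Best practices make generators safe, efficient, and maintainable. Use yield for lazy generation and never accumulate the whole dataset inside a generator; bounded batches, as in csv_chunks, are fine. Always use with open(...) so the file is closed when the generator finishes, is closed explicitly, or exits on an exception. Specify encoding="utf-8" rather than relying on the platform default, which varies between systems; a file that is not valid UTF-8 will still raise UnicodeDecodeError unless you also choose an error handler such as errors="replace". Handle errors per item: wrap the processing of each item in try/except and skip or log bad items instead of crashing the whole run. Modern tip: for structured data, prefer a lazy Polars query such as pl.scan_csv("huge.csv").filter(...).collect(streaming=True), which is typically much faster than manual chunking in pure Python because it combines lazy evaluation with a streaming engine (newer Polars releases spell the option collect(engine="streaming"), so check the version you have installed). Use itertools (islice, takewhile, chain) for advanced slicing and filtering of generators. In production, add progress tracking (tqdm) and logging to monitor processed items, errors, and estimated time. Prefer generators over lists for one-pass or large data, and convert to a list with list(gen) only when random access is required.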
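
The itertools helpers mentioned above can be sketched as follows. The generator naturals() and the helper in_batches() are hypothetical names introduced for this example; islice, takewhile, and chain are the real standard library functions.

```python
from itertools import chain, islice, takewhile
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def naturals() -> Iterator[int]:
    """Yield 0, 1, 2, ... forever (an infinite generator)."""
    n = 0
    while True:
        yield n
        n += 1


def in_batches(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Group any iterable into lists of at most `size` items, lazily."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))  # pull the next `size` items, if any
        if not batch:
            return
        yield batch


first_five = list(islice(naturals(), 5))                    # take a fixed count
below_ten = list(takewhile(lambda n: n < 10, naturals()))   # stop at first failure
combined = list(chain(range(3), range(10, 12)))             # join sources end to end
batches = list(in_batches(range(7), 3))                     # last batch may be short
```

None of these helpers materialises its input, which is why they work even on the infinite naturals() generator as long as something bounds the output.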

Generators are Python’s solution to the large data limit — process gigabytes or terabytes one piece at a time with constant memory. In 2026, build them with yield, use chunking for files, and lean on Polars streaming for structured data. Master generators, and you’ll handle massive datasets, files, and streams with confidence and scalability.

Next time you face a huge dataset — don’t load it all. Write a generator or use streaming tools. It’s Python’s cleanest way to say: “Here’s the data, one item at a time — no memory explosion.”
