Managing data with generators is one of the most memory-efficient and scalable approaches to handling large datasets in Python, especially when files exceed available RAM or you are streaming data in real time. Generators (via `yield`) and generator expressions produce values lazily, one at a time, so the full dataset never has to fit in memory. Combined with `pd.read_csv(chunksize=...)`, they let you process gigabyte-scale CSVs, filter/transform/aggregate on the fly, and chain operations without building intermediate lists or full DataFrames. In 2026, this pattern is foundational for big-data ETL, cleaning, feature engineering, and exploratory analysis in pandas/Polars pipelines: it prevents out-of-memory (OOM) errors, can cut peak memory use by 90% or more (since only one chunk is resident at a time), and pairs well with lazy evaluation in Polars for even greater efficiency.
Here’s a complete, practical guide to managing data with generators in Python: generator basics, chunked CSV reading, filtering/processing per chunk, aggregating results, real-world patterns, and modern best practices with type hints, memory optimization, error handling, and Polars lazy equivalents.
Generator basics — yield produces values lazily; functions with yield return generators.
```python
def read_lines(file_path: str):
    """Generator: yield lines one at a time."""
    with open(file_path) as f:
        for line in f:
            yield line.strip()

# Usage: process line by line (near-zero memory overhead)
for line in read_lines('huge_log.txt'):
    if "ERROR" in line:
        print(line)  # only the current line is in memory
```
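The same laziness applies to generator expressions: chaining them builds a pipeline in which no work happens until the final consumer iterates. A minimal self-contained sketch (the sample data stands in for a file):

```python
# Each step is lazy: nothing is computed until list() pulls values through.
lines = ["10", "ERROR 42", "7", "ERROR 99"]           # stands in for a file
errors = (line for line in lines if "ERROR" in line)  # filter lazily
codes = (int(line.split()[1]) for line in errors)     # transform lazily
result = list(codes)                                  # pipeline runs here
print(result)  # [42, 99]
```

Each expression holds only an iterator, not a materialized list, so the peak memory cost is one item per pipeline stage.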
Chunked CSV with generator — wrap pd.read_csv(chunksize=...) in a generator function for clean iteration.
```python
import pandas as pd

def read_csv_chunks(file_path: str, chunksize: int = 100_000):
    """Generator: yield pandas DataFrame chunks."""
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        yield chunk

# Process chunks lazily
for chunk in read_csv_chunks('large_dataset.csv'):
    # Filter/transform in memory (only one chunk at a time)
    filtered = chunk[chunk['value'] > 100]
    # ... compute stats, save to DB, etc.
    print(f"Processed {len(filtered)} rows")
```
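If pandas is not available (or the rows need no DataFrame machinery), the same chunked pattern can be built from the standard library alone. A hedged sketch using `csv.DictReader` plus `itertools.islice`; the demo file and its `value` column are made up for illustration:

```python
import csv
import itertools
import os
import tempfile

def read_dict_chunks(file_path, chunksize=100_000):
    """Generator: yield lists of row dicts, chunksize rows at a time."""
    with open(file_path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(itertools.islice(reader, chunksize))
            if not chunk:
                break  # reader exhausted
            yield chunk

# Demo on a small temporary CSV with 5 data rows
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["value"])
    writer.writerows([[i] for i in range(5)])

sizes = [len(chunk) for chunk in read_dict_chunks(path, chunksize=2)]
print(sizes)  # [2, 2, 1]
```

`islice` pulls exactly `chunksize` rows from the open reader per iteration, so memory stays bounded by one chunk regardless of file size.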
Filtering & processing inside generator — keep only relevant data, discard immediately.
```python
import pandas as pd

def filter_and_transform_chunks(file_path: str, chunksize: int = 100_000):
    """Generator: yield filtered/transformed chunks."""
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        # Filter example (.copy() avoids SettingWithCopyWarning below)
        filtered = chunk[(chunk['category'] == 'A') & (chunk['value'] > 100)].copy()
        # Transform
        filtered['value_doubled'] = filtered['value'] * 2
        if not filtered.empty:
            yield filtered

# Collect or process further
filtered_results = []
for chunk in filter_and_transform_chunks('large_file.csv'):
    filtered_results.append(chunk)

df_final = pd.concat(filtered_results, ignore_index=True)
print(f"Total filtered rows: {len(df_final)}")
```
Real-world pattern: memory-safe aggregation across chunks — compute running totals/stats without full load.
```python
import pandas as pd

def running_stats(file_path: str, chunksize: int = 100_000):
    """Generator: yield cumulative totals/counts after each chunk."""
    total = 0
    count = 0
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        filtered = chunk[chunk['value'] > 100]
        total += filtered['value'].sum()
        count += len(filtered)
        yield {'running_total': total, 'running_count': count}  # progress report

for stats in running_stats('sales_large.csv'):
    if stats['running_count']:  # guard against division by zero
        print(f"Running avg: {stats['running_total'] / stats['running_count']:.2f}")
```
Best practices make generator-based chunk processing safe, efficient, and scalable:

- Use generators everywhere possible: avoid `list()` on large iterables; prefer `for chunk in ...: yield processed_chunk`.
- Modern tip: prefer Polars `scan_csv().filter(...).sink_parquet()`: lazy, streaming, no manual chunking needed.
- Set `chunksize` reasonably: 10k–100k rows balances memory vs. I/O.
- Use `usecols` to read only the needed columns: `pd.read_csv(..., usecols=['date', 'value'])`.
- Use `dtype` to downcast early: `dtype={'value': 'float32'}`.
- Filter early: discard rows before any heavy computation.
- Write per chunk to disk for resumability: `chunk.to_parquet(f'chunk_{i}.parquet')`.
- Monitor memory: check `psutil.Process().memory_info().rss` before/after chunks.
- Add type hints: `def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame`.
- Handle errors per chunk: try/except inside the loop, so one bad chunk doesn't kill the run.
- Use `low_memory=False` to avoid mixed-type warnings (or, better, pass an explicit `dtype`).
- Use `engine='pyarrow'` for faster CSV parsing.
- Use Polars lazy for full power: `pl.scan_csv().filter(...).collect()` or `.sink_*`, often 2–10× faster and lower-memory than pandas chunks.
Chunking with filtering via generators processes large CSVs efficiently — read batch, filter/transform immediately, yield only relevant data. In 2026, prefer Polars lazy scanning/filtering, use usecols/dtype, write per chunk to Parquet, and monitor memory with psutil. Master generator-based chunking & filtering, and you’ll handle massive datasets scalably, reliably, and with minimal memory footprint.
Next time you process a huge CSV — use generators and chunks. It’s Python’s cleanest way to say: “Handle big data piece by piece — keep memory low, speed high.”