Aggregating with generators is a powerful, memory-efficient technique for computing sums, averages, counts, or other aggregates over large datasets, especially when the full data cannot fit into RAM. By using generator expressions or generator functions with yield, you process values lazily (one at a time), apply filtering or transformations on the fly, and feed them directly into built-in aggregators like sum(), sum(1 for ...) (count), or custom reduce functions, avoiding intermediate lists and minimizing memory usage. In 2026, this pattern is essential for big-data ETL, streaming analysis, chunked CSV processing, and large-scale computations in pandas/Polars pipelines: it prevents out-of-memory errors, scales to massive or infinite data, and integrates seamlessly with pd.read_csv(chunksize=...) or Polars lazy evaluation for even greater efficiency.
Here’s a complete, practical guide to aggregating with generators in Python: generator expression aggregation, filtering while summing/counting, chunked CSV aggregation, custom accumulators, real-world patterns, and modern best practices with type hints, memory optimization, Polars lazy equivalents, and performance tips.
Basic generator aggregation — filter and sum lazily without creating a list.
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Sum even numbers using generator expression
even_sum = sum(num for num in numbers if num % 2 == 0)
print(even_sum) # 30
# Count values greater than 5
count_over_5 = sum(1 for num in numbers if num > 5)
print(count_over_5)  # 5
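When the filtering or transformation logic is too involved for a one-line expression, the same lazy aggregation works with a generator function. A minimal sketch (the squares_of_evens name is illustrative, not from any library):

```python
from collections.abc import Iterator

def squares_of_evens(values) -> Iterator[int]:
    """Yield the square of each even value, one at a time."""
    for v in values:
        if v % 2 == 0:
            yield v * v

numbers = range(1, 11)  # any iterable works, including streams
total = sum(squares_of_evens(numbers))
print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```

Because nothing is materialized, the same function can be pointed at a file object or a network stream without code changes.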
Chunked CSV aggregation with generators — process large files in batches, aggregate running totals without full load.
import pandas as pd
from collections.abc import Iterator

def value_sum_chunks(file_path: str, chunksize: int = 100_000) -> Iterator[float]:
    """Generator: yield the running total after each chunk."""
    running_total = 0.0
    for chunk in pd.read_csv(file_path, usecols=['value'], chunksize=chunksize):
        # Filter & sum within the chunk
        filtered_sum = chunk[chunk['value'] > 100]['value'].sum()
        running_total += filtered_sum
        yield running_total  # progress update

# Example: print the running total after each chunk
for total in value_sum_chunks('large_sales.csv'):
    print(f"Running filtered sum: {total}")
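If pandas is not available, the same chunked pattern can be sketched with only the standard library, using csv.DictReader and itertools.islice to pull fixed-size batches; the function name, column name, and threshold here are illustrative:

```python
import csv
import io
from itertools import islice
from collections.abc import Iterator

def value_sum_stdlib(lines, chunksize: int = 2) -> Iterator[float]:
    """Yield a running sum of 'value' entries > 100, reading chunksize rows at a time."""
    reader = csv.DictReader(lines)
    running_total = 0.0
    while True:
        chunk = list(islice(reader, chunksize))  # next batch of rows
        if not chunk:
            break
        running_total += sum(
            float(row['value']) for row in chunk if float(row['value']) > 100
        )
        yield running_total

# Works on any iterable of lines: a file object or, here, an in-memory sample
sample = io.StringIO("value\n50\n150\n200\n120\n")
for total in value_sum_stdlib(sample, chunksize=2):
    print(f"Running filtered sum: {total}")
```

The pandas version is faster on real workloads; this variant just shows that the generator pattern itself has no dependency on any dataframe library.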
Custom accumulators across chunks — carry state (e.g., a running min/max or custom stats) through the loop and yield it after each chunk; for more complex folds, functools.reduce works directly on a generator.
def running_max_chunks(file_path: str):
    """Generator: yield the current max after each chunk."""
    current_max = float('-inf')
    for chunk in pd.read_csv(file_path, usecols=['value'], chunksize=100_000):
        chunk_max = chunk['value'].max()
        current_max = max(current_max, chunk_max)
        yield current_max

for max_so_far in running_max_chunks('large_data.csv'):
    print(f"Current max: {max_so_far}")
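For accumulators that do not map onto a single built-in, functools.reduce can fold a generator into arbitrary state. A minimal sketch computing count, sum, and max in one lazy pass (the step function and the (count, total, maximum) tuple layout are illustrative choices):

```python
from functools import reduce

def step(state: tuple[int, float, float], value: float) -> tuple[int, float, float]:
    """Fold one value into the accumulator (count, total, maximum)."""
    count, total, maximum = state
    return count + 1, total + value, max(maximum, value)

values = (x * 1.5 for x in range(1, 5))  # lazy: 1.5, 3.0, 4.5, 6.0
count, total, maximum = reduce(step, values, (0, 0.0, float('-inf')))
print(count, total, maximum)  # 4 15.0 6.0
```

One pass, constant memory: the generator is consumed exactly once, so all three statistics come out of a single traversal.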
Real-world pattern: filtered sum across multiple files — aggregate sales or metrics from partitioned CSVs without full load.
def filtered_sum_from_files(file_paths: list[str], column: str = 'sales', min_value: float = 100):
    total = 0.0
    for path in file_paths:
        for chunk in pd.read_csv(path, usecols=[column], chunksize=100_000):
            filtered_sum = chunk[chunk[column] >= min_value][column].sum()
            total += filtered_sum
            yield total  # progress

files = ['sales_2024_01.csv', 'sales_2024_02.csv', 'sales_2024_03.csv']
for running_total in filtered_sum_from_files(files):
    print(f"Running filtered sales sum: {running_total}")
Best practices make generator aggregation safe, efficient, and scalable:
- Use generator expressions — sum(... for ... if ...) — lazy, no intermediate list.
- Prefer Polars lazy aggregation — pl.scan_csv(...).filter(...).select(pl.col('value').sum()).collect() — often 2–10× faster and lower-memory than pandas chunks.
- Use usecols — read only the needed columns.
- Use dtype — downcast early (float32/int32).
- Filter early — discard rows before summing.
- Yield progress — for long-running aggregations.
- Write partial results to disk — Parquet per chunk/file for resumability.
- Monitor memory — psutil.Process().memory_info().rss during the loop.
- Add type hints — def aggregate_chunks(paths: list[str]) -> Iterator[float].
- Handle errors per chunk/file — try/except, log failures.
- Use functools.reduce — for custom accumulators (min/max/mean).
- Use itertools.accumulate — running totals with a generator.
- Use Polars .group_by(...).agg(...) — vectorized, lazy aggregation across files.
- Use tqdm — progress bar for multi-file loops.
- Test aggregators — assert the total is correct on a small subset.
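The itertools.accumulate point above deserves a concrete sketch: it turns any iterable (including a filtering generator) into a lazy stream of running totals, with no manual accumulator variable:

```python
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Running sum of only the even values, computed lazily
evens = (v for v in values if v % 2 == 0)
for running in accumulate(evens):
    print(running)  # 2, 6, 12, 20, 30
```

accumulate also accepts a binary function (e.g., max) as its second argument, which gives running-max progress reporting for free.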
Aggregating with generators computes sums/counts/max/etc. lazily — filter on-the-fly, process large/infinite data with minimal memory. In 2026, prefer generator expressions, Polars lazy filter().sum(), usecols/dtype, progress yielding, and memory monitoring. Master this pattern, and you’ll handle massive datasets scalably, reliably, and with near-zero memory overhead.
Next time you need to aggregate large data — use generators. It’s Python’s cleanest way to say: “Sum/filter/count everything — without ever loading it all.”