Aggregating with generators is a powerful, memory-efficient technique for computing sums, averages, counts, or other aggregates over large datasets, especially when the full data cannot fit into RAM. By using generator expressions or generator functions with yield, you process values lazily (one at a time), apply filtering or transformations on the fly, and feed them directly into built-in aggregators like sum(), sum(1 for ...) (count), or custom reduce functions, avoiding intermediate lists and minimizing memory usage. In 2026, this pattern is essential for big-data ETL, streaming analysis, chunked CSV processing, and large-scale computations in pandas/Polars pipelines: it prevents out-of-memory errors, scales to massive or infinite data, and integrates seamlessly with pd.read_csv(chunksize=...) or Polars lazy evaluation for even greater efficiency.
Here’s a complete, practical guide to aggregating with generators in Python: generator expression aggregation, filtering while summing/counting, chunked CSV aggregation, custom accumulators, real-world patterns, and modern best practices with type hints, memory optimization, Polars lazy equivalents, and performance tips.
Basic generator aggregation — filter and sum lazily without creating a list.
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Sum even numbers using generator expression
even_sum = sum(num for num in numbers if num % 2 == 0)
print(even_sum) # 30
# Count values greater than 5
count_over_5 = sum(1 for num in numbers if num > 5)
print(count_over_5)  # 5
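When the filtering or transformation logic is too involved for a one-line expression, the same lazy aggregation works with a generator function. A minimal sketch (the squares_of_evens name is illustrative, not from any library):

```python
from collections.abc import Iterator

def squares_of_evens(values) -> Iterator[int]:
    """Yield the square of each even value, one at a time."""
    for v in values:
        if v % 2 == 0:
            yield v * v

numbers = range(1, 11)  # any iterable works, including streams
total = sum(squares_of_evens(numbers))
print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```

Because nothing is materialized, the same function can be pointed at a file object or a network stream without code changes.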
Chunked CSV aggregation with generators — process large files in batches, aggregate running totals without full load.
import pandas as pd
from collections.abc import Iterator

def value_sum_chunks(file_path: str, chunksize: int = 100_000) -> Iterator[float]:
    """Generator: yield the running total after each chunk."""
    running_total = 0.0
    for chunk in pd.read_csv(file_path, usecols=['value'], chunksize=chunksize):
        # Filter & sum within the chunk
        filtered_sum = chunk[chunk['value'] > 100]['value'].sum()
        running_total += filtered_sum
        yield running_total  # progress update

# Example: print the running total after each chunk
for total in value_sum_chunks('large_sales.csv'):
    print(f"Running filtered sum: {total}")
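If pandas is not available, the same chunked pattern can be sketched with only the standard library, using csv.DictReader and itertools.islice to pull fixed-size batches; the function name, column name, and threshold here are illustrative:

```python
import csv
import io
from itertools import islice
from collections.abc import Iterator

def value_sum_stdlib(lines, chunksize: int = 2) -> Iterator[float]:
    """Yield a running sum of 'value' entries > 100, reading chunksize rows at a time."""
    reader = csv.DictReader(lines)
    running_total = 0.0
    while True:
        chunk = list(islice(reader, chunksize))  # next batch of rows
        if not chunk:
            break
        running_total += sum(
            float(row['value']) for row in chunk if float(row['value']) > 100
        )
        yield running_total

# Works on any iterable of lines: a file object or, here, an in-memory sample
sample = io.StringIO("value\n50\n150\n200\n120\n")
for total in value_sum_stdlib(sample, chunksize=2):
    print(f"Running filtered sum: {total}")
```

The pandas version is faster on real workloads; this variant just shows that the generator pattern itself has no dependency on any dataframe library.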
Custom accumulators across chunks — carry state (e.g., a running min/max or custom stats) through the loop and yield it after each chunk; for more complex folds, functools.reduce works directly on a generator.
def running_max_chunks(file_path: str):
    """Generator: yield the current max after each chunk."""
    current_max = float('-inf')
    for chunk in pd.read_csv(file_path, usecols=['value'], chunksize=100_000):
        chunk_max = chunk['value'].max()
        current_max = max(current_max, chunk_max)
        yield current_max

for max_so_far in running_max_chunks('large_data.csv'):
    print(f"Current max: {max_so_far}")
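For accumulators that do not map onto a single built-in, functools.reduce can fold a generator into arbitrary state. A minimal sketch computing count, sum, and max in one lazy pass (the step function and the (count, total, maximum) tuple layout are illustrative choices):

```python
from functools import reduce

def step(state: tuple[int, float, float], value: float) -> tuple[int, float, float]:
    """Fold one value into the accumulator (count, total, maximum)."""
    count, total, maximum = state
    return count + 1, total + value, max(maximum, value)

values = (x * 1.5 for x in range(1, 5))  # lazy: 1.5, 3.0, 4.5, 6.0
count, total, maximum = reduce(step, values, (0, 0.0, float('-inf')))
print(count, total, maximum)  # 4 15.0 6.0
```

One pass, constant memory: the generator is consumed exactly once, so all three statistics come out of a single traversal.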
Real-world pattern: filtered sum across multiple files — aggregate sales or metrics from partitioned CSVs without full load.
def filtered_sum_from_files(file_paths: list[str], column: str = 'sales', min_value: float = 100):
    total = 0.0
    for path in file_paths:
        for chunk in pd.read_csv(path, usecols=[column], chunksize=100_000):
            filtered_sum = chunk[chunk[column] >= min_value][column].sum()
            total += filtered_sum
            yield total  # progress

files = ['sales_2024_01.csv', 'sales_2024_02.csv', 'sales_2024_03.csv']
for running_total in filtered_sum_from_files(files):
    print(f"Running filtered sales sum: {running_total}")
Best practices make generator aggregation safe, efficient, and scalable:
- Use generator expressions — sum(... for ... if ...) — lazy, no intermediate list.
- Prefer Polars lazy aggregation — pl.scan_csv(...).filter(...).select(pl.col('value').sum()).collect() — often 2–10× faster and lower-memory than pandas chunks.
- Use usecols — read only the needed columns.
- Use dtype — downcast early (float32/int32).
- Filter early — discard rows before summing.
- Yield progress — for long-running aggregations.
- Write partial results to disk — Parquet per chunk/file for resumability.
- Monitor memory — psutil.Process().memory_info().rss during the loop.
- Add type hints — def aggregate_chunks(paths: list[str]) -> Iterator[float].
- Handle errors per chunk/file — try/except, log failures.
- Use functools.reduce — for custom accumulators (min/max/mean).
- Use itertools.accumulate — running totals with a generator.
- Use Polars .group_by(...).agg(...) — vectorized, lazy aggregation across files.
- Use tqdm — progress bar for multi-file loops.
- Test aggregators — assert the total is correct on a small subset.
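The itertools.accumulate point above deserves a concrete sketch: it turns any iterable (including a filtering generator) into a lazy stream of running totals, with no manual accumulator variable:

```python
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Running sum of only the even values, computed lazily
evens = (v for v in values if v % 2 == 0)
for running in accumulate(evens):
    print(running)  # 2, 6, 12, 20, 30
```

accumulate also accepts a binary function (e.g., max) as its second argument, which gives running-max progress reporting for free.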
Aggregating with generators computes sums/counts/max/etc. lazily — filter on-the-fly, process large/infinite data with minimal memory. In 2026, prefer generator expressions, Polars lazy filter().sum(), usecols/dtype, progress yielding, and memory monitoring. Master this pattern, and you’ll handle massive datasets scalably, reliably, and with near-zero memory overhead.
Next time you need to aggregate large data — use generators. It’s Python’s cleanest way to say: “Sum/filter/count everything — without ever loading it all.”