Managing data with generators is one of the most memory-efficient and scalable approaches to handling large datasets in Python, especially when files exceed available RAM or you are streaming data in real time. Generators (via `yield`) and generator expressions produce values lazily, one at a time, so the full dataset never has to fit in memory. Combined with `pd.read_csv(chunksize=...)`, they let you process gigabyte-scale CSVs, filter/transform/aggregate on the fly, and chain operations without building intermediate lists or full DataFrames. In 2026, this pattern is foundational for big-data ETL, cleaning, feature engineering, and exploratory analysis in pandas/Polars pipelines: it prevents out-of-memory (OOM) errors, can cut peak memory use by 90% or more (since only one chunk is resident at a time), and pairs well with lazy evaluation in Polars for even greater efficiency.
Here’s a complete, practical guide to managing data with generators in Python: generator basics, chunked CSV reading, filtering/processing per chunk, aggregating results, real-world patterns, and modern best practices with type hints, memory optimization, error handling, and Polars lazy equivalents.
Generator basics — yield produces values lazily; functions with yield return generators.
```python
def read_lines(file_path: str):
    """Generator: yield lines one at a time."""
    with open(file_path) as f:
        for line in f:
            yield line.strip()

# Usage: process line by line (near-zero memory overhead)
for line in read_lines('huge_log.txt'):
    if "ERROR" in line:
        print(line)  # only the current line is in memory
```
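The same laziness applies to generator expressions: chaining them builds a pipeline in which no work happens until the final consumer iterates. A minimal self-contained sketch (the sample data stands in for a file):

```python
# Each step is lazy: nothing is computed until list() pulls values through.
lines = ["10", "ERROR 42", "7", "ERROR 99"]           # stands in for a file
errors = (line for line in lines if "ERROR" in line)  # filter lazily
codes = (int(line.split()[1]) for line in errors)     # transform lazily
result = list(codes)                                  # pipeline runs here
print(result)  # [42, 99]
```

Each expression holds only an iterator, not a materialized list, so the peak memory cost is one item per pipeline stage.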
Chunked CSV with generator — wrap pd.read_csv(chunksize=...) in a generator function for clean iteration.
```python
import pandas as pd

def read_csv_chunks(file_path: str, chunksize: int = 100_000):
    """Generator: yield pandas DataFrame chunks."""
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        yield chunk

# Process chunks lazily
for chunk in read_csv_chunks('large_dataset.csv'):
    # Filter/transform in memory (only one chunk at a time)
    filtered = chunk[chunk['value'] > 100]
    # ... compute stats, save to DB, etc.
    print(f"Processed {len(filtered)} rows")
```
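If pandas is not available (or the rows need no DataFrame machinery), the same chunked pattern can be built from the standard library alone. A hedged sketch using `csv.DictReader` plus `itertools.islice`; the demo file and its `value` column are made up for illustration:

```python
import csv
import itertools
import os
import tempfile

def read_dict_chunks(file_path, chunksize=100_000):
    """Generator: yield lists of row dicts, chunksize rows at a time."""
    with open(file_path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(itertools.islice(reader, chunksize))
            if not chunk:
                break  # reader exhausted
            yield chunk

# Demo on a small temporary CSV with 5 data rows
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["value"])
    writer.writerows([[i] for i in range(5)])

sizes = [len(chunk) for chunk in read_dict_chunks(path, chunksize=2)]
print(sizes)  # [2, 2, 1]
```

`islice` pulls exactly `chunksize` rows from the open reader per iteration, so memory stays bounded by one chunk regardless of file size.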
Filtering & processing inside generator — keep only relevant data, discard immediately.
```python
import pandas as pd

def filter_and_transform_chunks(file_path: str, chunksize: int = 100_000):
    """Generator: yield filtered/transformed chunks."""
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        # Filter example (.copy() avoids SettingWithCopyWarning below)
        filtered = chunk[(chunk['category'] == 'A') & (chunk['value'] > 100)].copy()
        # Transform
        filtered['value_doubled'] = filtered['value'] * 2
        if not filtered.empty:
            yield filtered

# Collect or process further
filtered_results = []
for chunk in filter_and_transform_chunks('large_file.csv'):
    filtered_results.append(chunk)

df_final = pd.concat(filtered_results, ignore_index=True)
print(f"Total filtered rows: {len(df_final)}")
```
Real-world pattern: memory-safe aggregation across chunks — compute running totals/stats without full load.
```python
import pandas as pd

def running_stats(file_path: str, chunksize: int = 100_000):
    """Generator: yield cumulative totals/counts after each chunk."""
    total = 0
    count = 0
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        filtered = chunk[chunk['value'] > 100]
        total += filtered['value'].sum()
        count += len(filtered)
        yield {'running_total': total, 'running_count': count}  # progress report

for stats in running_stats('sales_large.csv'):
    if stats['running_count']:  # guard against division by zero
        print(f"Running avg: {stats['running_total'] / stats['running_count']:.2f}")
```
Best practices make generator-based chunk processing safe, efficient, and scalable:

- Use generators everywhere possible: avoid `list()` on large iterables; prefer `for chunk in ...: yield processed_chunk`.
- Modern tip: prefer Polars `scan_csv().filter(...).sink_parquet()`: lazy, streaming, no manual chunking needed.
- Set `chunksize` reasonably: 10k–100k rows balances memory vs. I/O.
- Use `usecols` to read only the needed columns: `pd.read_csv(..., usecols=['date', 'value'])`.
- Use `dtype` to downcast early: `dtype={'value': 'float32'}`.
- Filter early: discard rows before any heavy computation.
- Write per chunk to disk for resumability: `chunk.to_parquet(f'chunk_{i}.parquet')`.
- Monitor memory: check `psutil.Process().memory_info().rss` before/after chunks.
- Add type hints: `def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame`.
- Handle errors per chunk: try/except inside the loop, so one bad chunk doesn't kill the run.
- Use `low_memory=False` to avoid mixed-type warnings (or, better, pass an explicit `dtype`).
- Use `engine='pyarrow'` for faster CSV parsing.
- Use Polars lazy for full power: `pl.scan_csv().filter(...).collect()` or `.sink_*`, often 2–10× faster and lower-memory than pandas chunks.
Chunking with filtering via generators processes large CSVs efficiently — read batch, filter/transform immediately, yield only relevant data. In 2026, prefer Polars lazy scanning/filtering, use usecols/dtype, write per chunk to Parquet, and monitor memory with psutil. Master generator-based chunking & filtering, and you’ll handle massive datasets scalably, reliably, and with minimal memory footprint.
Next time you process a huge CSV — use generators and chunks. It’s Python’s cleanest way to say: “Handle big data piece by piece — keep memory low, speed high.”