Reading many files

Processing multiple files with generators is a memory-efficient, scalable way to work through many large files in Python. Instead of loading each file fully into memory, a generator reads and yields data (lines, chunks, records) one item or one small batch at a time. This helps avoid out-of-memory errors on gigabyte-scale datasets and enables streaming ETL, on-the-fly filtering and transformation, and lazily chained operations. In 2026, the pattern is a staple of big data workflows: reading CSVs, JSON, logs, or database results incrementally, processing in chunks, aggregating results, and integrating with pandas read_csv(chunksize=...), Polars scan_*(), or custom line-by-line generators. Peak memory stays roughly bounded by the size of one chunk rather than the whole dataset, the approach scales to thousands of files, and it pairs well with yield from, itertools, and lazy evaluation.

Here’s a complete, practical guide to managing data from multiple files with generators in Python: directory iteration, line-by-line generators, chunked CSV/JSON reading, filtering/processing on-the-fly, aggregating results, real-world patterns, and modern best practices with type hints, error handling, Polars lazy equivalents, and memory optimization.

Basic directory iteration with generators — yield file paths or contents lazily from a directory.


from pathlib import Path
from typing import Iterator

def txt_files_in_dir(directory: str | Path) -> Iterator[Path]:
    """Generator: yield .txt file paths in directory."""
    dir_path = Path(directory)
    for file_path in dir_path.glob("*.txt"):
        if file_path.is_file():
            yield file_path

# Usage: process each file one at a time
for txt_file in txt_files_in_dir("/path/to/files"):
    print(f"Processing: {txt_file}")
    # Explicit encoding (UTF-8 assumed here) avoids platform-dependent defaults
    with txt_file.open("r", encoding="utf-8") as f:
        content = f.read()  # holds only the current file in memory
        # ... process content ...

Line-by-line generator across multiple files — yield all lines from all files lazily.


def all_lines_from_txt_files(directory: str | Path) -> Iterator[str]:
    """Generator: yield every line from all .txt files."""
    for file_path in txt_files_in_dir(directory):
        with file_path.open("r", encoding="utf-8") as f:  # UTF-8 assumed
            yield from f  # delegate to the file object, which yields one line at a time

# Example: count lines containing "ERROR"
error_count = sum(1 for line in all_lines_from_txt_files("/logs") if "ERROR" in line)
print(f"Total errors: {error_count}")
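
When one unreadable file should not abort the whole run, the line generator can catch errors per file. The sketch below is one possible shape for that, assuming UTF-8 input; the name iter_lines_safely and the (path, line number, text) tuple layout are illustrative choices.

```python
import logging
from pathlib import Path
from typing import Iterable, Iterator

logger = logging.getLogger(__name__)


def iter_lines_safely(paths: Iterable[Path],
                      encoding: str = "utf-8") -> Iterator[tuple[Path, int, str]]:
    """Yield (path, line_number, text) for every line, skipping files that fail."""
    for path in paths:
        try:
            with path.open("r", encoding=encoding) as f:
                for line_no, line in enumerate(f, start=1):
                    yield path, line_no, line.rstrip("\n")
        except (OSError, UnicodeDecodeError) as exc:
            # Lines yielded before a mid-file failure have already been consumed;
            # only the remainder of the failing file is skipped.
            logger.warning("Skipping %s: %s", path, exc)
```

Carrying the path and line number alongside each line makes it easier to trace a suspicious record back to its source file.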

Chunked CSV processing with generator — yield filtered/transformed chunks from multiple files.


import pandas as pd

def process_csv_chunks(file_paths: list[str | Path], chunksize: int = 100_000) -> Iterator[pd.DataFrame]:
    """Generator: yield processed DataFrame chunks from multiple CSVs."""
    for path in file_paths:
        for chunk in pd.read_csv(path, chunksize=chunksize):
            # Filter first, then add the derived column. assign() returns a new
            # DataFrame, which avoids pandas' SettingWithCopyWarning.
            filtered = chunk[chunk["value"] > 100].assign(
                value_doubled=lambda df: df["value"] * 2
            )
            if not filtered.empty:
                yield filtered

# Aggregate all filtered chunks. list() keeps every filtered chunk in memory,
# so this step is only safe when the filtered result is small.
filtered_chunks = list(process_csv_chunks(["sales1.csv", "sales2.csv"]))
df_final = pd.concat(filtered_chunks, ignore_index=True) if filtered_chunks else pd.DataFrame()
print(f"Total filtered rows: {len(df_final)}")
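
The same chunking idea also works without pandas. As a rough sketch using only the standard library, csv.DictReader combined with itertools.islice yields fixed-size batches of rows; the function name csv_row_batches and the default batch size are assumptions made for illustration.

```python
import csv
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator


def csv_row_batches(paths: Iterable[Path],
                    batch_size: int = 1000) -> Iterator[list[dict[str, str]]]:
    """Yield lists of at most `batch_size` rows (as dicts) from each CSV in turn."""
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            while True:
                # islice pulls only the next batch_size rows from the reader.
                batch = list(islice(reader, batch_size))
                if not batch:
                    break
                yield batch
```

Batches never span two files, so the last batch of each file may be shorter than batch_size. All values arrive as strings and need explicit conversion before numeric work.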

Real-world pattern: memory-safe aggregation across many files — compute running totals/stats without full load.


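import pandas as pd

def running_total_from_files(file_paths: list[str | Path], column: str = "value") -> Iterator[float]:
    """Generator: yield the cumulative sum of one column after each chunk."""
    total = 0
    for path in file_paths:
        # usecols reads only the needed column, keeping each chunk small
        for chunk in pd.read_csv(path, usecols=[column], chunksize=100_000):
            total += chunk[column].sum()
            yield total  # progress update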

# Example usage
files = ["data/part1.csv", "data/part2.csv"]
for current_total in running_total_from_files(files):
    print(f"Running total: {current_total}")
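
A running total is one case of a statistic that can be updated one value at a time. The sketch below extends the idea to count, minimum, maximum, and mean in a single pass, using only the standard library. RunningStats, column_values, and summarize are hypothetical names, and the code assumes every value in the column parses as a float.

```python
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator


@dataclass
class RunningStats:
    """Summary statistics that need only constant memory."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def column_values(paths: Iterable[Path], column: str) -> Iterator[float]:
    """Yield one column's values as floats, file by file, row by row."""
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield float(row[column])


def summarize(values: Iterable[float]) -> RunningStats:
    """Consume the stream once and return the accumulated statistics."""
    stats = RunningStats()
    for value in values:
        stats.update(value)
    return stats
```

Because summarize accepts any iterable of floats, the same accumulator works with values coming from a pandas chunk loop or any other generator.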

Best practices make multi-file generator processing safe, efficient, and scalable.

Finding and reading files: use pathlib with Path.glob(), or os.scandir(), for clean path handling and efficient directory iteration. Prefer yield from for clean delegation to sub-iterators and file objects. Add type hints, for example def read_chunks(paths: list[Path]) -> Iterator[pd.DataFrame].

Keeping memory low: pass usecols and dtype so you read only the columns you need and downcast types. Filter early so rows are discarded before heavy computation. Write results to disk per file or per chunk (Parquet or CSV) so an interrupted run can resume. Monitor memory by comparing psutil.Process().memory_info().rss before and after each file.

Robustness and speed: handle errors per file with try/except inside the loop, and log the failures. Use concurrent.futures to read files in parallel when the work is I/O-bound. Use tqdm for a progress bar on long-running multi-file loops. Test generators by consuming them partially and asserting that the results are correct.

Modern tip: prefer Polars pl.scan_csv(file_paths), which lazily concatenates multiple files so you can filter once with no manual chunking. For an eager multi-file read, pl.read_csv accepts a glob pattern such as "data/*.csv".
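
The concurrent.futures suggestion can be sketched as follows for an I/O-bound task, here counting matching lines per file. The function names and the max_workers default are illustrative assumptions, and threads mainly help when the time is spent waiting on disk or network rather than on CPU.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from pathlib import Path
from typing import Iterable, Iterator


def count_matching_lines(path: Path, needle: str) -> int:
    """Stream one file and count the lines that contain `needle`."""
    with path.open("r", encoding="utf-8", errors="replace") as f:
        return sum(1 for line in f if needle in line)


def count_per_file(paths: Iterable[Path], needle: str,
                   max_workers: int = 4) -> Iterator[tuple[Path, int]]:
    """Yield (path, count) pairs in input order, reading files in parallel."""
    path_list = list(paths)  # only the paths are materialized, not file contents
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as its inputs.
        counts = pool.map(partial(count_matching_lines, needle=needle), path_list)
        yield from zip(path_list, counts)
```

Each worker still streams its file line by line, so peak memory grows with the number of workers rather than with file size.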

Managing multiple files with generators lets you process large datasets efficiently: yield lines or chunks lazily, filter and transform on the fly, and aggregate without loading everything at once. In 2026, prefer Polars scan_csv(file_paths) for lazy multi-file handling, use usecols/dtype, write incrementally to Parquet, and monitor memory with psutil. Master multi-file generators, and you’ll be able to handle datasets far larger than memory, scalably and reliably.

Next time you have many large files — use generators to process them. It’s Python’s cleanest way to say: “Handle big data one piece at a time — across all files.”
