Computing the fraction of long trips with generators is a perfect example of memory-efficient, streaming data analysis in Python — ideal when trip duration data is in a large CSV (or any file) that won’t fit in RAM. By reading the file line-by-line with a generator, parsing durations on-the-fly, filtering long trips (> threshold), and yielding the running fraction after each valid row, you compute the statistic progressively with almost constant memory usage. In 2026, this streaming + generator pattern remains essential for big data ETL, real-time analytics, and scalable processing — especially when combined with pandas read_csv(chunksize=...), Polars lazy scanning, or custom line generators for gigabyte-scale logs, sensor data, or transaction records.
Here’s a complete, practical guide to computing running fractions (e.g., long trips) with generators in Python: line-by-line generator, running fraction calculation, handling errors/edge cases, chunked CSV version, real-world patterns, and modern best practices with type hints, Polars lazy equivalents, and performance/memory optimization.
Basic line-by-line generator + running fraction — yield fraction after each valid row.
def read_data(filename: str):
    """Generator: yield parsed rows (list of strings) from file."""
    with open(filename, 'r') as f:
        for line in f:
            yield line.strip().split(',')
def long_trip_fraction(data, threshold: float = 60.0):
    """Generator: yield running fraction of trips longer than threshold."""
    num_long = 0
    num_total = 0
    for row in data:
        try:
            duration = float(row[0])  # assume first column is duration
        except (ValueError, IndexError):
            continue  # skip bad/invalid rows
        num_total += 1
        if duration > threshold:
            num_long += 1
        yield num_long / num_total  # num_total >= 1 here, so no zero-division guard needed
# Example usage
data_gen = read_data('trips.csv')
fractions_gen = long_trip_fraction(data_gen, threshold=60.0)
for fraction in fractions_gen:
    print(f"Running fraction of long trips: {fraction:.4f}")
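If only the final statistic matters, the running generator can be drained cheaply: collections.deque with maxlen=1 keeps just the last yielded value without storing the whole history. A minimal sketch, using a hypothetical in-memory row source in place of trips.csv (the row values are illustrative):

```python
from collections import deque

def running_fraction(rows, threshold=60.0):
    """Condensed version of the running-fraction generator above."""
    num_long = num_total = 0
    for row in rows:
        try:
            duration = float(row[0])
        except (ValueError, IndexError):
            continue  # skip bad/invalid rows
        num_total += 1
        num_long += duration > threshold  # bool adds as 0 or 1
        yield num_long / num_total

# Hypothetical rows standing in for parsed lines of trips.csv
rows = [['30.0'], ['90.0'], ['bad'], ['120.0']]

# deque(maxlen=1) consumes the generator but retains only the last yield
last = deque(running_fraction(rows), maxlen=1)
final_fraction = last[0] if last else 0.0
print(final_fraction)  # 2 long trips out of 3 valid rows -> 2/3
```

The deque trick works with any generator and never holds more than one value, so it preserves the constant-memory property of the streaming pattern.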
Chunked CSV version — combine pandas chunking with generator for more structured data.
import pandas as pd
def long_trip_fraction_chunks(file_path: str, duration_col: str = 'duration',
                              threshold: float = 60.0, chunksize: int = 100_000):
    """Generator: yield running fraction after each chunk."""
    num_long = 0
    num_total = 0
    for chunk in pd.read_csv(file_path, usecols=[duration_col], chunksize=chunksize):
        # Coerce to numeric once: bad values become NaN and are dropped,
        # and the comparison below never runs against an object-dtype column
        durations = pd.to_numeric(chunk[duration_col], errors='coerce').dropna()
        num_total += len(durations)
        num_long += int((durations > threshold).sum())
        yield num_long / num_total if num_total > 0 else 0.0
# Usage
for fraction in long_trip_fraction_chunks('large_trips.csv'):
    print(f"Running fraction: {fraction:.4f}")
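A chunked implementation is easiest to trust after a sanity check against a tiny synthetic file where the answer is known. A self-contained sketch (the file contents and tiny chunksize are illustrative, not real data); note that durations are coerced to numeric before the comparison, so object-dtype chunks containing bad values do not raise a TypeError:

```python
import csv
import os
import tempfile

import pandas as pd

def long_trip_fraction_chunks(file_path, duration_col='duration',
                              threshold=60.0, chunksize=2):
    """Yield the running fraction of long trips after each chunk."""
    num_long = 0
    num_total = 0
    for chunk in pd.read_csv(file_path, usecols=[duration_col], chunksize=chunksize):
        durations = pd.to_numeric(chunk[duration_col], errors='coerce').dropna()
        num_total += len(durations)
        num_long += int((durations > threshold).sum())
        yield num_long / num_total if num_total else 0.0

# Hypothetical test data: 4 valid durations, 2 above the 60-minute threshold
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['duration'])
    for value in ['30', '90', 'oops', '45', '120']:
        writer.writerow([value])
    path = f.name

fractions = list(long_trip_fraction_chunks(path, chunksize=2))
os.unlink(path)
print(fractions[-1])  # expect 2/4 = 0.5
```

Three chunks of size 2 produce three running fractions, and the last one must equal the exact answer computed by hand.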
Real-world pattern: fraction of long trips across multiple partitioned files — aggregate streaming across daily/monthly CSVs.
def long_trip_fraction_multi_files(file_paths: list[str], duration_col: str = 'duration',
                                   threshold: float = 60.0):
    """Generator: yield running fraction across several partitioned CSVs."""
    num_long = 0
    num_total = 0
    for path in file_paths:
        for chunk in pd.read_csv(path, usecols=[duration_col], chunksize=100_000):
            durations = pd.to_numeric(chunk[duration_col], errors='coerce').dropna()
            num_total += len(durations)
            num_long += int((durations > threshold).sum())
            yield num_long / num_total if num_total > 0 else 0.0
files = ['trips_2024_01.csv', 'trips_2024_02.csv', 'trips_2024_03.csv']
for fraction in long_trip_fraction_multi_files(files):
    print(f"Running fraction of long trips: {fraction:.4f}")
Best practices make generator-based fraction computation safe, efficient, and scalable:
- Use generator expressions or yield; avoid materializing full lists.
- Modern tip: prefer Polars lazy scanning. A boolean mean computes the fraction directly: pl.scan_csv(path).select((pl.col(duration_col) > threshold).mean()).collect() is often faster and lower-memory. (Filtering first and then counting would always give 1, since the count and length of the filtered frame are equal.)
- Use usecols + dtype: read only the needed columns and downcast (e.g., float32).
- Handle parsing errors with pd.to_numeric(..., errors='coerce') followed by .dropna() or .notna().
- Yield after each chunk/row for progress monitoring.
- Write partial results, e.g., append the running fraction to a log file.
- Monitor memory with psutil.Process().memory_info().rss during the loop.
- Add type hints: def fraction_gen(data: Iterable[list[str]], threshold: float) -> Iterator[float].
- Use tqdm for a progress bar on long files.
- Use itertools.accumulate for running totals if needed.
- Use Polars cum_sum() or with_row_index() for running stats in lazy mode.
- Test on a small subset; assert the final fraction is correct.
- Call gc.collect() after large chunks to force cleanup if needed.
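The itertools.accumulate tip can be sketched as follows, assuming durations have already been parsed into floats (the values here are hypothetical):

```python
from itertools import accumulate

durations = [30.0, 90.0, 45.0, 120.0]  # hypothetical parsed durations
threshold = 60.0

# accumulate sums the booleans into a running count of long trips;
# enumerate(..., start=1) supplies the running total of trips seen
long_flags = (d > threshold for d in durations)
running = [count / i for i, count in enumerate(accumulate(long_flags), start=1)]
print(running)  # [0.0, 0.5, 0.333..., 0.5]
```

This expresses the running-fraction logic in two lines, but note it only works once the bad rows have already been filtered out; the explicit generator above handles parsing errors as part of the stream.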
Computing running fractions with generators processes large trip data efficiently: filter durations on the fly, yield the fraction after each row or chunk, and keep memory usage near constant. In 2026, prefer generator expressions, Polars lazy select((pl.col('duration') > threshold).mean()), usecols/dtype, error handling, and progress yielding. Master this pattern, and you'll compute statistics on massive or infinite data scalably and reliably.
Next time you need fraction of long trips or any running statistic — use generators. It’s Python’s cleanest way to say: “Compute as I go — never load it all.”