Computing the fraction of long trips with generators is a perfect example of memory-efficient, streaming data analysis in Python — ideal when trip duration data is in a large CSV (or any file) that won’t fit in RAM. By reading the file line-by-line with a generator, parsing durations on-the-fly, filtering long trips (> threshold), and yielding the running fraction after each valid row, you compute the statistic progressively with almost constant memory usage. In 2026, this streaming + generator pattern remains essential for big data ETL, real-time analytics, and scalable processing — especially when combined with pandas read_csv(chunksize=...), Polars lazy scanning, or custom line generators for gigabyte-scale logs, sensor data, or transaction records.
Here’s a complete, practical guide to computing running fractions (e.g., long trips) with generators in Python: line-by-line generator, running fraction calculation, handling errors/edge cases, chunked CSV version, real-world patterns, and modern best practices with type hints, Polars lazy equivalents, and performance/memory optimization.
Basic line-by-line generator + running fraction — yield fraction after each valid row.
def read_data(filename: str):
    """Generator: yield parsed rows (list of strings) from file."""
    with open(filename, 'r') as f:
        for line in f:
            yield line.strip().split(',')
def long_trip_fraction(data, threshold: float = 60.0):
    """Generator: yield running fraction of trips longer than threshold."""
    num_long = 0
    num_total = 0
    for row in data:
        try:
            duration = float(row[0])  # assume first column is duration
        except (ValueError, IndexError):
            continue  # skip bad/invalid rows
        num_total += 1
        if duration > threshold:
            num_long += 1
        yield num_long / num_total  # num_total >= 1 here, so no zero-division guard needed
# Example usage
data_gen = read_data('trips.csv')
fractions_gen = long_trip_fraction(data_gen, threshold=60.0)
for fraction in fractions_gen:
    print(f"Running fraction of long trips: {fraction:.4f}")
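If only the final statistic matters, the running generator can be drained cheaply: collections.deque with maxlen=1 keeps just the last yielded value without storing the whole history. A minimal sketch, using a hypothetical in-memory row source in place of trips.csv (the row values are illustrative):

```python
from collections import deque

def running_fraction(rows, threshold=60.0):
    """Condensed version of the running-fraction generator above."""
    num_long = num_total = 0
    for row in rows:
        try:
            duration = float(row[0])
        except (ValueError, IndexError):
            continue  # skip bad/invalid rows
        num_total += 1
        num_long += duration > threshold  # bool adds as 0 or 1
        yield num_long / num_total

# Hypothetical rows standing in for parsed lines of trips.csv
rows = [['30.0'], ['90.0'], ['bad'], ['120.0']]

# deque(maxlen=1) consumes the generator but retains only the last yield
last = deque(running_fraction(rows), maxlen=1)
final_fraction = last[0] if last else 0.0
print(final_fraction)  # 2 long trips out of 3 valid rows -> 2/3
```

The deque trick works with any generator and never holds more than one value, so it preserves the constant-memory property of the streaming pattern.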
Chunked CSV version — combine pandas chunking with generator for more structured data.
import pandas as pd
def long_trip_fraction_chunks(file_path: str, duration_col: str = 'duration',
                              threshold: float = 60.0, chunksize: int = 100_000):
    """Generator: yield running fraction after each chunk."""
    num_long = 0
    num_total = 0
    for chunk in pd.read_csv(file_path, usecols=[duration_col], chunksize=chunksize):
        # Coerce to numeric once: bad values become NaN and are dropped,
        # and the comparison below never runs against an object-dtype column
        durations = pd.to_numeric(chunk[duration_col], errors='coerce').dropna()
        num_total += len(durations)
        num_long += int((durations > threshold).sum())
        yield num_long / num_total if num_total > 0 else 0.0
# Usage
for fraction in long_trip_fraction_chunks('large_trips.csv'):
    print(f"Running fraction: {fraction:.4f}")
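A chunked implementation is easiest to trust after a sanity check against a tiny synthetic file where the answer is known. A self-contained sketch (the file contents and tiny chunksize are illustrative, not real data); note that durations are coerced to numeric before the comparison, so object-dtype chunks containing bad values do not raise a TypeError:

```python
import csv
import os
import tempfile

import pandas as pd

def long_trip_fraction_chunks(file_path, duration_col='duration',
                              threshold=60.0, chunksize=2):
    """Yield the running fraction of long trips after each chunk."""
    num_long = 0
    num_total = 0
    for chunk in pd.read_csv(file_path, usecols=[duration_col], chunksize=chunksize):
        durations = pd.to_numeric(chunk[duration_col], errors='coerce').dropna()
        num_total += len(durations)
        num_long += int((durations > threshold).sum())
        yield num_long / num_total if num_total else 0.0

# Hypothetical test data: 4 valid durations, 2 above the 60-minute threshold
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['duration'])
    for value in ['30', '90', 'oops', '45', '120']:
        writer.writerow([value])
    path = f.name

fractions = list(long_trip_fraction_chunks(path, chunksize=2))
os.unlink(path)
print(fractions[-1])  # expect 2/4 = 0.5
```

Three chunks of size 2 produce three running fractions, and the last one must equal the exact answer computed by hand.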
Real-world pattern: fraction of long trips across multiple partitioned files — aggregate streaming across daily/monthly CSVs.
def long_trip_fraction_multi_files(file_paths: list[str], duration_col: str = 'duration',
                                   threshold: float = 60.0):
    """Generator: yield running fraction across several partitioned CSVs."""
    num_long = 0
    num_total = 0
    for path in file_paths:
        for chunk in pd.read_csv(path, usecols=[duration_col], chunksize=100_000):
            durations = pd.to_numeric(chunk[duration_col], errors='coerce').dropna()
            num_total += len(durations)
            num_long += int((durations > threshold).sum())
            yield num_long / num_total if num_total > 0 else 0.0
files = ['trips_2024_01.csv', 'trips_2024_02.csv', 'trips_2024_03.csv']
for fraction in long_trip_fraction_multi_files(files):
    print(f"Running fraction of long trips: {fraction:.4f}")
Best practices make generator-based fraction computation safe, efficient, and scalable:
- Use generator expressions or yield; avoid materializing full lists.
- Modern tip: prefer Polars lazy scanning. A boolean mean computes the fraction directly: pl.scan_csv(path).select((pl.col(duration_col) > threshold).mean()).collect() is often faster and lower-memory. (Filtering first and then counting would always give 1, since the count and length of the filtered frame are equal.)
- Use usecols + dtype: read only the needed columns and downcast (e.g., float32).
- Handle parsing errors with pd.to_numeric(..., errors='coerce') followed by .dropna() or .notna().
- Yield after each chunk/row for progress monitoring.
- Write partial results, e.g., append the running fraction to a log file.
- Monitor memory with psutil.Process().memory_info().rss during the loop.
- Add type hints: def fraction_gen(data: Iterable[list[str]], threshold: float) -> Iterator[float].
- Use tqdm for a progress bar on long files.
- Use itertools.accumulate for running totals if needed.
- Use Polars cum_sum() or with_row_index() for running stats in lazy mode.
- Test on a small subset; assert the final fraction is correct.
- Call gc.collect() after large chunks to force cleanup if needed.
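The itertools.accumulate tip can be sketched as follows, assuming durations have already been parsed into floats (the values here are hypothetical):

```python
from itertools import accumulate

durations = [30.0, 90.0, 45.0, 120.0]  # hypothetical parsed durations
threshold = 60.0

# accumulate sums the booleans into a running count of long trips;
# enumerate(..., start=1) supplies the running total of trips seen
long_flags = (d > threshold for d in durations)
running = [count / i for i, count in enumerate(accumulate(long_flags), start=1)]
print(running)  # [0.0, 0.5, 0.333..., 0.5]
```

This expresses the running-fraction logic in two lines, but note it only works once the bad rows have already been filtered out; the explicit generator above handles parsing errors as part of the stream.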
Computing running fractions with generators processes large trip data efficiently: filter durations on the fly, yield the fraction after each row or chunk, and keep memory usage near constant. In 2026, prefer generator expressions, Polars lazy select((pl.col('duration') > threshold).mean()), usecols/dtype, error handling, and progress yielding. Master this pattern, and you'll compute statistics on massive or infinite data scalably and reliably.
Next time you need fraction of long trips or any running statistic — use generators. It’s Python’s cleanest way to say: “Compute as I go — never load it all.”