Sequences to bags is a simple yet powerful way to turn Python iterables (lists, generators, ranges, sets, etc.) into Dask Bags, enabling parallel map/filter/reduce operations on sequences that benefit from multi-core or distributed processing. In 2026, db.from_sequence() remains the go-to method for converting in-memory or generated data into a parallelizable, chunked Bag: perfect for processing lists of file paths, numbers, records, or custom objects in parallel across cores or clusters, with minimal overhead and full integration with Dask’s ecosystem (delayed, DataFrames, Arrays, xarray). One caveat: from_sequence embeds the items directly in the task graph, so it suits modest sequences (paths, parameters, records); for very large raw data, load it inside tasks with db.read_text() or dask.delayed.
Here’s a complete, practical guide to converting sequences to Dask Bags: basic from_sequence usage, handling generators/ranges/lists, partitioning control, real-world patterns (file paths, numbers, records), and modern best practices with chunking, parallelism, visualization, distributed execution, and Polars equivalents.
Basic conversion — from list, range, generator to Dask Bag.
import dask.bag as db
# From list
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bag_list = db.from_sequence(numbers)
print(bag_list) # dask.bag, partitions=10 (default)
# From range (avoids building a big list yourself; note that from_sequence
# still materializes the items into partitions)
bag_range = db.from_sequence(range(1_000_000))
print(bag_range.count().compute()) # 1000000 (parallel count)
# From generator (convenient, but consumed eagerly when partitions are built)
def gen_records():
    for i in range(1000):
        yield {'id': i, 'value': i * 2}
bag_gen = db.from_sequence(gen_records())
print(bag_gen.map(lambda r: r['value']).sum().compute()) # parallel sum
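The snippets above cover the map and sum legs; as a sketch of the reduce leg, the built-in frequencies() reduction counts occurrences of each value in parallel (the record layout here mirrors gen_records above):

```python
import dask.bag as db

# Same shape of records as gen_records() above, built as a list for clarity
records = [{'id': i, 'value': i * 2} for i in range(1000)]
bag = db.from_sequence(records, npartitions=10)

# frequencies() is a built-in grouped count: here, records by parity of id
parity_counts = dict(bag.map(lambda r: r['id'] % 2).frequencies().compute())
print(parity_counts)  # {0: 500, 1: 500}
```

frequencies() returns (value, count) pairs, so wrapping the result in dict() gives a plain lookup table.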
Partitioning control — choose number of partitions for parallelism.
# Explicit partitions (more = more parallelism, smaller chunks)
bag_partitioned = db.from_sequence(range(1_000_000), npartitions=100)
print(bag_partitioned.npartitions) # 100
# From list with custom partitions
bag_custom = db.from_sequence([f'data/file_{i:03d}.txt' for i in range(500)], npartitions=50)
print(bag_custom.npartitions) # 50
# Auto-partitioning (with no arguments, Dask picks a partition size itself,
# aiming for roughly 100 partitions on large sequences)
large_list = list(range(100_000))  # any in-memory sequence
bag_auto = db.from_sequence(large_list)
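As an alternative to npartitions, from_sequence also accepts partition_size, the number of items to place in each partition (an integer count, not a byte size); a minimal sketch:

```python
import dask.bag as db

# partition_size counts ITEMS per partition: 10_000 items in 1_000-item
# chunks yields 10 partitions
bag = db.from_sequence(range(10_000), partition_size=1_000)
print(bag.npartitions)  # 10
```

Use npartitions when you want a fixed degree of parallelism, and partition_size when you want a fixed amount of work per task.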
Real-world pattern: processing large list of files or records in parallel — common in earthquake metadata or log analysis.
import json

# List of earthquake JSON files (e.g., daily exports)
quake_files = [f'quakes/day_{i:03d}.json' for i in range(365)]

def load_events(path):
    with open(path) as f:
        return json.load(f)  # each file holds a list of event dicts

# Bag of file paths -> map to load & parse -> filter strong events
bag_paths = db.from_sequence(quake_files, npartitions=50)
strong_events = (
    bag_paths
    .map(load_events)  # parse each file into a list of events
    .flatten()         # flatten the per-file lists into one bag of events
    .filter(lambda e: e.get('mag', 0) >= 6.0)
    .map(lambda e: {'time': e['time'], 'mag': e['mag'], 'lat': e['latitude'], 'lon': e['longitude']})
)
# Count strong events
print(strong_events.count().compute())
# Take sample for inspection
sample = strong_events.take(5)
print(sample)
# Convert to Dask DataFrame for further analysis
ddf_strong = strong_events.to_dataframe().persist()
print(ddf_strong.head())
Best practices for sequences to Dask Bags.
Use npartitions wisely: 10–100 for most machines, more for clusters.
Modern tip: use Polars for columnar/structured data (pl.scan_ndjson('files/*.jsonl') is faster for JSONL); reserve Bags for truly unstructured/heterogeneous sequences.
Prefer generators and ranges over huge materialized lists when building sequences.
Use db.read_text() when tasks should process file contents (e.g., parse per line); map over a bag of paths, as above, to parse whole files.
Visualize the graph: strong_events.visualize() to debug.
Persist hot bags: strong_events.persist() for repeated operations.
Use the distributed client: Client() for clusters.
Add type hints: @delayed def parse_file(path: str) -> list[dict].
Monitor the dashboard: memory, tasks, progress.
Avoid large reductions early: filter/map first.
Use flatten() after mapping each file to a list of records.
Use pluck() to extract fields efficiently.
Use to_dataframe() to convert to a Dask DataFrame once tabular structure emerges.
Use db.from_sequence(..., partition_size=10_000) to control chunk size (partition_size is an item count per partition, not a byte size).
Profile with timeit: compare bag pipelines against pandas loops.
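Two of the tips above (db.read_text() for line-oriented input, pluck() for field extraction) can be sketched together; the JSONL file written here is invented purely so the demo is self-contained:

```python
import json
import os
import tempfile
import dask.bag as db

# Invent a tiny JSONL dataset: one event per line (hypothetical layout)
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, 'day_000.jsonl'), 'w') as f:
    for mag in (4.5, 6.2, 7.1):
        f.write(json.dumps({'mag': mag, 'time': '2026-01-01'}) + '\n')

# read_text yields one bag element per line; pluck extracts a field per record
events = db.read_text(os.path.join(tmpdir, '*.jsonl')).map(json.loads)
strong_mags = events.filter(lambda e: e.get('mag', 0) >= 6.0).pluck('mag')
print(sorted(strong_mags.compute()))  # [6.2, 7.1]
```

Compared with the from_sequence-of-paths pattern above, read_text moves file I/O into the workers, so the driver never touches the raw data.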
Converting sequences to Dask Bags with db.from_sequence() enables parallel processing of lists, generators, and ranges: control partitions, map/filter/reduce lazily, visualize graphs, and transition to DataFrames when needed. In 2026, use npartitions wisely, persist intermediates, prefer Polars for structured data, and monitor the dashboard. Master sequences to bags, and you’ll parallelize any iterable efficiently and scalably.
Next time you have a large list or generator — turn it into a Dask Bag. It’s Python’s cleanest way to say: “Make this sequence parallel — map/filter/reduce it across cores.”