Sequences to Bags with Dask in Python 2026 – Best Practices
Dask Bags are excellent for processing sequences of Python objects such as lists, tuples, or custom records. Converting a Python sequence (or generator) into a Dask Bag enables parallel and distributed processing with minimal memory overhead.
TL;DR — Key Methods
- `db.from_sequence()` — convert a Python list, tuple, or generator into a Dask Bag
- `db.from_delayed()` — convert delayed objects into a Bag
- Use `.map()`, `.filter()`, and `.pluck()` for transformations
- Convert to a Dask DataFrame when structure emerges
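As a quick illustration of the transformation methods listed above, here is a minimal sketch (the toy records are invented for the example):

```python
import dask.bag as db

# Tiny records to illustrate the core Bag transformations
bag = db.from_sequence(
    [{"id": 1, "x": 10}, {"id": 2, "x": 25}, {"id": 3, "x": 40}],
    npartitions=2,
)

doubled = bag.map(lambda r: {**r, "x": r["x"] * 2})  # transform each record
big = doubled.filter(lambda r: r["x"] > 30)          # keep matching records
xs = big.pluck("x")                                  # extract a single field

print(xs.compute())  # [50, 80]
```

Nothing runs until `.compute()` is called; each step only extends the task graph.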
1. Basic Sequence to Bag
```python
import dask.bag as db

# From a Python list
data = [
    {"id": i, "value": i * 10, "region": "EU" if i % 3 == 0 else "US"}
    for i in range(100_000)
]
bag = db.from_sequence(data, npartitions=50)
print("Number of partitions:", bag.npartitions)

# Parallel transformations: groupby yields (key, list-of-records) pairs
result = (
    bag.filter(lambda x: x["value"] > 500)
    .groupby(lambda x: x["region"])
    .map(lambda group: {"region": group[0],
                        "total": sum(item["value"] for item in group[1])})
    .compute()
)
print(result)
```
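A note on the `groupby` step: it shuffles whole groups between partitions, which is expensive. For simple reductions like this sum, the Dask documentation recommends `foldby`, which accumulates per-partition partial results and then combines them. A sketch of the same aggregation with `foldby`, using the same toy data:

```python
import dask.bag as db

data = [
    {"id": i, "value": i * 10, "region": "EU" if i % 3 == 0 else "US"}
    for i in range(100_000)
]
bag = db.from_sequence(data, npartitions=50)

# foldby computes per-partition partial sums, then merges them,
# avoiding the full shuffle that groupby performs
totals = bag.filter(lambda x: x["value"] > 500).foldby(
    key=lambda x: x["region"],
    binop=lambda acc, x: acc + x["value"],  # fold one record into a running total
    initial=0,
    combine=lambda a, b: a + b,             # merge partial totals across partitions
    combine_initial=0,
).compute()
print(dict(totals))
```

`foldby` returns `(key, result)` pairs, so the output here is a list of two tuples rather than a list of dicts.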
2. From Generator to Bag (Memory Efficient)
```python
import dask.bag as db

def data_generator():
    for i in range(500_000):
        yield {"id": i, "amount": i * 2.5, "category": "A" if i % 5 == 0 else "B"}

# Convert the generator to a Dask Bag
bag = db.from_sequence(data_generator(), npartitions=100)

# Perform parallel aggregation
summary = (
    bag.filter(lambda x: x["amount"] > 1000)
    .groupby(lambda x: x["category"])
    .map(lambda g: {"category": g[0], "total": sum(x["amount"] for x in g[1])})
    .compute()
)
print(summary)
```
One caveat: `db.from_sequence` consumes the generator eagerly on the client, so the full sequence is held in local memory while it is partitioned. The memory savings come from the parallel, partition-at-a-time processing that follows, not from lazy loading. For data too large to materialize locally, build the Bag lazily with `db.from_delayed` or a reader such as `db.read_text`.
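For data that genuinely cannot be materialized on the client, `db.from_delayed` lets each partition be produced lazily on a worker. A minimal sketch, with a hypothetical `load_chunk` function standing in for real I/O:

```python
import dask
import dask.bag as db

# Hypothetical chunk loader (illustrative only): each delayed call builds
# one partition's records, so no single process holds the full dataset.
@dask.delayed
def load_chunk(start, stop):
    return [{"id": i, "amount": i * 2.5} for i in range(start, stop)]

chunks = [load_chunk(i, i + 50_000) for i in range(0, 500_000, 50_000)]
bag = db.from_delayed(chunks)  # one partition per delayed object

n = bag.count().compute()
print("records:", n)  # 500000
```

Each delayed object must return a list (one partition's contents); the Bag is assembled from those partitions without evaluating them up front.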
3. Best Practices in 2026
- Use `db.from_sequence()` for lists or generators
- Choose an appropriate `npartitions` based on data size and available cores
- Filter early with `.filter()` to reduce data volume
- Convert to a Dask DataFrame once the data has a clear tabular structure
- Monitor the Dask Dashboard to see how partitions are being processed
Conclusion
Converting sequences and generators into Dask Bags is a powerful pattern for memory-efficient parallel processing. In 2026, this approach is commonly used for log processing, JSON data, and any irregular or streaming data sources that don’t fit neatly into DataFrames.
Next steps:
- Try converting one of your generator-based or list-based processing scripts to a Dask Bag