Building Dask Bags with globbing is a flexible, powerful way to handle large-scale non-tabular or semi-structured data (text files, JSON, log lines, binary blobs, custom records) in parallel, especially when data doesn’t fit neatly into DataFrames or Arrays. Dask Bags treat each file or line as an independent element, allowing map/filter/reduce operations across millions of items without loading everything into memory. Globbing with read_text() makes it easy to process partitioned directories or cloud storage. In 2026, Dask Bags remain essential for ETL on raw logs, text corpora, JSONL datasets, sensor streams, or pre-processed earthquake catalogs, complementing Dask DataFrames (tabular) and Arrays (n-D numerical) while scaling to clusters with minimal code.
Here’s a complete, practical guide to building Dask Bags with globbing: creation from files/text, map/filter/reduce operations, real-world patterns (log analysis, JSONL processing, text corpus, earthquake metadata), and modern best practices with partition control, parallelism, visualization, distributed execution, and Polars/xarray equivalents.
Creating Dask Bags with globbing — load files or lines lazily from directories/cloud.
import dask.bag as db
# Glob all text/JSON files in a directory
bag_files = db.read_text('logs/*.log') # each line is an element
bag_jsonl = db.read_text('data/*.jsonl') # each line is a JSON string
# From an explicit file list or glob
from pathlib import Path
files = ['file1.txt', 'file2.txt', 'file3.txt']
bag_from_list = db.from_sequence(files).map(lambda p: Path(p).read_text().splitlines()).flatten()
# Whole-file processing (db.from_filenames is no longer part of Dask;
# glob + from_sequence is the modern replacement)
import glob
bag_whole = db.from_sequence(glob.glob('earthquakes/*.json')).map(lambda p: (p, open(p).read()))
print(bag_whole)  # dask.bag, each item is (path, file content)
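Partitioning matters when globbing many small files: read_text() exposes blocksize and files_per_partition to control it. A minimal, self-contained sketch (the temporary files stand in for a real partitioned directory; assumes dask is installed):

```python
# Sketch: controlling how a globbed Bag is partitioned.
import json
import os
import tempfile

import dask.bag as db

tmp = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmp, f"part{i}.jsonl"), "w") as f:
        for j in range(2):
            f.write(json.dumps({"mag": 5.0 + i + j}) + "\n")

# Default: one partition per globbed file
bag = db.read_text(os.path.join(tmp, "*.jsonl"))
# files_per_partition groups small files to cut scheduler overhead
grouped = db.read_text(os.path.join(tmp, "*.jsonl"), files_per_partition=3)

n_default = bag.npartitions      # 3 files -> 3 partitions
n_grouped = grouped.npartitions  # 3 files grouped into 1 partition
total = bag.count().compute()    # 6 lines across all files
```

Grouping files per partition (or setting blocksize to split big files) keeps partitions at a healthy size, which is usually the first knob to turn when a globbed bag runs slowly.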
Common bag operations — map/filter/reduce, similar to Python iterables but parallelized.
# Map: parse JSON lines
import json

parsed = bag_jsonl.map(json.loads)
# Filter: keep only high-magnitude events
strong = parsed.filter(lambda event: event.get('mag', 0) >= 7.0)
# Map: extract features
features = strong.map(lambda e: {
    'time': e['time'],
    'mag': e['mag'],
    'lat': e['latitude'],
    'lon': e['longitude'],
    'depth': e['depth']
})
# Reduce: count total strong events
count = strong.count().compute()
print(f"Strong earthquakes: {count}")
# Aggregate: collect all magnitudes
mags = strong.map(lambda e: e['mag']).compute()
print(f"Mean magnitude: {sum(mags)/len(mags):.2f}")
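For grouped reductions (e.g. counting events per region), the Bag API provides foldby. A minimal sketch on a small in-memory sample, since the earthquake files above are illustrative:

```python
import dask.bag as db

# Hypothetical in-memory events standing in for parsed JSON lines
events = db.from_sequence([
    {"place": "Japan", "mag": 7.1},
    {"place": "Chile", "mag": 6.5},
    {"place": "Japan", "mag": 6.8},
], npartitions=2)

# foldby groups by key, reduces within each partition (binop),
# then merges the per-partition results (combine)
counts = dict(events.foldby(
    key=lambda e: e["place"],
    binop=lambda total, e: total + 1,
    initial=0,
    combine=lambda a, b: a + b,
    combine_initial=0,
).compute())
# counts == {"Japan": 2, "Chile": 1}
```

foldby avoids a full shuffle by reducing inside each partition first, so it scales far better than groupby for simple aggregations.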
Real-world pattern: processing partitioned earthquake metadata or logs from multiple files.
# Glob daily JSONL files from USGS-style exports
bag = db.read_text('quakes/*.jsonl')
# Parse & filter strong, shallow events
strong_shallow = (
    bag
    .map(json.loads)
    .filter(lambda e: e.get('mag', 0) >= 6.0 and e.get('depth', 1000) <= 70)
    .map(lambda e: {
        'id': e['id'],
        'time': e['time'],
        'mag': e['mag'],
        'depth': e['depth'],
        'place': e['place']
    })
)
# Compute count and sample
print(f"Strong shallow events: {strong_shallow.count().compute()}")
sample = strong_shallow.take(5) # first 5 items
print(sample)
# Convert to DataFrame for further analysis
df_strong = strong_shallow.to_dataframe().compute()
print(df_strong.head())
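The per-line map calls above can also be batched: map_partitions hands your function a whole partition at once, which helps when setup cost (a parser, a compiled regex, a connection) should be paid once per partition rather than per element. A minimal sketch with hypothetical CSV-style lines:

```python
import dask.bag as db

lines = db.from_sequence(["a,1", "b,2", "c,3"], npartitions=2)

def parse_partition(partition):
    # partition is a plain sequence of elements, handled in one call
    return [dict(zip(("key", "val"), line.split(","))) for line in partition]

parsed = lines.map_partitions(parse_partition).compute()
# parsed == [{'key': 'a', 'val': '1'}, {'key': 'b', 'val': '2'}, {'key': 'c', 'val': '3'}]
```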
Best practices for building Dask Bags with globbing:
- Use read_text() for line-based data: the simplest choice for logs/JSONL.
- Modern tip: use Polars for columnar data (pl.scan_ndjson('data/*.jsonl') is faster for structured data); keep Bags for truly unstructured, text-heavy work.
- For whole-file processing (e.g. parsing per file), glob paths and use db.from_sequence(paths); db.from_filenames is no longer part of Dask.
- Set include_path=True in read_text() to track each line's source file.
- Use map_partitions for custom file-level logic (each file is one partition by default).
- Visualize the task graph with strong_shallow.visualize() to debug.
- Persist hot bags with strong_shallow.persist() for repeated operations.
- Use a distributed client (Client()) for clusters.
- Add type hints: @delayed def parse_line(line: str) -> dict.
- Monitor the dashboard: memory, tasks, progress.
- Avoid large reductions early; filter/map first.
- Use flatten() after reading multi-line files.
- Use pluck() to extract fields efficiently.
- Use to_dataframe() to convert to a Dask DataFrame when tabular structure emerges.
- Use db.from_sequence() for in-memory lists of tasks.
- Profile with timeit to compare bag operations against pandas loops.
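Two of the tips above, pluck() and flatten(), are easy to see on a tiny in-memory bag (the records here are hypothetical):

```python
import dask.bag as db

records = db.from_sequence([
    {"id": 1, "tags": ["quake", "shallow"]},
    {"id": 2, "tags": ["quake", "deep"]},
])

ids = records.pluck("id").compute()               # [1, 2]
tags = records.pluck("tags").flatten().compute()  # ['quake', 'shallow', 'quake', 'deep']
```

pluck("mag") is both clearer and cheaper than map(lambda e: e["mag"]) when all you need is one field.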
Building Dask Bags with globbing handles large non-tabular data: read_text/from_sequence for lines and files, map/filter/reduce for processing, to_dataframe for the tabular transition. In 2026, tune partition sizes (blocksize, files_per_partition), persist intermediates, visualize graphs, prefer Polars for structured data, and monitor the dashboard. Master Dask Bags, and you’ll process massive text/JSON/log datasets efficiently, scalably, and with full parallel power.
Next time you have partitioned text or JSON files — build a Dask Bag. It’s Python’s cleanest way to say: “Process these files in parallel — line by line or file by file.”