Reading text files into Dask is the gateway to scalable, parallel processing of large-scale unstructured or semi-structured text data — logs, documents, JSONL, CSV-like files, sensor streams, or any line-based format too big for memory. Dask’s dask.bag.read_text() reads files lazily via glob patterns or lists, creating a Bag where each line (or file) is an element that can be mapped, filtered, reduced, or converted to DataFrames in parallel across cores or clusters. In 2026, this remains essential for log analysis, text mining, NLP preprocessing, multi-file JSONL catalogs (e.g., earthquake metadata), and ETL pipelines — complementing Dask DataFrames (tabular) and Arrays (numerical) while scaling effortlessly with minimal code.
Here’s a complete, practical guide to reading text files with Dask: basic read_text(), globbing vs explicit lists, line vs whole-file reading, real-world patterns (earthquake JSONL, log parsing, multi-file text), and modern best practices with chunk control, parallelism, error handling, visualization, distributed execution, and Polars equivalents.
Basic text file reading — read_text() loads lines lazily from files matching a glob or list.
import dask.bag as db
# Single file: each line is an element
bag_single = db.read_text('large_log.txt')
print(bag_single.take(5)) # first 5 lines
# Multiple files via glob (most common)
bag_glob = db.read_text('logs/*.log') # all .log files in directory
print(bag_glob.count().compute()) # total lines across files
# Explicit file list
files = ['file1.txt', 'file2.txt', 'file3.txt']
bag_list = db.read_text(files)
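To see the lazy map/filter/compute cycle end to end, here is a minimal, self-contained sketch: it writes two tiny log files to a temporary directory (the file names and contents are hypothetical) and then globs, filters, and counts them with a Bag.

```python
import os
import tempfile

import dask.bag as db

# Hypothetical setup: write two small log files so the example is self-contained
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'app1.log'), 'w') as f:
    f.write('INFO start\nERROR disk full\nINFO done\n')
with open(os.path.join(tmp, 'app2.log'), 'w') as f:
    f.write('ERROR timeout\nINFO ok\n')

# Glob both files, strip trailing newlines, keep only ERROR lines — all lazily
errors = (db.read_text(os.path.join(tmp, '*.log'))
            .map(str.strip)
            .filter(lambda line: line.startswith('ERROR')))

print(errors.count().compute())   # 2
print(sorted(errors.compute()))   # ['ERROR disk full', 'ERROR timeout']
```

Note that nothing is read from disk until `.compute()` (or `.count().compute()`) triggers the graph.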
Advanced options — control blocksize (partition size), encoding, compression, whole-file vs lines.
# Custom partition size (bytes per chunk)
bag_chunked = db.read_text('big/*.txt', blocksize='64MB')
# Handle encoding & compression
bag_encoded = db.read_text('data/*.txt.gz', encoding='utf-8', compression='gzip')
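A runnable sketch of the compression path, with a hypothetical gzip file created on the fly; by default `compression='infer'` detects gzip from the `.gz` suffix, and compressed files are not split by `blocksize`.

```python
import gzip
import os
import tempfile

import dask.bag as db

# Hypothetical setup: write a small gzip-compressed, UTF-8 text file
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, 'data.txt.gz')
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write('línea uno\nlínea dos\n')

# compression could also be left as the default 'infer' here
bag = db.read_text(path, encoding='utf-8', compression='gzip')
print(bag.map(str.strip).compute())   # ['línea uno', 'línea dos']
```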
# One partition per file: blocksize=None (the default) never splits a file into blocks
bag_per_file = db.read_text('docs/*.md', blocksize=None)
# Note: read_text always yields individual lines as elements, even with
# blocksize=None — it does not produce one element per whole file.
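Since `read_text()` itself always yields lines, one way to get genuine one-element-per-file Bags is to read each file with `dask.delayed` and assemble the partitions with `db.from_delayed`. A minimal sketch, with hypothetical file names and contents:

```python
import glob
import os
import tempfile

import dask
import dask.bag as db

# Hypothetical setup: two small markdown files
tmp = tempfile.mkdtemp()
for name, text in [('a.md', '# A\nbody\n'), ('b.md', '# B\n')]:
    with open(os.path.join(tmp, name), 'w') as f:
        f.write(text)

@dask.delayed
def read_whole(path):
    # Each delayed call yields a one-element partition: the entire file
    with open(path, encoding='utf-8') as f:
        return [f.read()]

paths = sorted(glob.glob(os.path.join(tmp, '*.md')))
bag_whole = db.from_delayed([read_whole(p) for p in paths])
docs = bag_whole.compute()
print(len(docs))   # 2 — one element per file
```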
Real-world pattern: processing multi-file earthquake metadata JSONL — parse, filter strong events, extract features.
import json  # needed for json.loads in the pipeline below

# Glob daily JSONL exports (one JSON event per line)
bag = db.read_text('usgs/day_*.jsonl', blocksize='32MB')
# Pipeline: parse → filter M≥7 → project fields → count per country
pipeline = (
    bag
    .map(json.loads)                                 # parse each line
    .filter(lambda e: (e.get('mag') or 0) >= 7.0)    # strong events (mag may be null)
    .map(lambda e: {
        'time': e['time'],
        'mag': e['mag'],
        'lat': e['latitude'],
        'lon': e['longitude'],
        'depth': e['depth'],
        'place': e.get('place') or 'Unknown',
        'country': (e.get('place') or 'Unknown').split(',')[-1].strip(),
    })
)
# Aggregate: count events per country — frequencies() is Bag's built-in counter
country_counts = pipeline.pluck('country').frequencies()
top_countries = sorted(country_counts.compute(), key=lambda x: x[1], reverse=True)[:10]
print("Top 10 countries by M≥7 events:")
for country, count in top_countries:
    print(f"{country}: {count}")
# Sample 5 strong events
sample = pipeline.take(5)
print(sample)
Best practices for reading text files into Dask Bags:
- Use read_text() for line-based data — logs, JSONL, CSV-like formats.
- Set blocksize — '16MB'–'128MB' balances parallelism against scheduling overhead; blocksize=None keeps each file in a single partition.
- Set encoding='utf-8' explicitly — or 'latin1' for legacy files.
- Handle errors — wrap parsing in a safe_parse() that returns None on failure, then drop failures with .filter(None.__ne__).
- Visualize the graph — pipeline.visualize() to debug task structure.
- Persist hot bags — filtered.persist() for reuse across computations.
- Use the distributed client — Client() for clusters, and monitor memory, tasks, and progress on its dashboard.
- Add type hints — def parse_line(line: str) -> dict | None.
- Use .pluck('key') — efficient field extraction.
- Use .to_dataframe() — transition to a Dask DataFrame once records are tabular.
- Use .flatten() — to un-nest elements after mapping a function that returns lists.
- Use orjson.loads — faster JSON parsing (pip install orjson).
- Profile with timeit — compare read_text pipelines against plain pandas loops.
- Use db.from_sequence() — to parallelize in-memory lists.
- Modern tip: prefer Polars for structured text — pl.scan_ndjson('files/*.jsonl') gives faster columnar parsing; keep Bags for truly unstructured or custom line formats.
Reading text files into Dask Bags with read_text() enables parallel processing of large line-based datasets: glob patterns, blocksize control, lazy map/filter/reduce, and a clean transition to DataFrames when the data turns tabular. In 2026, use fast parsers like orjson, persist intermediates, visualize graphs, prefer Polars for structured text, and monitor the distributed dashboard. Master text reading in Dask, and you’ll handle massive logs, JSONL catalogs, or raw text efficiently and scalably.
Next time you have multiple text or JSONL files — read them with Dask. It’s Python’s cleanest way to say: “Process these files line by line — in parallel, at any scale.”