Glob expressions are the standard way to specify file patterns for matching multiple files or paths in Python — using wildcards like *, ?, [abc], [a-z], and more to select sets of files efficiently. In 2026, glob patterns remain essential across data workflows — reading partitioned CSV/JSONL/Parquet/HDF5 files with Dask (read_csv('data/*.csv')), Polars (scan_csv('logs/*.log')), pandas (concatenating over glob() results, since pandas itself does not expand glob patterns), or custom scripts. They enable scalable, declarative data loading without listing every file, perfect for time-series partitions, daily logs, monthly catalogs, or distributed storage (S3, GCS, Azure).
Here’s a complete, practical guide to glob expressions in Python: basic patterns & wildcards, advanced usage (recursive, negation, character classes), real-world patterns (earthquake data partitions, logs, multi-file catalogs), and modern best practices with performance, error handling, Dask/Polars/pandas integration, and path manipulation tips.
Basic glob patterns — common wildcards and their meanings.
* — matches any sequence of characters (including none): *.csv → all CSV files
? — matches any single character: file?.txt → file1.txt, fileA.txt, etc.
[abc] — matches any one character from the set: log[123].txt → log1.txt, log2.txt, log3.txt
[a-z] — matches any character in the range: data[0-9].csv → data0.csv to data9.csv
[!abc] — matches any character NOT in the set (Python's glob uses !, not the shell-style ^): log[!0].txt → excludes log0.txt
** — matches recursively across subdirectories (pass recursive=True to glob): **/*.jsonl → all JSONL files in any subfolder
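The rules above can be verified without touching the filesystem: the fnmatch module in the standard library implements the same glob-style matching against plain strings, which makes it handy for experimenting with patterns.

```python
from fnmatch import fnmatch

# Each wildcard rule, checked against plain strings
# (fnmatch applies glob matching without any filesystem access).
assert fnmatch('report.csv', '*.csv')          # * matches any sequence
assert fnmatch('file1.txt', 'file?.txt')       # ? matches exactly one character
assert fnmatch('log2.txt', 'log[123].txt')     # [abc] set membership
assert fnmatch('data7.csv', 'data[0-9].csv')   # [0-9] character range
assert not fnmatch('log0.txt', 'log[!0].txt')  # [!abc] negated set
print('all wildcard rules hold')
```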
Using glob in Python — glob module for listing, Dask/Polars/pandas for direct loading.
from glob import glob
# List matching files
csv_files = glob('data/*.csv') # non-recursive
all_jsonl = glob('logs/**/*.jsonl', recursive=True) # recursive
print(f"Found {len(csv_files)} CSV files")
# Dask: read multiple files directly
import dask.dataframe as dd
ddf = dd.read_csv('earthquakes/*.csv', blocksize='64MB')
print(ddf.head())
# Polars: scan multiple files (lazy)
import polars as pl
pl_lf = pl.scan_csv('logs/*.log') # lazy: builds a query plan, reads nothing yet
print(pl_lf.head(5).collect()) # preview first rows (scan_csv does not decompress .gz; use pl.read_csv for gzipped files)
# Pandas: concatenate on read
import pandas as pd
df_pd = pd.concat((pd.read_csv(f) for f in glob('files/*.csv')), ignore_index=True)
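The same listing can be done with pathlib, which yields Path objects instead of strings — handy when you need to manipulate the matched paths afterwards. A minimal self-contained sketch (the temporary directory and file names are placeholders, not part of any real dataset):

```python
import tempfile
from pathlib import Path

# Build a throwaway directory tree so the example runs anywhere.
root = Path(tempfile.mkdtemp())
(root / 'nested').mkdir()
(root / 'a.csv').touch()
(root / 'nested' / 'b.csv').touch()

flat = list(root.glob('*.csv'))     # non-recursive: only a.csv
deep = sorted(root.rglob('*.csv'))  # recursive: a.csv and nested/b.csv
print(len(flat), len(deep))         # 1 2
```

Each result is a Path, so you can call .stem, .suffix, or .parent on it directly instead of re-parsing strings.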
Real-world pattern: reading partitioned earthquake data — glob monthly/yearly CSVs or JSONL files.
# Glob all monthly earthquake CSVs (2024–2025)
monthly_files = sorted(glob('usgs/earthquakes_202[4-5]*.csv')) # 2024 & 2025 only; sorted, since glob order is arbitrary
# Dask: lazy load & concatenate
ddf = dd.read_csv(monthly_files, assume_missing=True, blocksize='128MB', parse_dates=['time']) # parse 'time' so .dt works below
# Filter strong events & compute stats
strong = ddf[ddf['mag'] >= 7.0]
mean_by_year = strong.assign(year=strong['time'].dt.year).groupby('year')['mag'].mean().compute()
print(mean_by_year)
# Alternative: recursive glob for nested directories
import json
import dask.bag as db
all_jsonl = glob('catalogs/**/*.jsonl', recursive=True)
bag = db.read_text(all_jsonl).map(json.loads).filter(lambda e: e.get('mag', 0) >= 6.0)
print(f"Strong events across all files: {bag.count().compute()}")
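For small catalogs, the same glob-then-filter pipeline can be sketched with just the standard library — no Dask required. The file layout and the 'mag' field below are illustrative assumptions mirroring the bag example above:

```python
import json
import tempfile
from glob import glob
from pathlib import Path

# Create two small JSONL partitions so the example is self-contained.
root = Path(tempfile.mkdtemp())
(root / 'part1.jsonl').write_text('{"mag": 6.4}\n{"mag": 3.1}\n')
(root / 'part2.jsonl').write_text('{"mag": 7.2}\n')

# Glob the partitions, parse each line, keep events with mag >= 6.0.
strong = [
    event
    for path in glob(f'{root}/*.jsonl')
    for line in open(path)
    if (event := json.loads(line)).get('mag', 0) >= 6.0
]
print(f"Strong events across all files: {len(strong)}")  # 2
```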
Best practices for glob expressions in Python & data workflows:
- Use glob.glob(pattern, recursive=True) for deep directory traversal.
- Modern tip: prefer Polars pl.scan_csv('data/**/*.csv') for fast lazy multi-file scanning; reach for Dask at distributed scale.
- Use **/*.ext for recursive search, but be cautious in large directory trees.
- Use [0-9] ranges for date/number patterns (e.g., 202[4-5]*.csv).
- Avoid over-globbing: write specific patterns so unwanted files are never loaded.
- Use pathlib.Path.glob(), or Path.rglob() for recursion, as a modern alternative that yields Path objects.
- Use include_path_column=True in Dask readers to track each row's source file.
- Tune blocksize: '64MB'–'256MB' balances parallelism against per-task overhead.
- Validate matches, e.g. print(glob('pattern')), before loading.
- Glob compressed files with glob('**/*.gz', recursive=True); Dask and pandas detect compression from the extension on read.
- Use dask.bag's db.read_text('logs/*.log') for line-based text processing.
- Profile with timeit: glob vs. a manually built file list.
- Use fnmatch for lower-level pattern matching against plain strings.
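The "validate matches" tip deserves emphasis: a glob that matches nothing fails silently, and the confusing error surfaces much later in the pipeline. One way to fail fast is a small helper like this sketch (require_files is an illustrative name, not a library function):

```python
from glob import glob

def require_files(pattern: str) -> list[str]:
    """Resolve a glob pattern and fail immediately if nothing matched."""
    files = sorted(glob(pattern, recursive=True))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return files

# Validate before handing the list to Dask/Polars/pandas.
try:
    require_files('definitely_missing/**/*.csv')
except FileNotFoundError as exc:
    print(exc)  # no files match 'definitely_missing/**/*.csv'
```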
Glob expressions match file patterns efficiently: use *, ?, [a-z], and ** for directories, and load matched files directly in Dask or Polars with glob strings (or concatenate them in pandas). In 2026, use recursive globs, specific patterns, and blocksize control; persist intermediates and monitor the Dask dashboard. Master globbing, and you’ll load partitioned data scalably and declaratively for any big data pipeline.
Next time you need multiple files — glob them right. It’s Python’s cleanest way to say: “Load all these matching files — in parallel, without listing every one.”