Python's glob module is the standard, flexible way to discover and select multiple files by pattern, using wildcards (*, ?, [a-z], **) to match filenames or paths before passing them to data-loading functions in Dask, pandas, Polars, or custom scripts. It remains essential for partitioned data workflows: loading monthly or yearly CSV/JSONL/Parquet/HDF5 files, processing rotated logs, aggregating sensor exports, or handling earthquake catalogs split by time or region, all without maintaining manual file lists or hardcoded paths.
This is a practical guide to the glob module: basic patterns, recursive and advanced matching, integration with Dask/pandas/Polars, real-world patterns (partitioned earthquake data, logs, multi-file catalogs), and best practices covering performance, error handling, path manipulation, and pathlib alternatives.
Basic glob usage — find files matching simple patterns.
from glob import glob
# All CSV files in current directory
csv_files = glob('*.csv')
print(f"Found {len(csv_files)} CSV files:", csv_files)
# All JSONL files in a specific folder
jsonl_files = glob('data/*.jsonl')
print(jsonl_files)
# Files with numeric suffix (e.g., file1.txt, file2.txt)
numeric_files = glob('file[0-9].txt')
print(numeric_files)
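One gotcha worth knowing up front: glob returns matches in whatever order the OS lists them, which can differ between runs and machines. A minimal, self-contained sketch (using a throwaway temp directory, so the filenames here are illustrative):

```python
import os
import tempfile
from glob import glob

# create a throwaway directory with sample files to show match ordering
tmp = tempfile.mkdtemp()
for name in ('b.csv', 'a.csv', 'c.txt'):
    open(os.path.join(tmp, name), 'w').close()

# glob returns matches in arbitrary OS order; sort for reproducible pipelines
files = sorted(glob(os.path.join(tmp, '*.csv')))
print([os.path.basename(f) for f in files])  # ['a.csv', 'b.csv']
```

Sorting matters whenever downstream concatenation order is meaningful, e.g. time-partitioned files.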
Recursive & advanced globbing — use ** for subdirectories (requires recursive=True).
# Recursive: all CSV files in any subfolder
all_csv = glob('project/**/*.csv', recursive=True)
print(f"Found {len(all_csv)} CSVs recursively")
# Exclude patterns (combine with list comprehension)
all_logs = [f for f in glob('logs/**/*.log', recursive=True) if 'error' not in f]
# Character classes & negation
data_2024 = glob('data/202[4-5]*.parquet') # 2024 & 2025 only
no_temp = glob('temp/[!0-9]*.txt') # files not starting with digit
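When a recursive pattern could match an enormous tree, glob.iglob avoids materializing the full list. A small self-contained sketch (the tree layout here is fabricated for the example):

```python
import os
import tempfile
from glob import iglob

# build a small nested tree so the example is self-contained
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
for rel in ('top.log', os.path.join('sub', 'nested.log')):
    open(os.path.join(root, rel), 'w').close()

# iglob yields matches lazily instead of building the whole list,
# which helps when '**' would otherwise match millions of paths
count = sum(1 for _ in iglob(os.path.join(root, '**', '*.log'), recursive=True))
print(count)  # 2
```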
Integration with Dask, pandas, Polars — glob + load in one line.
import dask.dataframe as dd
import pandas as pd
import polars as pl
# Dask: lazy multi-file read
ddf = dd.read_csv('earthquakes/*.csv')  # accepts a glob string directly; auto-concatenates
print(ddf.head())
# Pandas: eager load & concat (sort for deterministic row order)
df_pd = pd.concat((pd.read_csv(f) for f in sorted(glob('data/*.csv'))), ignore_index=True)
# Polars: fast lazy scan (scan_csv accepts a glob pattern string directly)
pl_lazy = pl.scan_csv('logs/*.csv')
print(pl_lazy.head(5).collect())  # preview the first rows
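Before handing a glob result to any of these loaders, it is often useful to bucket the matched files by partition key so each group can be loaded and concatenated separately. A stdlib-only sketch, assuming a hypothetical `quakes_YYYY_MM.csv` naming scheme:

```python
import os
import re
import tempfile
from collections import defaultdict
from glob import glob

# fabricate a few partitioned filenames (hypothetical naming scheme)
tmp = tempfile.mkdtemp()
for name in ('quakes_2024_01.csv', 'quakes_2024_02.csv', 'quakes_2025_01.csv'):
    open(os.path.join(tmp, name), 'w').close()

# bucket matches by year so each group can be loaded and concatenated separately
by_year = defaultdict(list)
for path in glob(os.path.join(tmp, 'quakes_*.csv')):
    m = re.search(r'_(\d{4})_', os.path.basename(path))
    if m:
        by_year[m.group(1)].append(path)

print({year: len(paths) for year, paths in sorted(by_year.items())})  # {'2024': 2, '2025': 1}
```

Each bucket can then go to a separate pd.concat or dd.read_csv call.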
Real-world pattern: loading partitioned earthquake data — glob monthly CSVs or JSONL files.
# Glob all monthly earthquake CSVs (2024–2025)
monthly_csvs = glob('usgs/earthquakes_202[4-5]*.csv')
# Dask: lazy load & process (parse 'time' as datetime so .dt.year works below)
ddf = dd.read_csv(monthly_csvs, blocksize='128MB', assume_missing=True, parse_dates=['time'])
# Filter strong events & compute stats
strong = ddf[ddf['mag'] >= 7.0]
mean_by_year = strong.assign(year=strong['time'].dt.year).groupby('year')['mag'].mean().compute()
print(mean_by_year)
# Alternative: recursive glob for nested JSONL
import json
import dask.bag as db

all_jsonl = glob('catalogs/**/*.jsonl', recursive=True)
bag = db.read_text(all_jsonl).map(json.loads).filter(lambda e: e.get('mag', 0) >= 6.0)
print(f"Strong events across all files: {bag.count().compute()}")
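A subtle failure mode in pipelines like the one above: glob silently returns an empty list when nothing matches, and the confusing error only surfaces later inside the loader. A minimal fail-fast guard (the helper name `require_matches` is illustrative, not a library API):

```python
from glob import glob

def require_matches(pattern: str) -> list:
    """Return sorted glob matches, failing fast when nothing matches."""
    matches = sorted(glob(pattern, recursive=True))
    if not matches:
        raise FileNotFoundError(f'no files match pattern: {pattern!r}')
    return matches

# a pattern that matches nothing triggers the guard immediately
try:
    require_matches('no_such_dir/**/*.parquet')
except FileNotFoundError as exc:
    print(exc)
```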
Best practices for glob in Python & data workflows:
- Use glob.glob(pattern, recursive=True) with **/*.ext for deep traversal, but test the pattern first to avoid accidentally matching a huge file set.
- Use character ranges like [0-9] for date/number patterns (e.g., 202[4-5]*.csv), and keep patterns specific so unwanted files are never loaded.
- Validate matches before loading: print(glob('pattern')) or check for an empty result.
- Prefer pathlib.Path.glob()/rglob() as the modern object-oriented alternative; fnmatch offers lower-level pattern matching when you need it.
- In Dask, pass include_path_column=True to dd.read_csv to track each row's source file, and tune blocksize ('64MB'-'256MB') to balance parallelism.
- Polars pl.scan_csv('data/**/*.csv') gives fast lazy multi-file scanning; use Dask for distributed scale.
- Compressed files matched via glob (e.g., *.gz) are decompressed automatically by pandas and Dask CSV readers based on the extension.
- Use dask.bag's db.read_text('logs/*.log') for line-based text processing.
- Profile with timeit if you suspect glob itself is a bottleneck versus a manual file list.
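The pathlib alternative mentioned above deserves a quick illustration. A self-contained sketch comparing Path.glob (non-recursive) with Path.rglob (recursive, equivalent to a '**' pattern):

```python
import tempfile
from pathlib import Path

# build a small nested tree to compare Path.glob and Path.rglob
root = Path(tempfile.mkdtemp())
(root / 'sub').mkdir()
(root / 'a.csv').touch()
(root / 'sub' / 'b.csv').touch()

top_level = sorted(p.name for p in root.glob('*.csv'))    # current directory only
everywhere = sorted(p.name for p in root.rglob('*.csv'))  # recurses, like '**/*.csv'
print(top_level, everywhere)  # ['a.csv'] ['a.csv', 'b.csv']
```

Both methods return Path objects rather than strings, which plugs directly into open(), pd.read_csv(), and friends.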
Python's glob module matches file patterns efficiently: use *, ?, [a-z], and ** for directories, and pass glob strings or lists directly to Dask/Polars/pandas loaders. Combine recursive globs, specific patterns, and blocksize control (plus persisted intermediates and the Dask dashboard for monitoring), and you can load partitioned data scalably and declaratively in any big data pipeline.
Next time you need multiple files, glob them right. It's Python's cleanest way to say: "Load all these matching files, in parallel, without listing every one."