Building delayed pipelines is one of the most flexible and powerful patterns in Dask — it lets you compose complex, multi-step workflows as a directed acyclic graph (DAG) of lazy computations using the @delayed decorator or dask.delayed wrapper. Each step is deferred until you call dask.compute(), allowing Dask to automatically parallelize, optimize, and schedule the entire pipeline across multiple cores, processes, or distributed clusters. In 2026, delayed pipelines are foundational for custom ETL, feature engineering, simulation ensembles, machine learning preprocessing, and scientific analysis — especially when high-level collections (DataFrame/Array) are too rigid or you need to integrate arbitrary Python functions, conditionals, loops, or external libraries.
Here’s a complete, practical guide to building delayed pipelines in Dask: @delayed basics, chaining functions, handling loops/conditionals, aggregating results, real-world patterns (multi-file processing, earthquake data pipeline), and modern best practices with type hints, visualization, distributed execution, error handling, and xarray/Polars equivalents.
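Before the full pipelines below, a minimal sketch of what @delayed actually does: each wrapped call returns a lazy Delayed object immediately, and nothing executes until you call compute().

```python
import dask
from dask import delayed

@delayed
def inc(x: int) -> int:
    return x + 1

@delayed
def add(x: int, y: int) -> int:
    return x + y

# Nothing runs yet -- each call just returns a lazy Delayed object
a = inc(1)
b = inc(2)
total = add(a, b)

# compute() walks the task graph; inc(1) and inc(2) can run in parallel
result = total.compute()
print(result)  # 5
```

Passing Delayed objects as arguments is how Dask learns the dependency structure: `add` cannot run until both `inc` tasks finish.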
Basic delayed pipeline — define steps with @delayed, chain them lazily, compute at the end.
import dask
from dask import delayed
import pandas as pd

@delayed
def read_file(file_path: str) -> pd.DataFrame:
    """Load CSV into pandas DataFrame."""
    return pd.read_csv(file_path)

@delayed
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning: drop NaNs, filter magnitude."""
    return df.dropna(subset=['mag']).query('mag >= 5.0')

@delayed
def compute_stats(df: pd.DataFrame) -> dict:
    """Extract key statistics."""
    return {
        'count': len(df),
        'mean_mag': df['mag'].mean(),
        'max_depth': df['depth'].max(),
        'countries': df['place'].value_counts().head(5).to_dict(),
    }

# Build pipeline for multiple files
files = ['quakes_2024_01.csv', 'quakes_2024_02.csv', 'quakes_2024_03.csv']

# Lazy pipeline: read → clean → stats
results = [compute_stats(clean_data(read_file(f))) for f in files]

# Compute all in parallel
final_results = dask.compute(*results)

# Aggregate across files
total_count = sum(r['count'] for r in final_results)
global_mean_mag = sum(r['mean_mag'] * r['count'] for r in final_results) / total_count
print(f"Total events: {total_count}, Global mean magnitude: {global_mean_mag:.2f}")
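The final aggregation above runs eagerly on the driver after dask.compute(). It can also be expressed as one more delayed step so the whole reduction stays in the graph. A sketch, using hypothetical in-memory frames in place of the CSV files:

```python
import dask
from dask import delayed
import pandas as pd

@delayed
def stats(df: pd.DataFrame) -> dict:
    """Per-file summary (hypothetical 'mag' column)."""
    return {'count': len(df), 'mean_mag': df['mag'].mean()}

@delayed
def merge(results: list) -> dict:
    """Count-weighted merge of per-file summaries."""
    total = sum(r['count'] for r in results)
    mean = sum(r['mean_mag'] * r['count'] for r in results) / total
    return {'count': total, 'mean_mag': mean}

# Hypothetical in-memory frames standing in for per-file reads
frames = [pd.DataFrame({'mag': [5.0, 6.0]}), pd.DataFrame({'mag': [7.0]})]

# Dask traverses the list argument, so merge depends on every stats task
summary = merge([stats(df) for df in frames])  # still lazy
out = summary.compute()
print(out)
```

Keeping the reduction lazy means a distributed scheduler can place it near the data instead of shipping every partial result back to the driver first.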
Handling loops & conditionals — delayed pipelines support dynamic logic.
@delayed
def process_file(file: str) -> dict:
    df = pd.read_csv(file)
    if len(df) < 100:
        return {'status': 'skipped', 'file': file}
    cleaned = df.dropna(subset=['mag'])
    return {
        'status': 'ok',
        'file': file,
        'count': len(cleaned),
        'mean_mag': cleaned['mag'].mean(),
    }

# Loop over files, build delayed tasks
files = [f'data/file_{i}.csv' for i in range(10)]
delayed_tasks = [process_file(f) for f in files]

# Compute all, filter valid results
results = dask.compute(*delayed_tasks)
valid_results = [r for r in results if r['status'] != 'skipped']
print(f"Processed {len(valid_results)} files successfully")
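By default, one raised exception inside any task aborts the whole dask.compute() call. A common defensive pattern is to catch inside the delayed function and return a status record instead, so sibling tasks still finish. A sketch with hypothetical (deliberately missing) file paths:

```python
import dask
from dask import delayed
import pandas as pd

@delayed
def safe_process(file: str) -> dict:
    """Return a status record instead of raising, so sibling tasks still run."""
    try:
        df = pd.read_csv(file)
        return {'status': 'ok', 'file': file, 'count': len(df)}
    except Exception as exc:
        return {'status': 'error', 'file': file, 'error': str(exc)}

# Hypothetical paths; both are missing, so both tasks report errors
tasks = [safe_process(f) for f in ['no_such_dir/a.csv', 'no_such_dir/b.csv']]
results = dask.compute(*tasks)
errors = [r for r in results if r['status'] == 'error']
print(f"{len(errors)} of {len(results)} files failed")
```

Afterwards you can retry just the failed files, which is usually cheaper than rerunning the entire batch.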
Real-world pattern: delayed pipeline for multi-file earthquake catalog processing — read, clean, extract stats, aggregate.
@delayed
def load_and_filter(file: str) -> pd.DataFrame:
    df = pd.read_csv(file, parse_dates=['time'])
    return df.query('mag >= 6.0 and depth <= 70.0')  # shallow strong events

@delayed
def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # keep the function pure: don't mutate the upstream result
    df['year'] = df['time'].dt.year
    df['country'] = df['place'].str.split(',').str[-1].str.strip()
    return df[['time', 'year', 'latitude', 'longitude', 'depth', 'mag', 'country']]

@delayed
def summarize(df: pd.DataFrame) -> dict:
    return {
        'event_count': len(df),
        'mean_mag': df['mag'].mean(),
        'max_depth': df['depth'].max(),
        'top_countries': df['country'].value_counts().head(5).to_dict(),
    }

# Build pipeline over partitioned files
monthly_files = ['quakes/month_{:02d}.csv'.format(i) for i in range(1, 13)]
pipelines = [summarize(extract_features(load_and_filter(f))) for f in monthly_files]

# Compute all in parallel
monthly_summaries = dask.compute(*pipelines)

# Aggregate across months
total_events = sum(s['event_count'] for s in monthly_summaries)
overall_mean_mag = sum(s['mean_mag'] * s['event_count'] for s in monthly_summaries) / total_events
print(f"Total strong shallow events: {total_events}")
print(f"Overall mean magnitude: {overall_mean_mag:.2f}")
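When several downstream steps share one cleaned intermediate, dask.persist() materializes it once so it isn't recomputed per branch. A minimal sketch, using a hypothetical in-memory frame in place of a loaded catalog file:

```python
import dask
from dask import delayed
import pandas as pd

@delayed
def clean(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna(subset=['mag'])

# Hypothetical in-memory frame standing in for a loaded file
raw = pd.DataFrame({'mag': [5.0, None, 6.5]})

# persist() runs the cleaning once and holds the result;
# both downstream tasks below reuse it instead of re-cleaning
(cleaned,) = dask.persist(clean(raw))
count = delayed(len)(cleaned)
mean_mag = delayed(lambda d: d['mag'].mean())(cleaned)
out = dask.compute(count, mean_mag)
print(out)
```

On a local scheduler persist blocks until the result is ready; on a distributed cluster it keeps the result in worker memory and returns immediately.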
Best practices for building delayed pipelines in Dask:
- Use @delayed on pure functions: no side effects, deterministic.
- Prefer xarray + Dask for gridded data: xr.open_mfdataset(..., chunks={...}) gives labeled lazy pipelines.
- Visualize the graph: final_result.visualize() to debug dependencies.
- Persist intermediates: cleaned.persist() for reuse across branches.
- Use the distributed scheduler: Client() for clusters.
- Add type hints: @delayed def load(file: str) -> pd.DataFrame.
- Handle errors: wrap delayed function bodies in try/except.
- Use dask.compute(*tasks) for parallel execution of independent branches.
- Monitor the dashboard at http://127.0.0.1:8787 for task times and memory.
- Avoid delayed around many tiny operations in tight loops: batch the work into fewer, larger tasks.
- Use da.map_blocks for array-based pipelines.
- Test on small inputs first: dask.compute(*small_tasks).
- Select a scheduler explicitly with dask.config.set(scheduler='threads'), 'processes', or 'synchronous', or via environment variables; creating a Client makes the distributed scheduler the default.
- Use client.restart() to clear worker state after errors.
- Close the client with client.close() when done.
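Scheduler selection from the list above in practice: the same delayed tasks can be computed under different local schedulers via dask.config.set, which is handy for debugging before moving to a cluster.

```python
import dask
from dask import delayed

@delayed
def square(x: int) -> int:
    return x * x

tasks = [square(i) for i in range(4)]

# Threaded scheduler (the default for delayed): good when steps do I/O
# or release the GIL
with dask.config.set(scheduler='threads'):
    res_threads = dask.compute(*tasks)

# Synchronous scheduler: single-threaded, ideal for stepping through
# failures with pdb
with dask.config.set(scheduler='synchronous'):
    res_sync = dask.compute(*tasks)

print(res_threads, res_sync)
```

Both calls return the same results; only the execution strategy changes, which is the point of keeping the graph separate from the scheduler.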
Building delayed pipelines with @delayed composes complex workflows lazily — read/clean/analyze multi-file data, chain steps, compute in parallel with dask.compute(). In 2026, visualize graphs, persist intermediates, use distributed client, prefer xarray for labeled data, and monitor dashboard. Master delayed pipelines, and you’ll create scalable, modular, parallel workflows for earthquake or any large-scale data processing.
Next time you need a multi-step data pipeline — build it delayed with Dask. It’s Python’s cleanest way to say: “Define the workflow — compute it fast and in parallel when ready.”