Extracting Dask array from HDF5 for Analyzing Earthquake Data is a powerful, scalable approach for handling large seismic catalogs stored in HDF5 — the format excels at compression, chunking, partial I/O, and metadata, making it ideal for time-indexed, multi-variable earthquake data (time, latitude, longitude, depth, magnitude). By extracting datasets as Dask arrays via da.from_array(dset, chunks=...), you gain lazy, parallel, out-of-core processing: compute statistics, filter events, or visualize maps without loading the full file into memory. In 2026, this workflow is standard for geophysics and disaster analysis — combining h5py for file access, Dask for parallelism, xarray for labeled lazy arrays, and geospatial tools (cartopy, geopandas) for mapping — enabling analysis of global catalogs spanning decades or real-time feeds.
Here’s a complete, practical guide to extracting Dask arrays from HDF5 earthquake data: storing CSV to HDF5, lazy extraction with chunk control, computing statistics, filtering/visualizing, and modern best practices with xarray integration, rechunking, distributed execution, compression, and performance optimization.
Storing USGS earthquake CSV to HDF5 — efficient, compressed, metadata-rich format.
import pandas as pd
import h5py
# Load USGS data (example: recent M≥5 events)
df = pd.read_csv("https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv&minmagnitude=5&starttime=2024-01-01")
with h5py.File('earthquakes.h5', 'w') as f:
    # Store core columns as datasets.
    # h5py has no native datetime64 support, so store time as int64 nanoseconds
    # since the epoch and record the units as an attribute.
    f.create_dataset('time',
                     data=pd.to_datetime(df['time']).values.astype('datetime64[ns]').astype('int64'),
                     compression='gzip', chunks=True)
    f['time'].attrs['units'] = 'nanoseconds since 1970-01-01T00:00:00Z'
    f.create_dataset('latitude', data=df['latitude'].to_numpy(),
                     compression='gzip', chunks=True)
    f.create_dataset('longitude', data=df['longitude'].to_numpy(),
                     compression='gzip', chunks=True)
    f.create_dataset('depth', data=df['depth'].to_numpy(),
                     compression='gzip', chunks=True)
    f.create_dataset('magnitude', data=df['mag'].to_numpy(),
                     compression='gzip', chunks=True)
    # Metadata as attributes
    f.attrs['source'] = 'USGS FDSN Event Service'
    f.attrs['query_date'] = str(pd.Timestamp.now())
    f.attrs['min_magnitude'] = 5.0
    f.attrs['description'] = 'Global earthquakes M≥5, 2024–present'
    # Optional: store place names as fixed-length byte strings
    # (ASCII only; use h5py.string_dtype() for full Unicode)
    f.create_dataset('place', data=df['place'].to_numpy().astype('S'),
                     compression='gzip', chunks=True)
Extracting Dask arrays from HDF5 — lazy, chunked access for parallel computation.
import dask.array as da
# Keep the file handle open while the Dask arrays are in use:
# chunks are only read from disk when .compute() runs, so a closed
# file (e.g. after a `with` block exits) would break every computation
f = h5py.File('earthquakes.h5', 'r')
# Extract each dataset as a Dask array (use native chunks or 'auto')
lat = da.from_array(f['latitude'], chunks='auto')
lon = da.from_array(f['longitude'], chunks='auto')
depth = da.from_array(f['depth'], chunks='auto')
mag = da.from_array(f['magnitude'], chunks='auto')
print(mag)  # dask.array<...>: lazy, nothing loaded yet
# Lazy global stats
mean_mag = mag.mean().compute()
max_depth = depth.max().compute()
print(f"Mean magnitude: {mean_mag:.2f}, Max depth: {max_depth:.1f} km")
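Instead of `chunks='auto'`, you can align Dask chunks with the file's native HDF5 chunks so each task decompresses whole blocks exactly once. A sketch with a synthetic `demo.h5` file; the filename and the 250-element chunk size are illustrative, not part of the workflow above:

```python
import numpy as np
import h5py
import dask.array as da

# Synthetic file mirroring the earthquakes.h5 layout (illustrative chunk size)
with h5py.File("demo.h5", "w") as fh:
    fh.create_dataset("magnitude", data=np.linspace(5.0, 8.0, 1000),
                      chunks=(250,), compression="gzip")

fh = h5py.File("demo.h5", "r")                  # keep open while arrays are alive
dset = fh["magnitude"]
mag_demo = da.from_array(dset, chunks=dset.chunks)  # one Dask chunk per HDF5 chunk
print(mag_demo.chunks)                          # ((250, 250, 250, 250),)
mean = mag_demo.mean().compute()                # 6.5 for this symmetric range
fh.close()
```

With matching chunk boundaries, no HDF5 block is decompressed by more than one Dask task.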
Real-world pattern: filtering, aggregating, and visualizing earthquake data lazily.
# Filter strong events (M ≥ 7) and compute stats
strong_mag = mag[mag >= 7]
# Boolean indexing produces unknown chunk sizes, so strong_mag.size is NaN;
# count the events by summing the mask instead
strong_count = int((mag >= 7).sum().compute())
print(f"Earthquakes M≥7: {strong_count}")
# Spatial subset: eastern Pacific region
pacific_mask = (lat >= -60) & (lat <= 60) & (lon >= -180) & (lon <= -120)
pacific_mag = mag[pacific_mask]
print(f"Mean magnitude in eastern Pacific: {pacific_mag.mean().compute():.2f}")
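Magnitude-frequency counts (the starting point for Gutenberg-Richter analysis) can also stay lazy via `da.histogram`, which, unlike NumPy, requires explicit `bins` and `range`. A sketch using a hypothetical `mag_demo` array in place of the `mag` loaded from HDF5 above:

```python
import numpy as np
import dask.array as da

# Synthetic magnitudes (in the real workflow, use the mag array from the file)
mag_demo = da.from_array(np.array([5.2, 5.8, 6.1, 6.9, 7.4, 7.9]), chunks=3)

# Lazy histogram: dask needs bins and range up front to bin chunks independently
counts, edges = da.histogram(mag_demo, bins=3, range=(5.0, 8.0))
print(counts.compute())  # [2 2 2]: two events per whole-magnitude bin
print(edges)             # [5. 6. 7. 8.], returned as a concrete NumPy array
```

Each chunk is binned in parallel and the per-chunk counts are summed, so the full magnitude column is never materialized at once.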
# Visualization: scatter plot of strong events
import matplotlib.pyplot as plt
# Compute the strong-event mask once and reuse it for all three columns
strong = (mag >= 7)
strong_lat = lat[strong].compute()
strong_lon = lon[strong].compute()
strong_mag_val = mag[strong].compute()
plt.figure(figsize=(12, 6))
plt.scatter(strong_lon, strong_lat, c=strong_mag_val, s=strong_mag_val**2,
cmap='Reds', alpha=0.7, edgecolor='k')
plt.colorbar(label='Magnitude')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Earthquakes M≥7 (2024–present)')
plt.show()
Best practices for HDF5 + Dask earthquake analysis:
- Use native HDF5 chunks (chunks=dset.chunks) so each Dask task reads whole compressed blocks, the optimal I/O pattern.
- Prefer xarray for labeled, lazy arrays: xr.open_dataset('earthquakes.h5', engine='h5netcdf', chunks={...}) handles time/lat/lon coordinates automatically when the file follows netCDF conventions.
- Rechunk before reductions: for a plain Dask array, mag.rechunk(-1) merges it into a single chunk (the named form, e.g. .rechunk({'time': -1}), is xarray syntax).
- Visualize the task graph, e.g. mag.mean().visualize(), to debug chunk alignment.
- Persist intermediates (mag.persist()) for repeated queries.
- Use the distributed scheduler (dask.distributed.Client()) on clusters, and watch its dashboard for task times and per-chunk memory.
- Add type hints, e.g. def analyze_mag(mag: da.Array) -> float (da.Array is not a parameterized generic, so document dtype and shape in the docstring).
- Use compression ('gzip' or 'lzf') when writing.
- Use partial I/O (dset[0:10000]) to read slices without loading whole datasets.
- Use da.map_blocks for custom per-chunk functions (e.g. per-chunk magnitude histograms).
- Test on small slices first: mag[:1000].compute().
- Use cartopy/geopandas for accurate spatial plots.
- Store units, source, and query info in f.attrs/dset.attrs.
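The da.map_blocks tip can be sketched as follows. The `clip_block` function and the synthetic array are hypothetical, but the pattern is the real API: a plain NumPy function applied independently to each chunk.

```python
import numpy as np
import dask.array as da

# Synthetic magnitudes split into two chunks of three
mag_demo = da.from_array(np.array([5.0, 5.5, 6.0, 6.5, 7.0, 7.5]), chunks=3)

# Hypothetical per-chunk function: cap reported magnitudes at M7.
# It receives one NumPy chunk at a time and returns an array of the same shape.
def clip_block(block):
    return np.minimum(block, 7.0)

capped = da.map_blocks(clip_block, mag_demo, dtype=mag_demo.dtype)
print(capped.compute())  # [5.  5.5 6.  6.5 7.  7. ]
```

Because each chunk is processed independently, this scales to arrays far larger than memory; if your function changes the chunk shape or dtype, pass chunks= and dtype= so Dask can keep its metadata consistent.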
Extracting Dask arrays from HDF5 earthquake files enables lazy, parallel analysis — use da.from_array(dset, chunks=...), compute stats/filters lazily, visualize with matplotlib/cartopy. In 2026, prefer xarray for labeled access, rechunk for reductions, persist intermediates, use compression, and monitor dashboard. Master HDF5 + Dask, and you’ll analyze massive seismic datasets efficiently, scalably, and with full metadata.
Next time you have large earthquake data in HDF5 — extract it as Dask arrays. It’s Python’s cleanest way to say: “Make this seismic archive computable — in parallel, without loading it all.”