Using HDF5 files for analyzing earthquake data is a smart, scalable approach when dealing with large seismic catalogs — HDF5 offers hierarchical organization, compression, chunking, partial I/O, and rich metadata, making it ideal for storing time-indexed, multi-variable earthquake data (time, lat, lon, depth, magnitude, etc.) efficiently. In 2026, HDF5 remains the standard for big scientific datasets (USGS, IRIS, global catalogs), enabling fast queries, parallel access via h5py/Dask, labeled analysis with xarray, and integration with pandas/Polars for exploration or ML. This workflow avoids CSV bloat, supports compression (reduce file size 50–80%), and allows lazy loading of only needed slices — perfect for terabyte-scale historical or real-time seismic archives.
Here’s a complete, practical guide to using HDF5 for earthquake data analysis in Python: downloading USGS data, storing to HDF5 with h5py, extracting with lazy Dask/xarray, aggregating/visualizing, and modern best practices with chunking/compression, metadata, performance, and geospatial integration.
Downloading & preparing USGS earthquake data — fetch recent/large catalogs as CSV.
import requests
import pandas as pd
from datetime import datetime
# USGS FDSN query: magnitude ≥ 5, last 5 years, ordered by time
base_url = "https://earthquake.usgs.gov/fdsnws/event/1/query"
params = {
"format": "csv",
"starttime": "2021-01-01",
"endtime": datetime.now().strftime("%Y-%m-%d"),
"minmagnitude": "5",
"orderby": "time-asc",
"limit": "20000" # adjust for larger sets
}
response = requests.get(base_url, params=params)
response.raise_for_status()
from io import StringIO
df = pd.read_csv(StringIO(response.text))  # parse the already-downloaded CSV (avoids a second request)
print(df.head())
print(f"Loaded {len(df)} earthquakes")
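Before writing to HDF5 it can pay to normalize the catalog: USGS CSVs occasionally contain rows with missing coordinates and duplicate event ids. A minimal cleaning sketch, using a tiny synthetic DataFrame in place of the real download (the `clean_catalog` helper and the sample values are illustrative, not part of the USGS API):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the USGS CSV (illustrative values only)
df = pd.DataFrame({
    "id": ["ev1", "ev2", "ev2", "ev3"],
    "time": ["2021-01-01T00:00:00", "2021-02-01T00:00:00",
             "2021-02-01T00:00:00", "2021-03-01T00:00:00"],
    "latitude": [10.0, np.nan, -5.0, -5.0],
    "longitude": [120.0, 30.0, 30.0, 30.0],
    "depth": [10.0, 35.0, 35.0, 600.0],
    "mag": [5.1, 6.0, 6.0, 5.5],
})

def clean_catalog(df):
    """Drop rows missing coordinates/magnitude, deduplicate event ids, parse times."""
    out = df.dropna(subset=["latitude", "longitude", "mag"])
    out = out.drop_duplicates(subset="id", keep="first")
    out = out.assign(time=pd.to_datetime(out["time"]))
    return out.reset_index(drop=True)

clean = clean_catalog(df)
print(len(clean))  # one NaN-coordinate row dropped, one duplicate id collapsed
```

The same function applies unchanged to the full downloaded DataFrame.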
Storing earthquake data to HDF5 — efficient, compressed, with metadata.
import h5py
# h5py has no native datetime64 dtype, so store times as int64 nanoseconds
time_ns = pd.to_datetime(df['time']).values.astype('int64')

with h5py.File('earthquakes.h5', 'w') as f:
    # Store core columns as compressed, chunked datasets
    f.create_dataset('time', data=time_ns, compression='gzip', chunks=True)
    f['time'].attrs['units'] = 'nanoseconds since 1970-01-01 UTC'
    f.create_dataset('latitude', data=df['latitude'].to_numpy(),
                     compression='gzip', chunks=True)
    f.create_dataset('longitude', data=df['longitude'].to_numpy(),
                     compression='gzip', chunks=True)
    f.create_dataset('depth', data=df['depth'].to_numpy(),
                     compression='gzip', chunks=True)
    f.create_dataset('magnitude', data=df['mag'].to_numpy(),
                     compression='gzip', chunks=True)

    # Add attributes (metadata)
    f.attrs['source'] = 'USGS FDSN Event Web Service'
    f.attrs['query_date'] = str(datetime.now())
    f.attrs['min_magnitude'] = 5.0
    f.attrs['description'] = 'Global earthquakes M≥5, 2021–present'

    # Extra columns: variable-length UTF-8 strings so non-ASCII place names survive
    extra_grp = f.create_group('extra')
    extra_grp.create_dataset('place', data=df['place'].fillna('').to_list(),
                             dtype=h5py.string_dtype(encoding='utf-8'),
                             compression='gzip', chunks=True)
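Once the file is written, `f.visititems()` is a quick way to sanity-check the layout, since it walks every group and dataset. A self-contained sketch (it builds its own small demo file, `quakes_demo.h5`, with synthetic values rather than reusing `earthquakes.h5`):

```python
import h5py
import numpy as np

# Build a small file mirroring the layout above (synthetic values)
with h5py.File('quakes_demo.h5', 'w') as f:
    f.create_dataset('magnitude', data=np.array([5.1, 6.0, 5.5]),
                     compression='gzip', chunks=True)
    f.create_group('extra').create_dataset('place',
                                           data=[b'Fiji', b'Chile', b'Japan'])

def describe(name, obj):
    """Print shape, dtype, and chunk layout for every dataset in the file."""
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape} dtype={obj.dtype} chunks={obj.chunks}")

with h5py.File('quakes_demo.h5', 'r') as f:
    f.visititems(describe)
```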
Extracting & analyzing from HDF5 — lazy with Dask or xarray for large files.
# Option 1: Dask array extraction (for numerical columns)
import dask.array as da

# Keep the file open while computing; Dask reads from it lazily
with h5py.File('earthquakes.h5', 'r') as f:
    mag_dask = da.from_array(f['magnitude'], chunks='auto')
    lat_dask = da.from_array(f['latitude'], chunks='auto')
    lon_dask = da.from_array(f['longitude'], chunks='auto')

    # Compute stats lazily (inside the with-block, so the file is still open)
    mean_mag = mag_dask.mean().compute()
    max_depth = f['depth'][:].max()  # eager read; wrap in Dask for big data
    print(f"Mean magnitude: {mean_mag:.2f}, Max depth: {max_depth} km")
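Partial I/O is the main reason to keep catalogs in HDF5: slicing a dataset reads only the chunks that cover the requested range. A self-contained sketch (it writes its own `slices_demo.h5` with synthetic magnitudes; names are illustrative):

```python
import h5py
import numpy as np

# Synthetic file standing in for earthquakes.h5
with h5py.File('slices_demo.h5', 'w') as f:
    f.create_dataset('magnitude',
                     data=np.linspace(5.0, 9.0, 100_000),
                     compression='gzip', chunks=True)

with h5py.File('slices_demo.h5', 'r') as f:
    dset = f['magnitude']
    # Estimate the cost of a full read WITHOUT loading anything
    full_bytes = dset.size * dset.dtype.itemsize
    # Partial I/O: only the chunks covering this slice are decompressed
    head = dset[0:10_000]
    print(f"full read would be {full_bytes} bytes; slice mean = {head.mean():.2f}")
```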
# Option 2: xarray for labeled, lazy access (recommended)
# The h5py layout above is not netCDF-compliant, so build the labeled
# Dataset by hand, with 'time' as the coordinate dimension
import dask.array as da
import xarray as xr

f = h5py.File('earthquakes.h5', 'r')  # keep open: Dask reads from it lazily
times = f['time'][:].astype('datetime64[ns]')  # int64 nanoseconds back to datetimes
ds = xr.Dataset(
    {name: ('time', da.from_array(f[name], chunks=1000))
     for name in ('latitude', 'longitude', 'depth', 'magnitude')},
    coords={'time': times},
)
print(ds)  # shows dimensions, coords, variables

# Aggregations with xarray
mean_by_year = ds['magnitude'].groupby('time.year').mean()
print(mean_by_year.compute())

# Spatial subset: lat/lon are data variables here, so mask with where()
pacific = ds.where((ds.latitude >= -60) & (ds.latitude <= 60) &
                   (ds.longitude >= -180) & (ds.longitude <= -120), drop=True)
print(pacific['magnitude'].mean().compute())
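Beyond groupby, xarray's `resample` gives calendar-aware aggregation once time is a coordinate. A self-contained sketch on a synthetic one-year catalog (random magnitudes; a real labeled Dataset works the same way):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily catalog standing in for the labeled Dataset above
times = pd.date_range('2021-01-01', periods=365, freq='D')
mags = np.random.default_rng(0).uniform(5.0, 8.0, times.size)
ds_demo = xr.Dataset({'magnitude': ('time', mags)}, coords={'time': times})

# Calendar-aware monthly mean magnitude ('MS' = month-start bins)
monthly = ds_demo['magnitude'].resample(time='MS').mean()
print(monthly.values.round(2))
```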
Visualization patterns — magnitude histogram, spatial map, time series trends.
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
# Histogram of magnitudes
plt.hist(ds['magnitude'].values, bins=50, edgecolor='black')
plt.title('Earthquake Magnitude Distribution (M≥5)')
plt.xlabel('Magnitude')
plt.ylabel('Count')
plt.show()
# Spatial scatter map
fig = plt.figure(figsize=(12, 6))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
scatter = ax.scatter(ds['longitude'].values, ds['latitude'].values,
                     c=ds['magnitude'].values, s=ds['magnitude'].values**1.5,
                     cmap='Reds', alpha=0.6, transform=ccrs.PlateCarree())
plt.colorbar(scatter, label='Magnitude')
ax.set_title('Global Earthquakes M≥5')
plt.show()
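The heading above also promises time-series trends. A minimal annual-count plot, sketched on synthetic event times (a real time coordinate drops in the same way; the Agg backend and the output filename are choices for this sketch so it runs headless):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless-safe backend for this sketch
import matplotlib.pyplot as plt

# Synthetic event times standing in for the catalog's time column
rng = np.random.default_rng(1)
ns = rng.integers(pd.Timestamp('2021-01-01').value,
                  pd.Timestamp('2025-12-31').value, 2000)
times = pd.to_datetime(ns).sort_values()

# Annual event counts ('YS' = year-start bins)
counts = pd.Series(1, index=times).resample('YS').sum()

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(counts.index.year, counts.values)
ax.set_xlabel('Year')
ax.set_ylabel('Event count')
ax.set_title('Annual earthquake counts (synthetic)')
fig.savefig('annual_counts.png')
```

A rising or falling bar sequence here is what a catalog-completeness or rate-change check would look at first.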
Best practices for HDF5 earthquake analysis:
- Use compression ('gzip'): reduces file size 50–80% on redundant columns.
- Prefer xarray + h5py/Dask: labeled, lazy access with netCDF-style semantics.
- Chunk along time: chunks of roughly 1000 events suit time-series queries.
- Add rich attributes: units, source, query parameters.
- Use partial I/O: slice datasets (dset[0:10000]) for memory efficiency.
- Use Dask for parallel stats: da.from_array(dset, chunks=...).
- Use geospatial tools: cartopy/geopandas for mapping.
- Estimate memory before reading: dset.size * dset.dtype.itemsize gives the full-read size without loading anything (dset[:].nbytes would read it all first).
- Use context managers: with h5py.File(...) guarantees the file is closed.
- Use f.visititems(): explore a file's structure.
- Test on small subsets first: dset[:1000].
- Use Polars for fast columnar queries after loading into a DataFrame.
- Use xr.apply_ufunc: apply NumPy functions to labeled data.
- Use dask.distributed: scale to clusters for very large catalogs.
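To see the compression claim concretely, one can write the same column plain versus gzip-chunked and compare file sizes. A sketch with synthetic, rounded magnitudes (rounding to one decimal adds the redundancy real catalogs have; raw random doubles barely compress, so the exact ratio here is illustrative):

```python
import os
import h5py
import numpy as np

# Rounded magnitudes: few distinct values, so gzip has redundancy to exploit
data = np.round(np.random.default_rng(0).normal(6.0, 0.5, 500_000), 1)

with h5py.File('plain.h5', 'w') as f:
    f.create_dataset('magnitude', data=data)          # contiguous, uncompressed
with h5py.File('packed.h5', 'w') as f:
    f.create_dataset('magnitude', data=data,
                     compression='gzip', compression_opts=4,
                     chunks=(100_000,))               # chunked + gzip level 4

plain_size = os.path.getsize('plain.h5')
packed_size = os.path.getsize('packed.h5')
print(f"plain: {plain_size} B, gzip: {packed_size} B "
      f"({100 * (1 - packed_size / plain_size):.0f}% smaller)")
```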
Using HDF5 files for earthquake data analysis enables efficient storage, compression, chunked access, and metadata-rich processing: download the USGS CSV, store with h5py, analyze lazily with Dask/xarray, and visualize with matplotlib/cartopy. In 2026, chunk along time, use xarray for labels, Dask for scale, and compression for size, and keep an eye on memory via the Dask dashboard. Master HDF5 workflows and you'll handle massive seismic catalogs reliably and scalably.
Next time you work with large earthquake or scientific data — store and analyze it in HDF5. It’s Python’s cleanest way to say: “Organize my big seismic dataset — hierarchically, efficiently, and ready for parallel compute.”