HDF5 (Hierarchical Data Format version 5) is a versatile, high-performance file format and library for storing and managing large, complex scientific datasets — widely used in climate modeling, satellite imagery, genomics, simulations, particle physics, and astronomy. It supports hierarchical organization (groups that act like folders), multidimensional datasets with chunking and compression, rich metadata (attributes), parallel I/O, and partial access — making it ideal for out-of-core and distributed workflows. In 2026, HDF5 remains the de facto standard for big scientific data — powering netCDF4, NASA/ESA missions, supercomputing, and integration with NumPy, h5py, Dask, xarray, pandas, Polars, and PyTorch/TensorFlow for seamless loading and computation.
Here’s a complete, practical guide to working with HDF5 in Python using h5py: file creation/reading, groups/datasets/attributes, chunking/compression, partial I/O, real-world patterns (time series, image stacks, large simulations), and modern best practices with type hints, Dask/xarray integration, memory efficiency, and performance optimization.
Basic HDF5 file operations — create, write, read datasets and attributes.
import h5py
import numpy as np

# Create new HDF5 file (write mode)
with h5py.File('example.h5', 'w') as f:
    # Create group (like a folder)
    grp = f.create_group('measurements')

    # Create dataset with data
    data = np.random.rand(1000, 500)  # 1000 time steps × 500 sensors
    dset = grp.create_dataset('sensor_data', data=data, compression='gzip')

    # Add attributes (metadata)
    dset.attrs['units'] = 'volts'
    dset.attrs['sampling_rate'] = 100.0  # Hz
    dset.attrs['description'] = 'Multi-sensor time series'

    # Scalar dataset for global metadata (h5py stores a Python str
    # as a variable-length UTF-8 string; np.string_ was removed in NumPy 2.0)
    f.create_dataset('version', data='1.0')
# Read existing HDF5 file
with h5py.File('example.h5', 'r') as f:
    print(list(f.keys()))                  # ['measurements', 'version']
    print(list(f['measurements'].keys()))  # ['sensor_data']

    # Read dataset (full or slice)
    dset = f['measurements/sensor_data']
    sensor_data = dset[:]        # full read
    partial = dset[500:600, :]   # partial I/O

    # Read attributes
    print(dset.attrs['units'])          # 'volts'
    print(dset.attrs['sampling_rate'])  # 100.0
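When a file's layout is unfamiliar, h5py can walk the whole hierarchy for you. A minimal sketch (the filename and group/dataset names below are illustrative, mirroring the example above) using visititems, which calls a function with each object's path and handle:

```python
import h5py
import numpy as np

# Build a small file to walk (names are illustrative)
with h5py.File('explore.h5', 'w') as f:
    grp = f.create_group('measurements')
    dset = grp.create_dataset('sensor_data', data=np.zeros((10, 5)))
    dset.attrs['units'] = 'volts'

# Recursively record every object's path and kind
found = []
with h5py.File('explore.h5', 'r') as f:
    f.visititems(lambda name, obj: found.append(
        (name, 'dataset' if isinstance(obj, h5py.Dataset) else 'group')))

print(found)
```

f.visit() works the same way but passes only the path string, which is handy for a quick listing.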
Chunking & compression — critical for large datasets and partial access.
with h5py.File('chunked.h5', 'w') as f:
    # Large 3D array: time × lat × lon
    shape = (10000, 720, 1440)  # 10k time steps, 0.25° global grid
    chunks = (100, 720, 1440)   # chunk along time for time-series access
    dset = f.create_dataset(
        'temperature',
        shape=shape,
        dtype='float32',
        chunks=chunks,
        compression='gzip',
        compression_opts=4  # 0-9, higher = better compression but slower
    )

    # Write in chunks (partial write)
    for t in range(0, 10000, 100):
        chunk_data = np.random.rand(100, 720, 1440).astype('float32')
        dset[t:t+100, :, :] = chunk_data
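The payoff of chunking is that reads only decompress the chunks they touch. A runnable sketch at a reduced size (the shapes below are scaled-down stand-ins for the grid above, so it finishes quickly):

```python
import h5py
import numpy as np

# Scaled-down stand-in for the chunked grid above
with h5py.File('chunked_demo.h5', 'w') as f:
    dset = f.create_dataset('temperature', shape=(1000, 72, 144),
                            dtype='float32', chunks=(100, 72, 144),
                            compression='gzip')
    for t in range(0, 1000, 100):
        dset[t:t+100] = np.random.rand(100, 72, 144).astype('float32')

with h5py.File('chunked_demo.h5', 'r') as f:
    dset = f['temperature']
    # Chunk-aligned read: exactly one chunk is decompressed
    block = dset[200:300]      # shape (100, 72, 144)
    # One grid cell's full time series touches every chunk,
    # but only this thin slice ends up in memory
    series = dset[:, 36, 72]   # shape (1000,)

print(block.shape, series.shape)
```

Because the chunks span the full spatial grid, time-slab reads are cheap; if your dominant access were per-location time series instead, smaller spatial chunk dimensions would serve better.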
Real-world pattern: time series + multidimensional storage with metadata.
with h5py.File('climate.h5', 'w') as f:
    # Time index — HDF5 has no native datetime type, so store integer
    # days since the Unix epoch and record the encoding as attributes
    times = np.arange('2020-01-01', '2025-01-01', dtype='datetime64[D]')
    time_dset = f.create_dataset('time', data=times.astype('int64'))
    time_dset.attrs['units'] = 'days since 1970-01-01'

    # 3D data: time × lat × lon
    temp_data = (np.random.rand(len(times), 180, 360) * 30 + 273).astype('float32')  # Kelvin
    temp = f.create_dataset('temperature', data=temp_data,
                            chunks=(100, 180, 360), compression='gzip')
    temp.attrs['units'] = 'Kelvin'
    temp.attrs['long_name'] = 'Surface air temperature'

    # Attributes on file/group
    f.attrs['source'] = 'Climate model run v1.0'
    f.attrs['creation_date'] = str(np.datetime64('now'))
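For arrays too large for memory, an h5py dataset can back a Dask array directly — slices are read lazily, chunk by chunk, only when computed. A minimal sketch (filename and shapes are illustrative; the file must stay open while Dask reads from it):

```python
import h5py
import numpy as np
import dask.array as da

# Small file standing in for climate.h5 (names are illustrative)
with h5py.File('dask_demo.h5', 'w') as f:
    f.create_dataset('temperature',
                     data=np.random.rand(365, 18, 36).astype('float32'),
                     chunks=(100, 18, 36))

f = h5py.File('dask_demo.h5', 'r')  # keep open while Dask reads
# Match Dask chunks to the HDF5 chunk layout for efficient reads
arr = da.from_array(f['temperature'], chunks=(100, 18, 36))
global_mean = arr.mean().compute()  # lazy until .compute()
f.close()
```

Matching Dask's chunks to the on-disk HDF5 chunks means each task decompresses whole chunks rather than fragments of many.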
Best practices for HDF5 with h5py:
- Use chunking aligned with your access patterns (e.g., chunk along time for time-series reads).
- Modern tip: prefer xarray + Dask for netCDF4-style HDF5 files — xr.open_dataset('file.h5', engine='h5netcdf') gives labeled, lazy access.
- Use compression — 'gzip' for size, 'lzf' for speed.
- Size chunks wisely — too small = overhead, too large = memory issues.
- Add attributes liberally — store units, scale_factor, description.
- Use partial I/O — slice datasets for reading and writing.
- Use f.create_dataset(..., fillvalue=np.nan) for missing data.
- Monitor memory — dset.nbytes gives the full-read size before you load anything.
- Use Dask integration — da.from_array(h5py.File('file.h5')['dataset'], chunks=...).
- Always use with h5py.File(...) as f for safe file handling.
- Test partial reads — dset[0:1000, :, :].
- Use f.visit() or f.visititems() to explore structure.
- Keep metadata in attrs, not in dataset names.
- Use Python str (or np.bytes_) for text attributes — np.string_ was removed in NumPy 2.0.
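Two of the practices above — fillvalue for missing data and checking read size before loading — fit in one short sketch (filenames and shapes are illustrative):

```python
import h5py
import numpy as np

with h5py.File('fill_demo.h5', 'w') as f:
    # NaN fill marks unwritten cells as missing data
    dset = f.create_dataset('obs', shape=(100, 50), dtype='float64',
                            fillvalue=np.nan)
    dset[:10] = 1.0  # only the first 10 rows are ever written

with h5py.File('fill_demo.h5', 'r') as f:
    dset = f['obs']
    # Check the cost of a full read before materializing it
    full_read_bytes = dset.size * dset.dtype.itemsize  # 100 * 50 * 8
    data = dset[:]
    n_missing = int(np.isnan(data).sum())  # 90 unwritten rows * 50 cols

print(full_read_bytes, n_missing)
```

Unwritten chunks of a fillvalue dataset occupy no space on disk, so sparse or incrementally filled arrays stay compact.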
HDF5 with h5py enables hierarchical, chunked, compressed, metadata-rich storage for large multidimensional data — create groups/datasets/attributes, use chunking/compression, read/write partially. In 2026, use xarray for labeled access, Dask for parallel scale, compression for size, and monitor with nbytes. Master HDF5, and you’ll store and retrieve massive scientific datasets efficiently and reliably.
Next time you need to save or load large structured data — use HDF5. It’s Python’s cleanest way to say: “Store my complex arrays — hierarchically, efficiently, with metadata.”