
Add HDF5 metadata sidecar support (hdf5-meta extension) #157

Open
daltoncass wants to merge 2 commits into sigmf:main from daltoncass:feature/hdf5-meta

Conversation

@daltoncass


Summary

Reference implementation of the optional hdf5-meta SigMF extension: a columnar HDF5 sidecar alongside the .sigmf-meta JSON, for faster and smaller column-oriented metadata access on Recordings with large captures/annotations.

Spec: see the companion PR sigmf/SigMF#355 (adding extensions/hdf5-meta.sigmf-ext.md). Proposal/routing discussion: sigmf/SigMF#354.

What's added

  • sigmf/hdf5.py
    • write_hdf5_sidecar(metadata, path) — write the columnar sidecar.
    • read_hdf5_sidecar(path) -> dict — full metadata dict (compat path).
    • SigMFFileHDF5: a lazy, columnar reader that keeps the .h5 open and serves captures/annotations as numpy columns or structured arrays without building per-row dicts, which is where the speedup comes from. Columnar accessors include annotations_column("core:label"), annotations_array(), and num_annotations(). The compatibility bridges get_annotations(), get_captures(), and to_sigmffile() materialize dicts on demand.
    • hdf5.open(path) — open a sidecar directly, zero JSON reads (tight loop).
    • hdf5.fromfile(meta_path) — discover via one JSON read; return the fast reader when a usable, fresh sidecar exists, else a normal SigMFFile.
  • SigMFFile.tofile(write_hdf5=True) — writes the sidecar and declares the extension (core:extensions + hdf5-meta:file). Ignored for archive targets.
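
The discovery step described for hdf5.fromfile can be pictured roughly as follows. This is a minimal sketch, not the code in sigmf/hdf5.py: the helper name find_sidecar is hypothetical, and it assumes that hdf5-meta:file lives in the global object and is resolved relative to the metadata file. Only the extension name, the hdf5-meta:file key, and the core:extensions declaration come from the description above.

```python
import json
from pathlib import Path
from typing import Optional

EXT_NAME = "hdf5-meta"       # extension name used in this PR
FILE_KEY = "hdf5-meta:file"  # key named in this PR; placement in "global" is assumed


def find_sidecar(meta_path: str) -> Optional[Path]:
    """Return the declared sidecar path if it exists, else None (hypothetical helper)."""
    meta_file = Path(meta_path)
    with meta_file.open(encoding="utf-8") as handle:
        metadata = json.load(handle)  # the single JSON read

    global_info = metadata.get("global", {})
    # core:extensions is a list of objects, each carrying a "name" field.
    declared = any(
        isinstance(ext, dict) and ext.get("name") == EXT_NAME
        for ext in global_info.get("core:extensions", [])
    )
    sidecar_name = global_info.get(FILE_KEY)
    if not declared or not sidecar_name:
        return None

    # Assumption: the sidecar name is resolved relative to the metadata file.
    candidate = meta_file.parent / sidecar_name
    return candidate if candidate.is_file() else None
```

A caller would fall back to the plain JSON reader whenever this returns None, and would still need the freshness check before trusting the sidecar.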

Compatibility

  • sigmf.fromfile is unchanged — it always reads pure JSON. The sidecar is only ever touched through the explicit sigmf.hdf5.* entry points. The JSON file remains authoritative; existing behavior is byte-for-byte preserved.
  • h5py is an optional dependency (pip install sigmf[hdf5]) and is imported lazily. Without it, the HDF5 entry points raise an error with a clear install hint, except hdf5.fromfile, which falls back to JSON.
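
One common way to get this lazy-import behavior is sketched below. The helper names require_optional and open_sidecar are hypothetical and are not taken from the PR; the only h5py call used is h5py.File(path, "r"), and the message wording is illustrative.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, or raise with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with: pip install sigmf[{extra}]"
        ) from exc


def open_sidecar(path: str):
    """Hypothetical entry point: h5py is imported only when this is called."""
    h5py = require_optional("h5py", "hdf5")
    return h5py.File(path, "r")
```

Because the import happens inside the function, importing the package itself never fails on machines without h5py; only the HDF5 code paths do.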

Stale-sidecar guard

write_hdf5_sidecar stores a SHA-512 digest of the authoritative JSON as a root attribute. hdf5.fromfile(verify=True) (default) compares it against the parsed JSON and falls back to JSON with a warning on mismatch.
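
The guard can be illustrated with the standard library alone. This is a sketch under one stated assumption: the digest is taken over a canonical serialization of the parsed metadata (sorted keys, compact separators). The PR does not spell out its canonicalization, and the function names here are hypothetical.

```python
import hashlib
import json


def metadata_digest(metadata: dict) -> str:
    """SHA-512 hex digest of a canonical JSON serialization (assumed scheme)."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha512(canonical.encode("utf-8")).hexdigest()


def sidecar_is_fresh(stored_digest: str, metadata: dict) -> bool:
    """True when the digest stored in the sidecar matches the current JSON."""
    return stored_digest == metadata_digest(metadata)


original = {"global": {"core:datatype": "cf32_le"}, "captures": [], "annotations": []}
stored = metadata_digest(original)  # what the writer would store as a root attribute

# Someone edits the JSON after the sidecar was written.
edited = {
    "global": {"core:datatype": "cf32_le"},
    "captures": [],
    "annotations": [{"core:sample_start": 0}],
}

fresh_before_edit = sidecar_is_fresh(stored, original)  # sidecar still usable
fresh_after_edit = sidecar_is_fresh(stored, edited)     # stale: fall back to JSON
```

Hashing a canonical form rather than raw file bytes means that whitespace or key-order changes in the JSON do not invalidate the sidecar, while any change to the content does.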

Performance (100k annotations)

  • The sidecar is ~27% of the size of the JSON metadata.
  • A single-column read is ~2.6x faster than parsing the full JSON, and vectorized filtering on a structured array also beats a full JSON parse. The win comes from not materializing dicts, so the dict-compat accessors are deliberately the slow, opt-in path.
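
To make the dict-versus-column distinction concrete, here is a small NumPy-only sketch. It does not use the sidecar or the SigMFFileHDF5 reader; the annotation values and the dtype layout are made up for illustration, and the table merely stands in for what annotations_array() is described as returning.

```python
import numpy as np

# Illustrative annotations in the row-oriented (list of dicts) form.
annotations = [
    {"core:sample_start": 0, "core:sample_count": 100, "core:label": "a"},
    {"core:sample_start": 100, "core:sample_count": 300, "core:label": "b"},
    {"core:sample_start": 400, "core:sample_count": 50, "core:label": "c"},
    {"core:sample_start": 1000, "core:sample_count": 500, "core:label": "d"},
]

# Row-oriented filter: touches one dict per annotation.
long_rows = [
    a["core:sample_start"] for a in annotations if a["core:sample_count"] >= 200
]

# Columnar form: one structured array with a field per key (assumed layout).
dtype = np.dtype(
    [
        ("core:sample_start", np.int64),
        ("core:sample_count", np.int64),
        ("core:label", "U16"),
    ]
)
table = np.array(
    [
        (a["core:sample_start"], a["core:sample_count"], a["core:label"])
        for a in annotations
    ],
    dtype=dtype,
)

# Vectorized filter: a boolean mask over one column selects from another.
mask = table["core:sample_count"] >= 200
long_cols = table["core:sample_start"][mask]
```

Both filters select the same annotations; the columnar version does so without creating any per-row Python objects, which is the effect the numbers above attribute the speedup to.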

Tests

tests/test_hdf5.py (15 tests, skipped cleanly when h5py is absent): exact round-trip incl. heterogeneous annotations / nested objects / booleans / arrays, columnar reads, zero-JSON open, discovery fromfile, stale/corrupt/missing sidecar fallback, structured arrays, overwrite guard, empty arrays.

Usage

import sigmf
from sigmf import hdf5

meta.tofile("recording", write_hdf5=True)          # meta: an existing SigMFFile; also writes recording.sigmf-meta.h5

with hdf5.open("recording.sigmf-meta.h5") as fast:  # zero JSON, columnar
    starts = fast.annotations_column("core:sample_start")
    table  = fast.annotations_array()

fast = hdf5.fromfile("recording.sigmf-meta")        # discover, prefer fresh sidecar
meta = sigmf.fromfile("recording.sigmf-meta")       # unchanged: pure JSON

Cass Dalton and others added 2 commits June 12, 2026 17:35
Implements the optional hdf5-meta extension: a columnar HDF5 sidecar
alongside the .sigmf-meta JSON for faster, smaller column-oriented
metadata access on Recordings with large captures/annotations arrays.

- sigmf/hdf5.py: writer (write_hdf5_sidecar), full-dict reader
  (read_hdf5_sidecar), and SigMFFileHDF5 — a lazy, columnar reader that
  serves captures/annotations as numpy columns/arrays without building
  per-row dicts (the actual speedup). Entry points hdf5.open (zero JSON)
  and hdf5.fromfile (discover via one JSON read, prefer fresh sidecar).
- SigMFFile.tofile(write_hdf5=True): writes the sidecar and declares the
  extension. sigmf.fromfile is unchanged and always reads pure JSON, so
  existing behavior and the authoritative-JSON contract are preserved.
- Stale-sidecar guard via a source-metadata digest; h5py is an optional
  dependency (pip install sigmf[hdf5]), lazily imported.
- tests/test_hdf5.py: round-trip, columnar reads, discovery, stale/
  corrupt/missing fallback, edge cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a forward path to complement fromfile(): generate_sidecar() reads an
existing .sigmf-meta JSON file, writes the columnar .h5 sidecar alongside
it, and declares the hdf5-meta extension in the JSON so fromfile() can
discover and digest-verify it. A new sigmf_hdf5 console entry point wraps
this for batch use on the command line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>