Dataset-agnostic pipeline that turns raw multi-channel sensor recordings into language-model-ready captions. The current reference implementation targets the MyHeartCounts (MHC) wearable dataset at daily and weekly resolution.

    HuggingFace dataset ──► Transformer ──► Recording ──► Annotator ──► CaptionResult
                                                              │
                                                              ├── StructuralExtractor   (trends, spikes)
                                                              ├── SemanticExtractor     (active windows)
                                                              └── CrossChannelExtractor (workout / sleep)

- `Transformer` maps a source-dataset row to the internal `Recording` schema (per-channel float arrays + metadata).
- `Annotator` runs a list of `CaptionExtractor`s and returns one `Annotation` per (channel, time-window, caption-type) it observes.
- Caption phrasing is template-driven (`templates/templates.json`, `templates/templates_hourly.json`); each annotation deterministically picks a template variant from a row-derived seed.
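
As an illustration of the last point, here is a minimal sketch of deterministic variant selection. The template strings, the `pick_template` helper, and the way the seed is combined with the annotation identity are all assumptions made for this example; the actual templates live in `templates/templates.json` and the real selection logic may differ.

```python
import hashlib

# Hypothetical template variants, for illustration only.
TEMPLATES = {
    "trend_up": [
        "{channel} rises steadily between {start} and {end}.",
        "From {start} to {end}, {channel} trends upward.",
        "{channel} shows an upward trend from {start} to {end}.",
    ],
}


def pick_template(caption_type: str, row_seed: int, channel: str, window: tuple[int, int]) -> str:
    """Pick one variant deterministically from the row seed and annotation identity."""
    variants = TEMPLATES[caption_type]
    key = f"{row_seed}|{channel}|{window[0]}-{window[1]}|{caption_type}".encode()
    # A hashlib digest is stable across processes, unlike the built-in hash() on str.
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]


template = pick_template("trend_up", row_seed=1234, channel="heart_rate", window=(8, 11))
caption = template.format(channel="heart_rate", start="08:00", end="11:00")
print(caption)
```

Hashing the seed together with the channel, window, and caption type means re-running an export reproduces the same phrasing, while different annotations in the same row can still receive different variants.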

Adding a new source dataset means writing one `Dataset`, one `Transformer`, and
a `ChannelConfig` describing channel names, units, and aggregators, with no
changes to the core pipeline.
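
The three pieces could look roughly like the sketch below. Every name here (`ChannelSpec`, `ToyDataset`, `ToyTransformer`, `TOY_CHANNEL_CONFIG`, and the stand-in `Recording` class) is hypothetical and self-contained; the real base classes and the real `Recording` schema in this repository may have different fields and method names.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import Callable, Iterator


@dataclass(frozen=True)
class ChannelSpec:
    # One entry of a hypothetical ChannelConfig.
    name: str
    unit: str
    aggregator: Callable[[list[float]], float]  # how a window of samples is summarised


@dataclass
class Recording:
    # Stand-in for the internal schema: per-channel float arrays plus metadata.
    channels: dict[str, list[float]]
    metadata: dict[str, str] = field(default_factory=dict)


# Hypothetical channel configuration for an imaginary two-channel dataset.
TOY_CHANNEL_CONFIG = [
    ChannelSpec("heart_rate", "bpm", fmean),
    ChannelSpec("steps", "count", sum),
]


class ToyDataset:
    """Yields raw source rows; a real adapter would read a HuggingFace dataset."""

    def __iter__(self) -> Iterator[dict]:
        yield {"id": "row-0", "hr": [61, 64, 90], "step_count": [0, 12, 250]}


class ToyTransformer:
    """Maps one source row to a Recording."""

    def transform(self, row: dict) -> Recording:
        return Recording(
            channels={
                "heart_rate": [float(v) for v in row["hr"]],
                "steps": [float(v) for v in row["step_count"]],
            },
            metadata={"row_id": row["id"]},
        )


recordings = [ToyTransformer().transform(row) for row in ToyDataset()]
# Apply each channel's aggregator to the first recording.
summary = {
    spec.name: spec.aggregator(recordings[0].channels[spec.name])
    for spec in TOY_CHANNEL_CONFIG
}
print(summary)
```

Keeping source-specific column names (`hr`, `step_count`) inside the transformer is what lets the downstream extractors work only with canonical channel names.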

    python3 -m pip install -r requirements.txt

Point the loader at a HuggingFace MHC export:

    export MHC_DATASET_DIR=<path-to-daily-hf-dataset>   # for daily
    export MHC_WEEKLY_DATASET_DIR=<path-to-weekly-hf>   # for weekly

Single-process export:

    python scripts/export_captions.py \
        --variant weekly \
        --out exports/lean_full

Common flags:

| Flag | Default | Notes |
|---|---|---|
| `--variant {daily,weekly}` | `weekly` | Which MHC resolution to caption. |
| `--out <dir>` | `exports/lean_full` | Output directory for Arrow shards. |
| `--max_rows <n>` | unset | Cap row count for a smoke test. |
| `--start <i>` / `--end <j>` | full range | Slice for parallel/sharded runs. |
| `--min_wear_pct <p>` | `0.0` | (daily) drop low-wear days. |
| `--min_valid_hours <h>` | `0` | (weekly) drop weeks with too few valid hours. |
| `--min_active_channels <k>` | `0` | Drop rows with fewer active channels. |
| `--split_file <json>` | unset | Use canonical sharable-user splits. |
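
To make the three quality filters in the table concrete, the following sketch shows one way they could combine. The `RowStats` fields and the `keep_row` helper are hypothetical, and treating each threshold as an inclusive lower bound is an assumption; the export script may compute these quantities and compare them differently.

```python
from dataclasses import dataclass


@dataclass
class RowStats:
    # Hypothetical per-row summary; field names are illustrative only.
    wear_pct: float       # daily: percent of the day with device wear
    valid_hours: int      # weekly: hours with usable data
    active_channels: int  # channels with at least one non-missing sample


def keep_row(
    stats: RowStats,
    variant: str,
    min_wear_pct: float = 0.0,
    min_valid_hours: int = 0,
    min_active_channels: int = 0,
) -> bool:
    """Return True if the row passes the filters that apply to this variant."""
    # Assumption: the wear filter only applies to daily rows.
    if variant == "daily" and stats.wear_pct < min_wear_pct:
        return False
    # Assumption: the valid-hours filter only applies to weekly rows.
    if variant == "weekly" and stats.valid_hours < min_valid_hours:
        return False
    return stats.active_channels >= min_active_channels


day = RowStats(wear_pct=42.0, valid_hours=10, active_channels=5)
print(keep_row(day, "daily", min_wear_pct=50.0))   # low-wear day is dropped
print(keep_row(day, "daily"))                      # defaults keep every row
```

With the defaults from the table (`0.0`, `0`, `0`) no row is dropped, which matches the flags being opt-in.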
Each shard writes an Arrow file under `<out>/recordings_*.arrow` containing the
`Recording` data plus all `Annotation`s.
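
Because each shard covers a `--start`/`--end` slice, a parallel run needs a set of non-overlapping row ranges. The sketch below shows one common way to compute them. The `shard_ranges` helper is hypothetical, and it assumes `--end` is exclusive; check the export script before relying on either detail.

```python
def shard_ranges(num_rows: int, num_shards: int) -> list[tuple[int, int]]:
    """Split [0, num_rows) into contiguous half-open (start, end) ranges."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(num_rows, num_shards)
    ranges = []
    start = 0
    for shard in range(num_shards):
        # The first `extra` shards take one additional row each.
        end = start + base + (1 if shard < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


for start, end in shard_ranges(10, 4):
    # Illustrative command line only; the flags are those listed in the table above.
    print(f"python scripts/export_captions.py --variant weekly --start {start} --end {end}")
```

Spreading the remainder over the first shards keeps shard sizes within one row of each other, so no single job runs much longer than the rest.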

For large jobs, the helper script splits the dataset into N Slurm jobs:

    export MHC_WEEKLY_DATASET_DIR=<path>
    ./scripts/export_captions_sharded.sh weekly 4 exports/lean_full

The same pipeline can be driven from Python:

    from annotator import Annotator
    from captionizer import Captionizer
    from extractors.semantic import SemanticExtractor
    from extractors.structural import StructuralExtractor
    from mhc.constants import MHC_CHANNEL_CONFIG
    from mhc.cross_channel import default_extractor
    from mhc.dataset import MHCDataset
    from mhc.transformer import MHCTransformer

    dataset = MHCDataset()
    annotator = Annotator([
        StructuralExtractor(MHC_CHANNEL_CONFIG),
        SemanticExtractor(MHC_CHANNEL_CONFIG),
        default_extractor(MHC_CHANNEL_CONFIG),
    ])
    captionizer = Captionizer(dataset, MHCTransformer(), annotator)
    result, _ = captionizer.run(max_rows=10)

The interactive explorer steps through one row at a time, switches signals, and overlays detector events on the time series:

    python explorer.py --min-wear-pct=50.0

WESAD support adds a second dataset path for the same captioning pipeline. It
lets you work with the raw WESAD sensor recordings, preprocess them into the
stored `timef` format on demand, and then inspect or export them with the same
annotator/explorer flow used for MHC.
Typical workflow:

    python3 explorer.py --wesad
    python3 scripts/export_wesad_captions.py --out exports/wesad_poc

If you already have a stored dataset, you can still point the tools at it with
`--wesad-dataset-dir /path/to/wesad_timef`. If not, the WESAD dataset adapter
will fall back to raw subject folders under `data/WESAD` and the transformer
will run the preprocessing steps inline.
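
The fallback described above can be pictured as a small resolution step. This is only a sketch: `resolve_wesad_source` is a hypothetical helper, and falling back to the raw folders when the given directory does not exist (rather than raising an error) is an assumption about behaviour the text does not specify.

```python
from pathlib import Path
from typing import Optional

RAW_WESAD_ROOT = Path("data/WESAD")  # raw subject folders, as named in the text


def resolve_wesad_source(
    dataset_dir: Optional[str],
    raw_root: Path = RAW_WESAD_ROOT,
) -> tuple[str, Path]:
    """Return ("stored", path) for a usable stored dataset, else ("raw", raw_root)."""
    if dataset_dir is not None:
        stored = Path(dataset_dir)
        if stored.is_dir():
            return "stored", stored
    # Assumption: a missing or unset directory falls back to raw folders,
    # where preprocessing would then run inline.
    return "raw", raw_root


mode, path = resolve_wesad_source(None)
print(mode, path)
```

Returning the mode alongside the path lets the caller decide whether the preprocessing step is needed without re-checking the filesystem.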
The WESAD mode includes:
- a dedicated `wesad/` dataset adapter and transformer
- WESAD-specific templates and synthesizers for stress, recovery, and amusement
- support for either a prebuilt stored dataset or raw WESAD folders
- the same downstream explorer/export pipeline used for the other datasets

Repository layout:

    captionizer.py    Orchestration (Dataset + Transformer + Annotator)
    annotator.py      Runs a list of extractors over one Recording
    extractors/       Caption extractors (statistical / structural / semantic / cross-channel)
    detectors/        Trend and spike detectors used by structural extractor
    synthesizers/     Cross-channel caption synthesizers (workout, sleep, …)
    templates/        Caption phrasing templates
    mhc/              MHC daily dataset adapter
    mhc_weekly/       MHC weekly dataset adapter
    timef/            Internal Recording / CaptionResult schema
    exporters/        Arrow shard writer
    scripts/          CLI entry points (caption export, training, eval)