TimeCap

Dataset-agnostic pipeline that turns raw multi-channel sensor recordings into language-model-ready captions. The current reference implementation targets the MyHeartCounts (MHC) wearable dataset at daily and weekly resolution.

Pipeline overview

HuggingFace dataset ──► Transformer ──► Recording ──► Annotator ──► CaptionResult
                                                       │
                                                       ├── StructuralExtractor   (trends, spikes)
                                                       ├── SemanticExtractor     (active windows)
                                                       └── CrossChannelExtractor (workout / sleep)

Transformer maps a source-dataset row to the internal Recording schema (per-channel float arrays + metadata).
Annotator runs a list of CaptionExtractors and returns one Annotation per (channel, time-window, caption-type) it observes.
Caption phrasing is template-driven (templates/templates.json, templates/templates_hourly.json); each annotation deterministically picks a template variant from a row-derived seed.
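
The deterministic template choice described above can be illustrated with a minimal sketch. The template table, the row_seed scheme, and the pick_template helper are all hypothetical names made up for this example; the actual seed derivation and template format in this repository may differ. The sketch hashes identifying fields with hashlib rather than Python's built-in hash(), because the latter is salted per process and would not give reproducible captions across runs.

```python
import hashlib
import random

# Hypothetical template table for illustration only; the real phrasing
# lives in templates/templates.json and may be structured differently.
TEMPLATES = {
    "spike": [
        "{channel} spikes around {time}.",
        "A sharp peak in {channel} appears near {time}.",
        "There is a brief surge in {channel} at about {time}.",
    ],
}


def row_seed(row_id: str, channel: str, caption_type: str) -> int:
    """Derive a stable integer seed from identifying fields (assumed scheme)."""
    key = f"{row_id}|{channel}|{caption_type}".encode("utf-8")
    # Use the first 8 bytes of a SHA-256 digest as the seed.
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def pick_template(row_id: str, channel: str, caption_type: str) -> str:
    """Pick one template variant; the same inputs always give the same variant."""
    rng = random.Random(row_seed(row_id, channel, caption_type))
    return rng.choice(TEMPLATES[caption_type])


# Fill the chosen variant with values for one annotation.
caption = pick_template("user42-week3", "heart_rate", "spike").format(
    channel="heart rate", time="14:00"
)
print(caption)
```

Re-running the script prints the same caption every time, which is the property a row-derived seed is meant to provide.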

Adding a new source dataset means writing one Dataset, one Transformer, and a ChannelConfig describing channel names, units, and aggregators. The core pipeline needs no changes.
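
As a rough idea of what a channel description carries, here is a sketch that pairs each channel name with a unit and an aggregator. ChannelSpec, EXAMPLE_CHANNELS, and aggregate are hypothetical names invented for this example and are not the repository's ChannelConfig API; consult mhc/constants.py for the real definition.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence

# An aggregator reduces a window of samples to a single value.
Aggregator = Callable[[Sequence[float]], float]


@dataclass(frozen=True)
class ChannelSpec:
    """Hypothetical per-channel description: name, unit, and aggregator."""
    name: str
    unit: str
    aggregator: Aggregator


# Illustrative channels only; a rate is averaged, a count is summed.
EXAMPLE_CHANNELS: Dict[str, ChannelSpec] = {
    "heart_rate": ChannelSpec("heart_rate", "bpm", mean),
    "steps": ChannelSpec("steps", "count", sum),
}


def aggregate(channel: str, values: Sequence[float]) -> float:
    """Reduce one window of samples using the channel's own aggregator."""
    return EXAMPLE_CHANNELS[channel].aggregator(values)


print(aggregate("steps", [100, 250, 50]))   # summed over the window
print(aggregate("heart_rate", [60, 80]))    # averaged over the window
```

Keeping the aggregator next to the unit means downstream code never has to special-case whether a channel is a rate or a count.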

Setup

python3 -m pip install -r requirements.txt

Point the loader at a HuggingFace MHC export:

export MHC_DATASET_DIR=<path-to-daily-hf-dataset>     # for daily
export MHC_WEEKLY_DATASET_DIR=<path-to-weekly-hf>     # for weekly
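
If you drive the pipeline from your own Python code, it can help to fail early when the relevant variable is unset. The following is a small sketch under the assumption that each variant maps to exactly one of the two variables above; resolve_dataset_dir is a hypothetical helper, not a function from this repository.

```python
import os
from typing import Mapping, Optional

# Variant-to-variable mapping taken from the export lines above.
ENV_BY_VARIANT = {
    "daily": "MHC_DATASET_DIR",
    "weekly": "MHC_WEEKLY_DATASET_DIR",
}


def resolve_dataset_dir(variant: str,
                        environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the dataset directory for a variant, or raise a clear error."""
    env = os.environ if environ is None else environ
    if variant not in ENV_BY_VARIANT:
        raise ValueError(f"unknown variant: {variant!r}")
    var_name = ENV_BY_VARIANT[variant]
    path = env.get(var_name)
    if not path:
        raise RuntimeError(f"{var_name} is not set; export it before running")
    return path


# Example with an explicit mapping instead of the real environment.
print(resolve_dataset_dir("weekly", {"MHC_WEEKLY_DATASET_DIR": "/data/mhc_weekly"}))
```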

Generating captions

Single-process export:

python3 scripts/export_captions.py \
    --variant weekly \
    --out exports/lean_full

Common flags:

| Flag | Default | Notes |
| --- | --- | --- |
| `--variant {daily,weekly}` | `weekly` | Which MHC resolution to caption. |
| `--out <dir>` | `exports/lean_full` | Output directory for Arrow shards. |
| `--max_rows <n>` | unset | Cap row count for a smoke test. |
| `--start <i>` / `--end <j>` | full range | Slice for parallel/sharded runs. |
| `--min_wear_pct <p>` | `0.0` | (daily) Drop low-wear days. |
| `--min_valid_hours <h>` | `0` | (weekly) Drop weeks with too few valid hours. |
| `--min_active_channels <k>` | `0` | Drop rows with fewer active channels. |
| `--split_file <json>` | unset | Use canonical sharable-user splits. |

Each shard writes an Arrow file under <out>/recordings_*.arrow containing the Recording data plus all Annotations.

Parallel sharded export

For large jobs, a helper script splits the export across N Slurm jobs:

export MHC_WEEKLY_DATASET_DIR=<path>
./scripts/export_captions_sharded.sh weekly 4 exports/lean_full
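
To see how a dataset might be divided into slices for --start and --end, here is a minimal sketch that computes N contiguous row ranges. It assumes --end is exclusive and that shards should be as even as possible; shard_ranges is a hypothetical helper, and the actual logic in export_captions_sharded.sh may differ.

```python
from typing import List, Tuple


def shard_ranges(total_rows: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split [0, total_rows) into num_shards contiguous half-open ranges."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    base, extra = divmod(total_rows, num_shards)
    ranges = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards each take one additional row.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


# Print the slice arguments each job would receive for 10 rows and 4 shards.
for start, end in shard_ranges(10, 4):
    print(f"--start {start} --end {end}")
```

Because the ranges are contiguous and non-overlapping, every row is assigned to exactly one shard.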

Library usage

from annotator import Annotator
from captionizer import Captionizer
from extractors.semantic import SemanticExtractor
from extractors.structural import StructuralExtractor
from mhc.constants import MHC_CHANNEL_CONFIG
from mhc.cross_channel import default_extractor
from mhc.dataset import MHCDataset
from mhc.transformer import MHCTransformer

dataset = MHCDataset()
annotator = Annotator([
    StructuralExtractor(MHC_CHANNEL_CONFIG),
    SemanticExtractor(MHC_CHANNEL_CONFIG),
    default_extractor(MHC_CHANNEL_CONFIG),
])
captionizer = Captionizer(dataset, MHCTransformer(), annotator)
result, _ = captionizer.run(max_rows=10)

Inspecting the output

The interactive explorer lets you step through rows one at a time, switch between signals, and overlay detector events on the time series:

python3 explorer.py --min-wear-pct=50.0

WESAD

WESAD support adds a second dataset path for the same captioning pipeline. It lets you work with the raw WESAD sensor recordings, preprocess them into the stored timef format on demand, and then inspect or export them with the same annotator/explorer flow used for MHC.

Typical workflow:

python3 explorer.py --wesad
python3 scripts/export_wesad_captions.py --out exports/wesad_poc

If you already have a stored dataset, point the tools at it with --wesad-dataset-dir /path/to/wesad_timef. Otherwise, the WESAD dataset adapter falls back to raw subject folders under data/WESAD, and the transformer runs the preprocessing steps inline.
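
The fallback rule can be summarized in a few lines. This is only a sketch of the decision described above, assuming the stored dataset is used whenever a directory is given; choose_wesad_source is a hypothetical helper and does not mirror the adapter's real code, which may also validate the paths.

```python
from pathlib import Path
from typing import Optional, Tuple


def choose_wesad_source(stored_dir: Optional[str],
                        raw_root: str = "data/WESAD") -> Tuple[str, Path]:
    """Return ("stored", path) if a stored dataset dir is given, else ("raw", path)."""
    if stored_dir:
        # A prebuilt dataset was supplied (for example via --wesad-dataset-dir).
        return ("stored", Path(stored_dir))
    # No stored dataset: fall back to the raw subject folders.
    return ("raw", Path(raw_root))


print(choose_wesad_source("/path/to/wesad_timef"))
print(choose_wesad_source(None))
```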

The WESAD mode includes:

a dedicated wesad/ dataset adapter and transformer
WESAD-specific templates and synthesizers for stress, recovery, and amusement
support for either a prebuilt stored dataset or raw WESAD folders
the same downstream explorer/export pipeline used for the other datasets

Layout

captionizer.py        Orchestration (Dataset + Transformer + Annotator)
annotator.py          Runs a list of extractors over one Recording
extractors/           Caption extractors (statistical / structural / semantic / cross-channel)
detectors/            Trend and spike detectors used by structural extractor
synthesizers/         Cross-channel caption synthesizers (workout, sleep, …)
templates/            Caption phrasing templates
mhc/                  MHC daily dataset adapter
mhc_weekly/           MHC weekly dataset adapter
wesad/                WESAD dataset adapter and transformer
timef/                Internal Recording / CaptionResult schema
exporters/            Arrow shard writer
scripts/              CLI entry points (caption export, training, eval)
