Publish kegg116 KEGG artefacts + phyl_dist (v0.1.0) #29

Merged
edkerk merged 1 commit into develop from release/0.1.0-kegg116 on Jun 11, 2026

@edkerk edkerk commented Jun 10, 2026

See CHANGELOG 0.1.0.

First downloadable KEGG artefact set, wired into the runtime resolvers:

- All artefacts are gzip and version-prefixed (kegg116_<name>.gz) so MATLAB and
  Windows read them with the built-in gunzip, no external tool. organism_gene_ko
  moves from xz to gzip for the same reason.
- HMM libraries ship as one gzip-compressed, concatenated flatfile per domain;
  ensure_kegg_hmm_library decompresses and hmmpresses it on first use. The flatfile
  is ~10x smaller than the pressed index and portable across HMMER versions.
- Add a version-prefix-tolerant artefact resolver (_resolve_artefact) used by the
  organism/sequence entry points; parse_kegg_dump and build_kegg_artefacts.py gain
  an opt-in --version.
- Populate data/manifest.json and _DATA_REGISTRY with the kegg116 release assets
  (real SHA256 + bytes); refresh the maintainer docs and manifest example.
- Bump version to 0.1.0 and update CHANGELOG.
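
A minimal sketch of what a version-prefix-tolerant lookup could look like, for readers unfamiliar with the idea. This is not the project's `_resolve_artefact`; the function name `resolve_artefact`, its signature, and the fallback order are assumptions made for illustration only.

```python
import re
from pathlib import Path
from typing import Optional


def resolve_artefact(cache_dir: Path, name: str, version: Optional[str] = None) -> Path:
    """Hypothetical helper: find `<version>_<name>.gz` or a bare `<name>.gz`."""
    exact = []
    if version:
        exact.append(cache_dir / f"{version}_{name}.gz")
    exact.append(cache_dir / f"{name}.gz")
    for path in exact:
        if path.is_file():
            return path

    if version is None:
        # No version requested: accept any single-token prefix (no underscore),
        # so "ko" does not accidentally match "kegg116_organism_gene_ko.gz".
        pattern = re.compile(rf"[^_]+_{re.escape(name)}\.gz")
        matches = sorted(p for p in cache_dir.iterdir() if pattern.fullmatch(p.name))
        if matches:
            return matches[-1]  # last prefix in lexicographic order

    raise FileNotFoundError(f"No artefact matching {name!r} in {cache_dir}")
```

With an explicit version the lookup stays strict (prefixed file, then bare file); only an unversioned request falls back to whatever prefixed file is present.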

Add KEGG taxonomy artefact and phyl_dist (RAVEN getPhylDist port)

Publish kegg116_taxonomy.gz and regenerate RAVEN's keggPhylDist from it, so GECKO's
organism-distance kcat selection needs no MATLAB .mat file:

- reconstruction.kegg.phyl_dist + PhylDist faithfully reproduce RAVEN getPhylDist's
  (asymmetric, occasionally negative) distance metric; parse_taxonomy_records exposes
  ids/names/lineages and reads .gz transparently.
- data.ensure_kegg_taxonomy fetches the artefact; build_kegg_artefacts.py emits it.
- Register kegg116_taxonomy.gz in data/manifest.json and _DATA_REGISTRY (8 files).
- Tests for phyl_dist (hand-checked against RAVEN) and the taxonomy fetch; update
  migration/IMPROVEMENTS/maintainer docs and CHANGELOG.
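
One way to read a file "transparently" whether or not it is gzip-compressed is to check the two-byte gzip magic number before opening it. The sketch below illustrates that technique only; the helper name `open_text_maybe_gzip` is hypothetical, and it is not claimed to be how parse_taxonomy_records is implemented.

```python
import gzip
from pathlib import Path
from typing import IO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text_maybe_gzip(path: Path) -> IO[str]:
    """Hypothetical helper: open `path` as UTF-8 text, gunzipping if needed."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```

Sniffing the content rather than trusting the `.gz` suffix keeps the reader working if an artefact was decompressed by hand or renamed.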

Bundle core KEGG artefacts into kegg116_core.tar.gz

Combine the five core model files (reference model + KO/reaction/organism-gene/
rxn-flag tables) into one kegg116_core.tar.gz; HMM libraries and taxonomy stay
separate. The release drops from 8 assets to 4.

- ensure_kegg_data now fetches the single bundle, SHA-verifies it, and extracts the
  version-prefixed members into the cache once (safe extraction, matching download.py).
- build_kegg_artefacts.py groups the core files into the bundle after the HMM step.
- Regenerate data/manifest.json and _DATA_REGISTRY (4 entries); update manifest.example,
  tests (bundle fixture), and docs.
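
The verify-then-extract flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in ensure_kegg_data or download.py: the names `sha256_of` and `extract_bundle` are hypothetical, and "safe extraction" is interpreted here as accepting only regular files whose resolved path stays inside the destination directory.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_bundle(bundle: Path, dest: Path, expected_sha256: str) -> list[Path]:
    """Hypothetical helper: verify the bundle hash, then extract its members."""
    if sha256_of(bundle) != expected_sha256:
        raise ValueError(f"SHA256 mismatch for {bundle.name}")

    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    extracted = []
    with tarfile.open(bundle, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isdir():
                continue
            target = (root / member.name).resolve()
            # Reject links, devices, and any path that escapes the cache directory.
            if not member.isfile() or not target.is_relative_to(root):
                raise ValueError(f"Unsafe member in bundle: {member.name}")
            target.parent.mkdir(parents=True, exist_ok=True)
            source = tar.extractfile(member)
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted.append(target)
    return extracted
```

Checking the hash before opening the archive means a corrupted or tampered download is rejected before any member is written to the cache. The sketch requires Python 3.9+ for `Path.is_relative_to`.
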
@edkerk edkerk force-pushed the release/0.1.0-kegg116 branch from 346b1e4 to 3143a25 on June 10, 2026 23:10
@edkerk edkerk merged commit 599f260 into develop Jun 11, 2026
6 checks passed
@edkerk edkerk deleted the release/0.1.0-kegg116 branch June 11, 2026 10:11