SysBioChalmers · edkerk · Jun 10, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,11 +4,25 @@ Milestones in the raven-python port. For function-level status see
 [docs/raven_migration.md](https://github.com/SysBioChalmers/raven-python/blob/develop/docs/reference/migration.md); for open work see
 [docs/todo.md](https://github.com/SysBioChalmers/raven-python/blob/develop/docs/reference/todo.md).
 
-## Unreleased
-
-Post-release review pass — cobra-aligned hardening (no behaviour change on
-well-formed inputs). Highlights:
-
+## 0.1.0 — 2026-06-10
+
+First release with **published, downloadable KEGG artefacts**, plus a cobra-aligned
+hardening pass (no behaviour change on well-formed inputs). Highlights:
+
+* **KEGG artefacts published (`kegg116`):** `ensure_kegg_data` /
+  `ensure_kegg_hmm_library` fetch version-pinned, SHA256-verified files from the
+  GitHub release. Every artefact is **gzip + version-prefixed**
+  (`kegg116_<name>.gz`) so MATLAB and Windows read them with the built-in `gunzip`
+  (no external tool) — `organism_gene_ko` moved from xz to gzip for this. The **HMM
+  libraries ship as one gzip concatenated flatfile per domain**
+  (`kegg116_<domain>.hmm.gz`); the client decompresses and `hmmpress`-es once on
+  first use, cutting the download ~10× versus the pressed index and letting the
+  same artefact serve MATLAB RAVEN.
+* **Taxonomy + phylogenetic distance:** publish `kegg116_taxonomy.gz` and add
+  `reconstruction.kegg.phyl_dist` (with `PhylDist`), a faithful port of RAVEN's
+  `getPhylDist` that regenerates the `keggPhylDist` distance matrix from the
+  taxonomy file — so GECKO's organism-distance kcat selection needs no MATLAB
+  `.mat`. `ensure_kegg_taxonomy` fetches the artefact.
 * **Packaging:** `raven_python.__version__` now derives from the installed package
   metadata (`importlib.metadata`) instead of a hard-coded literal that had drifted
   to `0.0.1`; the docs site reported the wrong version. Pinned `ruff==0.15.15` in

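A minimal sketch of the decompress-then-press step that the KEGG artefacts bullet in the CHANGELOG hunk describes. It is an illustration under assumptions, not the code added in this PR: the helper names `decompress_once`, `is_pressed` and `ensure_pressed` are hypothetical, and it assumes HMMER's `hmmpress` is on `PATH` and writes its usual `.h3m`/`.h3i`/`.h3f`/`.h3p` index files next to the flatfile.

```python
import gzip
import shutil
import subprocess
from pathlib import Path

# Index files that hmmpress writes next to the flatfile.
PRESSED_SUFFIXES = (".h3m", ".h3i", ".h3f", ".h3p")


def decompress_once(gz_path: Path) -> Path:
    """Gunzip `<name>.hmm.gz` to `<name>.hmm` beside it; skip if already done."""
    hmm_path = gz_path.with_suffix("")  # strips only the trailing ".gz"
    if not hmm_path.exists():
        # Write to a temporary name first so an interrupted run never
        # leaves a truncated flatfile under the final name.
        tmp_path = hmm_path.with_name(hmm_path.name + ".part")
        with gzip.open(gz_path, "rb") as src, open(tmp_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        tmp_path.replace(hmm_path)
    return hmm_path


def is_pressed(hmm_path: Path) -> bool:
    """True when every hmmpress index file already exists."""
    return all(Path(str(hmm_path) + s).exists() for s in PRESSED_SUFFIXES)


def ensure_pressed(gz_path: Path) -> Path:
    """Decompress and press on first use; later calls are no-ops."""
    hmm_path = decompress_once(gz_path)
    if not is_pressed(hmm_path):
        # Assumption: `hmmpress` is installed; -f overwrites stale index files.
        subprocess.run(["hmmpress", "-f", str(hmm_path)], check=True)
    return hmm_path
```

Shipping only the gzipped flatfile keeps the download small, and the pressed index is rebuilt locally once per cache directory.
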
diff --git a/IMPROVEMENTS.md b/IMPROVEMENTS.md
@@ -92,13 +92,13 @@ and `taxonomy.py` (3b.3). Maintainer-side, build-time tooling (PLAN.md §2.3b).
 | K5 | EFFICIENCY (portability) | raven-python 🔨 | 🔨 | **KEGG download in pure Python stdlib** (`urllib`/`tarfile`/`gzip`/`netrc`), porting `fetch_keggdb.sh`. Drops the script's `wget`/`tar`/`gunzip` (and Cygwin-on-Windows) requirement, so it runs unchanged on Linux/macOS/Windows; tar extraction uses the `data` filter (no path traversal); same `~/.netrc` credential hygiene. The arrange step is split out (`extract_kegg_dump`) so it's network-free and unit-tested. |
 | K6 | EFFICIENCY | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Per-KO multi-FASTA via a stdlib offset index** (`_index_fasta` → seek), replacing `constructMultiFasta`'s Java-`Hashtable` byte scan with 5M-element preallocation. One streaming pass, only wanted ids retained; no MATLAB/Java heap tuning. |
 | K7 | EFFICIENCY | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Concatenate per-KO HMMs and `hmmpress` into one pressed library**, so the query path (3b.5) runs a single `hmmscan` against the database instead of RAVEN's thousands of per-KO `hmmsearch` invocations. |
-| K8 | EFFICIENCY (scope) | raven-python 🔨 | 🔨 | **Drop the `getPhylDist` distance matrix.** Its only uses in RAVEN were per-organism HMM-sequence subsampling (`maxPhylDist`/`nSequences`) and the kingdom filter. Our fixed prok90/euk90 libraries (3b.3) remove the subsampling rationale, and domain mode (3b.4) uses the taxonomy domain classification directly — so the O(n²) matrix is never built. Simpler, faster, less code. |
+| K8 | EFFICIENCY (scope) | raven-python 🔨 | 🔨 | **Drop the `getPhylDist` distance matrix.** Its only uses in RAVEN were per-organism HMM-sequence subsampling (`maxPhylDist`/`nSequences`) and the kingdom filter. Our fixed prok90/euk90 libraries (3b.3) remove the subsampling rationale, and domain mode (3b.4) uses the taxonomy domain classification directly — so the *reconstruction* path never builds the O(n²) matrix. The matrix itself remains available on demand via `taxonomy.phyl_dist`, which regenerates RAVEN's `keggPhylDist` from the published `taxonomy` artefact for GECKO's organism-distance kcat selection (no `.mat` needed). |
 | K9 | EFFICIENCY (memory) | raven-python 🔨 | 🔨 | **Stream `organism_gene_ko` to disk** in `parse_kegg_dump` instead of building it in memory. Real KEGG has **9.05M** gene↔KO associations; the in-memory DataFrame build OOMs in a few GB. Streaming (now via the external merge sort of K14) runs the full parse with flat, bounded peak memory. (Found by validating against a real KEGG FTP dump.) |
 | K10 | EFFICIENCY (size) | raven-python 🔨 | 🔨 | **Reference model as gzipped RAVEN/cobra YAML** (`reference_model.yml.gz`) rather than SBML: RAVEN-native, MATLAB-readable, and ~1.1 MB vs ~30 MB SBML for the real 12k-reaction model. Made `io/yaml.py` gzip-aware on a `.gz` suffix (general-purpose). |
 | K11 | ERGONOMICS | raven-python 🔨 | 🔨 | **`ensure_data`** (`data.py`): version-pinned registry that fetches/verifies/caches the published KEGG artefacts under `~/.cache/raven-python/data/`, mirroring `ensure_binary`. End users get a draft model with no KEGG access and no manual data handling — the `…_from_artefacts` entry points auto-fetch when no local dir is supplied. |
 | K12 | EFFICIENCY | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Fast MAFFT (FFT-NS-2) for HMM training** instead of RAVEN's `--auto`, which selects slow iterative refinement (`dvtditr`) on medium/large KOs — observed ~2.5 min/KO (days for a domain) on real KEGG 118. FFT-NS-2 (`--retree 2 --maxiterate 0`) is seconds/KO and ample for profile-HMM building. **PartTree cutover is residue-based and memory-auto-tuned**: MAFFT memory tracks residues (count × length), not sequence count, so a count threshold let long-protein KOs (K00901: 2,788 seqs, 2.55 M residues) OOM under FFT-NS-2 — measured ~5 GB MAFFT RSS with FFT-NS-2 vs **0.69 GB with PartTree** for the same alignment. The cutover is **length-aware and memory-auto-tuned**: FFT-NS-2 memory is driven by the progressive-alignment **DP cost ≈ n_seqs × mean_len²** (= residues²/n_seqs), *not* residue count — a few hundred long proteins cost far more than the same residues in many short ones. (First tried a residue-only model `RSS≈1.32R²+1.84R`; it then OOM'd on K12047 — 452 seqs but mean length 2082, 0.94 M residues — because long proteins blow the per-residue cost.) Calibrated `RSS_GB ≈ 4.2e-9 × (n_seqs × mean_len²)` across real KEGG KOs (250k/266→0.67 GB … 1.5M/1624→5.73 GB; K12047 cost 1.96e9 = the largest, hence its OOM). `_auto_cost_budget` switches to PartTree when the DP cost exceeds `0.65 × (total − 2.5 GB overhead) / 4.2e-9` (≈7.9e8 on a 7.6 GB box), **warns on low-memory hosts**, and `parttree_residues` overrides with a manual residue cutoff. Back-portable to RAVEN. |
 | K13 | EFFICIENCY | raven-python 🗑️ | 🗑️ | ~~Per-KO sequence cap (`max_sequences`)~~ — **removed.** Briefly added as a count-based cap, but the residue-based PartTree cutover (K12) bounds MAFFT memory without dropping any sequences, so the cap was redundant complexity. All deduplicated sequences are kept. |
-| K14 | EFFICIENCY (size) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Sort `organism_gene_ko` by `(organism, gene)` and store it xz-compressed** (`organism_gene_ko.tsv.xz`), cutting the dominant artefact **≈78 → 27 MB (2.9×)**. Gene IDs within an organism share long prefixes (locus tags, numeric runs), so sorting makes them adjacent and far more compressible (sort alone: 78→48 MB; xz vs gzip captures the cross-row redundancy gzip's 32 KB window misses: →27 MB). The sort is an **external merge sort** bounded to `chunk_rows` rows in memory (sorted runs spooled to gzipped temp files, merged with `heapq.merge`), so it keeps K9's flat memory profile. Both `lzma` and `gzip` are Python stdlib (native on Windows/macOS/Linux, no extra binary); small tables stay gzipped TSV (MATLAB-native), only the big one is xz (MATLAB needs an external `unxz`). Sorted order also matches the by-organism query in `get_kegg_model_for_organism`, enabling a future `searchsorted` slice instead of loading all 9M rows. Back-portable to RAVEN. |
+| K14 | EFFICIENCY (size) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Sort `organism_gene_ko` by `(organism, gene)` and store it gzipped** (`organism_gene_ko.tsv.gz`), cutting the dominant artefact **≈78 → 48 MB** by sorting alone. Gene IDs within an organism share long prefixes (locus tags, numeric runs), so sorting makes them adjacent and far more compressible. The sort is an **external merge sort** bounded to `chunk_rows` rows in memory (sorted runs spooled to gzipped temp files, merged with `heapq.merge`), so it keeps K9's flat memory profile. We first xz-compressed this file (≈27 MB, 2.9×) but switched to **gzip** (≈48 MB) so MATLAB reads it with built-in `gunzip` and no external `unxz`: the artefacts are shared with MATLAB RAVEN, and the once-per-release size cut wasn't worth a MATLAB toolchain dependency. All tables are now gzipped TSV, native on Windows/macOS/Linux. Sorted order also matches the by-organism query in `get_kegg_model_for_organism`, enabling a future `searchsorted` slice instead of loading all 9M rows. Back-portable to RAVEN. |
 | K15 | ERGONOMICS (correctness) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Recalibrate the HMM-query KO-assignment defaults** (`assign_kos`): cut-off `1e-50 → 1e-30`, `min_score_ratio_g 0.8 → 0.9`; `min_score_ratio_ko` left at 0.3 but **documented as empirically inert**. Cross-validated the full 3b.5 pipeline against the true KEGG gene→KO annotation of four organisms across both libraries and the well-/lesser-studied axis — *S. cerevisiae*, *Cyanidioschyzon merolae* (red alga), *E. coli* K-12, *Mycoplasma genitalium* (minimal genome). Real annotations score overwhelmingly (median E ≈ 1e-100…1e-155; even the weakest 1% ≈ 1e-15…1e-36) while spurious hits cluster at ≈1e-8 — a ~20-order-of-magnitude gap. RAVEN's `1e-50` therefore sits **inside the true-positive tail** and silently drops real-but-divergent hits for no noise-rejection gain: gene→KO recall on *M. genitalium* was only 0.84 (reaction recall 0.87). At `1e-30` + `ratio_g=0.9`: *M. genitalium* recall **0.84→0.94** (rxn 0.87→0.97), *E. coli* 0.95→0.97 with **fewer** unannotated reactions (198→173, the tighter gene-ratio prunes spurious multi-KO genes), *S. cerevisiae*/*C. merolae* held or improved. The three sweep tables showed `min_score_ratio_ko` produced identical output at 0.0/0.3/0.5 across all four organisms — a magic-number knob that does nothing; `min_score_ratio_g` is the real precision lever. Full numbers in [docs/kegg_hmm_cutoff_calibration.md](https://github.com/SysBioChalmers/raven-python/blob/develop/docs/studies/kegg_hmm_cutoff_calibration.md) (reproduce with `scripts/analyze_hmm_cutoffs.py`). Back-portable to RAVEN. |
 
 ## FSEOF (Phase 5 — implemented, redesigned)

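The external merge sort that K14 describes (sorted runs of at most `chunk_rows` rows spooled to gzipped temp files, then merged with `heapq.merge`) can be sketched as follows. This is a simplified illustration, not the repository's implementation: `external_sort`, `_write_run` and `_read_run` are hypothetical names, and rows are assumed to be `(organism, gene, ko)` string tuples with no embedded tabs or newlines.

```python
import gzip
import heapq
import itertools
import tempfile
from pathlib import Path


def _write_run(rows, path):
    """Sort one in-memory chunk and spool it to a gzipped TSV run file."""
    rows.sort()  # tuple order: organism first, then gene, then KO
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in rows:
            fh.write("\t".join(row) + "\n")


def _read_run(path):
    """Stream one sorted run back as tuples, one row at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield tuple(line.rstrip("\n").split("\t"))


def external_sort(rows, out_path, chunk_rows=100_000):
    """Sort an arbitrarily long row stream holding at most chunk_rows rows in memory."""
    written = 0
    rows = iter(rows)
    with tempfile.TemporaryDirectory() as tmp:
        runs = []
        while True:
            chunk = [tuple(r) for r in itertools.islice(rows, chunk_rows)]
            if not chunk:
                break
            run_path = Path(tmp) / f"run{len(runs)}.tsv.gz"
            _write_run(chunk, run_path)
            runs.append(run_path)
        # heapq.merge keeps only one pending row per run in memory.
        with gzip.open(out_path, "wt", encoding="utf-8") as out:
            for row in heapq.merge(*(_read_run(p) for p in runs)):
                out.write("\t".join(row) + "\n")
                written += 1
    return written
```

Peak memory is bounded by one chunk during the spooling phase and by one row per run during the merge, which is what keeps the memory profile flat regardless of input size.
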
diff --git a/data/manifest.example.json b/data/manifest.example.json
@@ -1,20 +1,21 @@
 {
   "manifest_version": 1,
-  "generated": "2026-05-30",
+  "generated": "2026-06-10",
   "data": {
     "kegg": {
       "version": "kegg116",
       "description": "KEGG reference model, KO/reaction tables, and prokaryote/eukaryote HMM libraries for getKEGGModelForOrganism.",
       "license": "Derived from the KEGG database; redistributed with permission from KEGG.",
-      "doi": "10.5281/zenodo.0000000",
-      "source": "https://github.com/SysBioChalmers/raven-python/releases/tag/kegg-kegg116",
+      "source": "https://github.com/SysBioChalmers/raven-python/releases/tag/v0.1.0",
       "files": {
-        "reference_model.yml.gz":   { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/reference_model.yml.gz",   "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-        "ko_reaction.tsv.gz":       { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/ko_reaction.tsv.gz",       "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-        "ko_names.tsv.gz":          { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/ko_names.tsv.gz",          "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-        "organism_gene_ko.tsv.xz":  { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/organism_gene_ko.tsv.xz",  "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-        "rxn_flags.tsv.gz":         { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/rxn_flags.tsv.gz",         "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-        "prokaryotes.hmm":          { "url": "https://zenodo.org/records/0000000/files/prokaryotes.hmm",                                            "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 }
+        "kegg116_reference_model.yml.gz":  { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_reference_model.yml.gz",  "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "kegg116_ko_reaction.tsv.gz":      { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_ko_reaction.tsv.gz",      "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "kegg116_ko_names.tsv.gz":         { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_ko_names.tsv.gz",         "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "kegg116_organism_gene_ko.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_organism_gene_ko.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "kegg116_rxn_flags.tsv.gz":        { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_rxn_flags.tsv.gz",        "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "kegg116_taxonomy.gz":             { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_taxonomy.gz",             "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "kegg116_prokaryotes.hmm.gz":      { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_prokaryotes.hmm.gz",      "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "kegg116_eukaryotes.hmm.gz":       { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_eukaryotes.hmm.gz",       "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 }
       }
     }
   },

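The example manifest above keeps all-zero `sha256` values and `"bytes": 0` as placeholders, and every file key follows the version-prefixed `kegg116_<name>.gz` convention with a matching release URL. A check along the following lines could flag placeholders or misnamed entries before a manifest is published. It is only a sketch: `manifest_problems` is a hypothetical helper, not a function in raven-python, and it assumes the manifest layout shown in this diff.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

ZERO_SHA256 = "0" * 64  # placeholder digest used in manifest.example.json


def manifest_problems(manifest: dict) -> list[str]:
    """Return human-readable problems found in a manifest dict (empty if none)."""
    problems = []
    for dataset in manifest.get("data", {}).values():
        prefix = dataset["version"] + "_"  # e.g. "kegg116_"
        for name, entry in dataset.get("files", {}).items():
            if not name.startswith(prefix):
                problems.append(f"{name}: missing version prefix {prefix!r}")
            if not name.endswith(".gz"):
                problems.append(f"{name}: not a gzip artefact")
            # The last URL path component should equal the manifest key.
            url_name = PurePosixPath(urlparse(entry["url"]).path).name
            if url_name != name:
                problems.append(f"{name}: URL points at {url_name!r}")
            if entry["sha256"] == ZERO_SHA256 or entry["bytes"] == 0:
                problems.append(f"{name}: placeholder sha256/bytes")
    return problems
```

Run against the example manifest, every entry would be reported as a placeholder, which is the intended signal that the file is a template rather than a publishable registry.
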
diff --git a/data/manifest.json b/data/manifest.json
@@ -1,6 +1,55 @@
 {
   "manifest_version": 1,
-  "generated": "2026-05-30",
-  "data": {},
+  "generated": "2026-06-10",
+  "data": {
+    "kegg": {
+      "version": "kegg116",
+      "description": "KEGG reference model, KO/reaction tables, taxonomy, and prokaryote/eukaryote HMM libraries for getKEGGModelForOrganism.",
+      "license": "Derived from the KEGG database; redistributed with permission from KEGG.",
+      "source": "https://github.com/SysBioChalmers/raven-python/releases/tag/v0.1.0",
+      "files": {
+        "kegg116_eukaryotes.hmm.gz": {
+          "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_eukaryotes.hmm.gz",
+          "sha256": "2d48bc9935575d0f9ba4178bf2df19279bff866b49c1bf83a8e15787b11d6708",
+          "bytes": 134002309
+        },
+        "kegg116_ko_names.tsv.gz": {
+          "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_ko_names.tsv.gz",
+          "sha256": "84f9c7150172d948f794d91a6608d55f7140f31e53249c705057ae49b11c93b3",
+          "bytes": 14585
+        },
+        "kegg116_ko_reaction.tsv.gz": {
+          "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_ko_reaction.tsv.gz",
+          "sha256": "e1a4ac22875bd3030d03b78368b0153b6d99000acb2ee0f474340a03c180323c",
+          "bytes": 49196
+        },
+        "kegg116_organism_gene_ko.tsv.gz": {
+          "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_organism_gene_ko.tsv.gz",
+          "sha256": "27bf7dd58eb1acd5904990dc2be187aae4d8d9b9f7421375618e7c8d6ff7253d",
+          "bytes": 47935249
+        },
+        "kegg116_prokaryotes.hmm.gz": {
+          "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_prokaryotes.hmm.gz",
+          "sha256": "d80cb2a22dec9fd8336b3998e3b96ee121672f63f4041cddaf09624fe739f1af",
+          "bytes": 153173750
+        },
+        "kegg116_reference_model.yml.gz": {
+          "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_reference_model.yml.gz",
+          "sha256": "73ff313fe2aa2830ec511f4e522226c98c5714c2d5c4632844544e5a409c7f0c",
+          "bytes": 1090563
+        },
+        "kegg116_rxn_flags.tsv.gz": {
+          "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_rxn_flags.tsv.gz",
+          "sha256": "c4c134effc9edeeb74b925ae8616320af162edbaad3a9b44dcc29d2c4d12db9b",
+          "bytes": 33289
+        },
+        "kegg116_taxonomy.gz": {
+          "url": "https://github.com/SysBioChalmers/raven-python/releases/download/v0.1.0/kegg116_taxonomy.gz",
+          "sha256": "1edc56da94d71433e5f08c133600292c311baaf33279a959518ab08389b0e538",
+          "bytes": 234693
+        }
+      }
+    }
+  },
   "binaries": {}
 }
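Each entry in the populated manifest carries a `sha256` digest and a `bytes` size, which is what allows a downloaded artefact to be verified before it is cached. The sketch below shows one way such a check might look. `sha256_of` and `verify_artefact` are hypothetical names, and this is not necessarily how `ensure_kegg_data` performs its verification.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large HMM libraries never load whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_artefact(path, entry):
    """Raise ValueError if a file does not match its manifest entry."""
    # Cheap size check first: avoids hashing an obviously truncated download.
    size = Path(path).stat().st_size
    if size != entry["bytes"]:
        raise ValueError(f"{path}: expected {entry['bytes']} bytes, got {size}")
    actual = sha256_of(path)
    if actual != entry["sha256"]:
        raise ValueError(f"{path}: sha256 mismatch ({actual})")
```

Checking the size before the digest is a small design choice: a truncated 150 MB HMM download fails immediately instead of after a full read.
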