Skip to content

Cobra-aligned hardening pass from full code review#18

Merged
edkerk merged 9 commits into
developfrom
fix/cobra-aligned-hardening
Jun 8, 2026
Merged

Cobra-aligned hardening pass from full code review#18
edkerk merged 9 commits into
developfrom
fix/cobra-aligned-hardening

Conversation

@edkerk

@edkerk edkerk commented Jun 7, 2026

Copy link
Copy Markdown
Member

What

Cobra-aligned hardening pass distilled from a full code review of the port:

  • Raise OptimizationError on infeasible/error solver status instead of returning silently; switch stray print to the module logger.
  • Route subsystem coercion through a single utils.parse.subsystem_to_str (consistent handling of str / list / nested).
  • Extra guards: safe zip traversal, penalty > 0, NaN / z-score handling, version-from-metadata fallback.
  • Fixes a bug the review itself missed: comparison.compare counted only the first of a reaction's multiple subsystems.

Base / dependency

Builds on feat/yeast-gem-shared (PR #17 — the shared yeast-GEM modules). The first 7 commits here are #17's; the final commit (8930b10) is the hardening pass. Recommend merging #17 first, after which this reduces to the single hardening commit (rebase onto develop).

edkerk added 9 commits May 31, 2026 01:03
Lands the upstream-shareable pieces that yeast-GEM has been implementing
locally during its Python port (see yeast-GEM/code/python/PORTING_PLAN.md
and UPSTREAM_CANDIDATES.md). These are organism-agnostic; yeast-GEM
will consume them via a Python dependency on raven-python.

New modules
-----------
raven_python.comparison.diff
    diff_models(a, b, ...) -> DiffReport — strict two-model semantic-
    equality diff. Complements the existing compare_models (N-model
    presence-matrix overview). Used as a CI gate to verify that two
    toolchains (e.g. MATLAB RAVEN vs raven_python, pre/post refactor
    of one toolchain) produce equivalent models. Includes a
    python -m raven_python.comparison.diff CLI.

raven_python.annotation.sbo
    add_sbo_terms — SBO term assignment with "fill" semantic. Default
    parameter set reproduces yeast-GEM's behaviour; biomass metabolite
    names, biomass/NGAM reaction names, and pseudoreaction substrings
    are overridable. Transport detection is pluggable (default: same-
    met-name in two compartments). Includes an `only_last_reaction_
    for_pseudo` legacy bug-compat flag for yeast-GEM's lock-step
    migration; off by default for any new caller.

raven_python.annotation.delta_g
    load_delta_g_csv / save_delta_g_csv — generic side-car CSV
    persistence for scalar notes (ΔG by default, but the note key,
    column names, and id/value mapping are all configurable).

raven_python.conditions.apply
    apply_condition(model, yaml_or_dict) — generic "apply this YAML
    condition" loader. Schema: prelude (reset_exchanges),
    cofactor_pseudoreaction (remove_mets + charge_balance_met),
    biomass_stoichiometry_delta, per-rxn bounds, expected_uptake_count.
    Project-specific extensions (e.g. yeast-GEM's amino_acid_ratio)
    are handled by the caller before/after this function — kept
    upstream-narrow on purpose. Also exposes set_reaction_bounds
    helper that bypasses cobra's lb<=ub validator for the (legitimate)
    cases where a condition lands on an infeasible bound state.

Tests
-----
46 new tests across the three modules; full pre-existing raven-python
suite still passes (519 passed; 1 unrelated pre-existing openpyxl
ImportError in tests/test_io_git.py; 2 skipped). ruff clean.

Not in this commit
------------------
The biomass / GAM / chemostat / fit_gam modules are still tracked as
upstream candidates in yeast-GEM/code/python/UPSTREAM_CANDIDATES.md
and remain local in yeast-GEM until phase 4 of the port (which would
ideally land them directly here).
Generic biomass-equation manipulation, extracted from yeast-GEM's
sumBioMass / scaleBioMass / rescalePseudoReaction / changeGAM as
yeast-GEM moves to depend on raven-python (yeast-GEM phase 4 of the
porting plan).

Module layout
-------------
raven_python.biomass.config
    BiomassConfig — biomass_rxn id + proton_met id + ordered tuple
    of BiomassComponent (per-component pseudoreaction name + mass-
    computation strategy).

raven_python.biomass.scale
    sum_biomass(model, config) → {component: g/gDW, ..., "total": X}
    rescale_pseudoreaction(model, config, name, factor) — multiply
        the pseudoreaction's substrate coefs by factor and rebalance
        H+ to keep ionic charge at zero.
    scale_biomass(model, config, name, new_value, balance_out=None) —
        rescale a component to a target fraction, optionally balancing
        a second component so the total stays at 1 g/gDW.

raven_python.biomass.gam
    set_gam(model, value, *, biomass_rxn, cofactor_met_names,
            ngam_rxn=None, ngam_value=None) — scales every metabolite
    in the biomass pseudoreaction whose `name` is in the supplied
    cofactor set, preserving its sign; optionally fixes the NGAM rxn
    bounds.

Mass strategies (per BiomassComponent.mass_strategy):
    "mw"               plain MW from chemical formula (carbohydrate /
                       ion / cofactor)
    "mw_minus_2h"      MW − 2.016 g/mol per substrate (protein —
                       charged tRNAs release two protons)
    "mw_minus_water"   MW − 18.015 g/mol per substrate (RNA / DNA —
                       polymerisation releases one water)
    "grams"            stoichiometry already in g/gDW (lipid backbone)

Tests: 19 new tests over a synthetic toy model that exercises every
mass strategy, the H+ charge rebalance, scale_biomass with and
without balance_out, set_gam on cofactor mets (and the NGAM bound
path).
…iant)

Detection-only counterpart to remove_duplicate_reactions. Returns
duplicate groups instead of mutating the model. Ignores bounds /
GPR / objective — only stoichiometry is compared, mirroring the
typical curation use case ("find reactions that could be merged").

A new ``ignore_direction=True`` default (yeast-GEM convention)
treats A→B and B→A as duplicates. Set False to require identical
orientation.

Used by yeast-GEM's modelTests port (Tier 3 / phase 5) to flag
duplicate reactions during curation review.
Generic batch curation engine extracted from yeast-GEM's MATLAB
curateMetsRxnsGenes (yeast-GEM phase 6). Adds or updates
metabolites, reactions and genes from pandas DataFrames; a
batch_curate_from_tsv convenience wrapper reads the equivalent TSVs.

Schema (matches yeast-GEM's data/modelCuration/template/ layout):

  mets_df          metNames, comps, formula, charge, inchi, metNotes
                   + any number of MIRIAM-namespace columns
  genes_df         genes, geneShortNames + MIRIAM columns
  rxns_df          rxnNames, grRules, lb, ub, rev, subSystems,
                   eccodes, rxnNotes, rxnReferences,
                   rxnConfidenceScores + MIRIAM columns
  rxns_coeffs_df   rxnNames, metNames, comps, coefficient
                   (one row per (reaction, metabolite) pair)

Match keys:
  metabolites — (name, compartment) tuple
  genes       — gene id
  reactions   — stoichiometric signature

Existing entities get their annotations overwritten (warning emitted);
new entities are added with fresh ids generated from the supplied
``met_id_prefix`` / ``rxn_id_prefix`` (defaults M_ / R_ per the BiGG
convention; yeast-GEM passes s_ / r_). Width of the existing
zero-padded suffix is preserved so s_0001 → s_0002, not s_2.

"Everything after the core columns is MIRIAM" — the header of any
extra column becomes the annotation namespace key. Matches MATLAB
behaviour exactly so yeast-GEM's existing TSVs work unchanged, and
projects with different MIRIAM column sets need no code change.

CurationResult dataclass records what was added vs updated so
callers can verify in tests / CI.

Tests: 13 new (add/update mets, add/update genes, add/update rxns
by stoichiometry, miriam auto-detect, id-width preservation,
combined mets+rxns in one call, missing-metabolite error,
batch_curate_from_tsv round trip).
Three things this fixes:

  1. write_yaml_model dropped the !!omap tags entirely. _to_plain
     was flattening cobra's OrderedDict to plain dict, which causes
     ruamel to emit ordinary block mappings. RAVEN MATLAB's reader
     is a line-based parser keyed on !!omap and therefore could not
     load any file we wrote. _to_plain now returns OrderedDict so
     ruamel re-emits the !!omap tag.

  2. eccodes was lost on round-trip — it wasn't in _RXN_FIELDS, so
     read_yaml_model didn't capture it into .notes and
     write_yaml_model couldn't lift it back. Added.

  3. RAVEN MATLAB writes reaction notes as 'rxnNotes'; cobrapy and
     this writer use 'notes'. Added a read-time alias so existing
     yeast-GEM YAML files (which still say 'rxnNotes') load
     cleanly. Writes go out as 'notes' (cobrapy-canonical).

Top-level layout now matches RAVEN MATLAB: metaData first, then
metabolites / reactions / genes / compartments, then optional
gecko_light + ec-rxns + ec-enzymes. id/name/version live inside
metaData (RAVEN convention) — cobrapy reading these files still
works, but cobra_model.id ends up None because cobrapy doesn't
know about metaData. raven_python.read_yaml_model lifts both
metaData.id/name/version onto model.id / model.name /
model.notes['version'] so the rest of the codebase doesn't care
which layout the file used.

Empty-name genes are no longer emitted as  — that's a
cobrapy quirk that drifts yeast-GEM YAML files away from RAVEN
MATLAB's output.

Verified end-to-end:

  *  cobra.io.load_yaml_model reads every file the new writer
     produces (yeast-GEM and a synthetic fixture).
  *  RAVEN MATLAB readYAMLmodel reads every file the new writer
     produces.
  *  Round-tripping yeast-GEM through raven_python preserves
     2748/2748 metabolites, 4102/4102 reactions, 1143/1143 genes,
     2411 eccodes, 3984 reaction deltaG, 2696 metabolite deltaG,
     1788 SMILES, 1443 rxn-notes — no semantic drift.

Tests
-----
  *  tests/test_io_yaml_parity.py is new: covers every RAVEN
     extension, the rxnNotes legacy alias, the SMILES YAML-special
     character case, metaData-first layout, and cobra readability.
  *  tests/test_io_yaml.py::test_output_is_cobra_readable adjusts
     for the metaData layout (cobra recovers mets/rxns/annotation
     but not model.id, by design).
PyYAML is not a project dependency; raven-python uses ruamel.yaml
(already pulled in via cobra) everywhere else. The conditions
module and its tests still imported PyYAML, which broke pytest
collection on clean CI runners with 'No module named yaml'.

Both apply.py and the test now use a YAML(typ='safe') instance
from ruamel.yaml — same plain-dict semantics as PyYAML's
safe_load / safe_dump, no new dependency.
Adds docs/reference/yaml_format.md as the canonical schema reference
for the cross-toolchain YAML format (cobrapy / raven-python / RAVEN
MATLAB). Covers the full document shape, per-entry field order,
RAVEN extensions, the GECKO ec-* sections, the metaData provenance
block, number / string / quoting rules, and the cross-reader
interoperability matrix. Linked from docs/reference/index.md and
the I/O guide.

Reader fix: pre-shim RAVEN MATLAB writes emitted GECKO models
with geckoLight: "true" inside the metaData block (not as a
top-level gecko_light). The reader now lifts that legacy key out
of metaData so model.ec.gecko_light is populated whichever
placement the file used. Round-trip writes always use the new
top-level form.

Regression tests:

  test_pre_shim_format_loads — synthetic fixture covering every
  legacy quirk we know about (--- doc marker, plain metaData,
  geckoLight inside metaData, top-level metabolite smiles,
  rxnNotes reaction key, integer bounds, double-quoted strings).
  Each quirk has its own assertion + comment.

  test_pre_shim_yeast_gem_loads_if_available — sanity-loads the
  real yeast-GEM.yml (2748 mets, 4102 rxns, 1143 genes) and
  asserts the documented preserved-counts table from the format
  reference. Skipped on CI runners where the working copy isn't
  mounted.
No behaviour change on well-formed inputs. Highlights:

- Packaging: derive __version__ from package metadata (was a stale
  hard-coded "0.0.1" that the docs site reported); pin ruff==0.15.15 in
  the dev extra and CI; fix two lint errors unpinned ruff started flagging.
- Errors: solver/feasibility failures in run_init, run_ftinit, fill_tasks
  and random_sampling now raise cobra.exceptions.OptimizationError instead
  of bare RuntimeError (consistent with the rest of the package).
- Consistency: single utils.parse.subsystem_to_str coerces reaction
  subsystem to cobra's canonical str across io.excel / comparison.compare /
  curation.batch / manipulation.add (fixes a crash on non-string items and
  the silent drop of multi-subsystem reactions); shared GPR score
  aggregators in utils.gpr used by init.score and init.genes; KEGG-download
  progress uses a module logger instead of print.
- Robustness: zip path-traversal guard in binaries.py; penalty>0 check in
  connect_blocked_reactions; NaN-sample guard in random_sampling; all-zero
  ec coupling warning; optional verify= SHA256 re-check on data cache hits;
  non-finite z-score guard in reporter. Regression tests added for each.
@edkerk edkerk merged commit f44d295 into develop Jun 8, 2026
5 checks passed
@edkerk edkerk deleted the fix/cobra-aligned-hardening branch June 8, 2026 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant