Cobra-aligned hardening pass from full code review by edkerk · Pull Request #18 · SysBioChalmers/raven-python

edkerk · 2026-06-07T22:02:20Z

What

Cobra-aligned hardening pass distilled from a full code review of the port:

Raise OptimizationError on infeasible/error solver status instead of returning silently; switch stray print to the module logger.
Route subsystem coercion through a single utils.parse.subsystem_to_str (consistent handling of str / list / nested).
Extra guards: safe zip traversal, penalty > 0, NaN / z-score handling, version-from-metadata fallback.
Fixes a bug the review itself missed: comparison.compare counted only the first of a reaction's multiple subsystems.

Base / dependency

Builds on feat/yeast-gem-shared (PR #17 — the shared yeast-GEM modules). The first 7 commits here are #17's; the final commit (8930b10) is the hardening pass. Recommend merging #17 first, after which this reduces to the single hardening commit (rebase onto develop).

Lands the upstream-shareable pieces that yeast-GEM has been implementing locally during its Python port (see yeast-GEM/code/python/PORTING_PLAN.md and UPSTREAM_CANDIDATES.md). These are organism-agnostic; yeast-GEM will consume them via a Python dependency on raven-python. New modules ----------- raven_python.comparison.diff diff_models(a, b, ...) -> DiffReport — strict two-model semantic- equality diff. Complements the existing compare_models (N-model presence-matrix overview). Used as a CI gate to verify that two toolchains (e.g. MATLAB RAVEN vs raven_python, pre/post refactor of one toolchain) produce equivalent models. Includes a python -m raven_python.comparison.diff CLI. raven_python.annotation.sbo add_sbo_terms — SBO term assignment with "fill" semantic. Default parameter set reproduces yeast-GEM's behaviour; biomass metabolite names, biomass/NGAM reaction names, and pseudoreaction substrings are overridable. Transport detection is pluggable (default: same- met-name in two compartments). Includes an `only_last_reaction_ for_pseudo` legacy bug-compat flag for yeast-GEM's lock-step migration; off by default for any new caller. raven_python.annotation.delta_g load_delta_g_csv / save_delta_g_csv — generic side-car CSV persistence for scalar notes (ΔG by default, but the note key, column names, and id/value mapping are all configurable). raven_python.conditions.apply apply_condition(model, yaml_or_dict) — generic "apply this YAML condition" loader. Schema: prelude (reset_exchanges), cofactor_pseudoreaction (remove_mets + charge_balance_met), biomass_stoichiometry_delta, per-rxn bounds, expected_uptake_count. Project-specific extensions (e.g. yeast-GEM's amino_acid_ratio) are handled by the caller before/after this function — kept upstream-narrow on purpose. Also exposes set_reaction_bounds helper that bypasses cobra's lb<=ub validator for the (legitimate) cases where a condition lands on an infeasible bound state. Tests ----- 46 new tests across the three modules; full pre-existing raven-python suite still passes (519 passed; 1 unrelated pre-existing openpyxl ImportError in tests/test_io_git.py; 2 skipped). ruff clean. Not in this commit ------------------ The biomass / GAM / chemostat / fit_gam modules are still tracked as upstream candidates in yeast-GEM/code/python/UPSTREAM_CANDIDATES.md and remain local in yeast-GEM until phase 4 of the port (which would ideally land them directly here).

Generic biomass-equation manipulation, extracted from yeast-GEM's sumBioMass / scaleBioMass / rescalePseudoReaction / changeGAM as yeast-GEM moves to depend on raven-python (yeast-GEM phase 4 of the porting plan). Module layout ------------- raven_python.biomass.config BiomassConfig — biomass_rxn id + proton_met id + ordered tuple of BiomassComponent (per-component pseudoreaction name + mass- computation strategy). raven_python.biomass.scale sum_biomass(model, config) → {component: g/gDW, ..., "total": X} rescale_pseudoreaction(model, config, name, factor) — multiply the pseudoreaction's substrate coefs by factor and rebalance H+ to keep ionic charge at zero. scale_biomass(model, config, name, new_value, balance_out=None) — rescale a component to a target fraction, optionally balancing a second component so the total stays at 1 g/gDW. raven_python.biomass.gam set_gam(model, value, *, biomass_rxn, cofactor_met_names, ngam_rxn=None, ngam_value=None) — scales every metabolite in the biomass pseudoreaction whose `name` is in the supplied cofactor set, preserving its sign; optionally fixes the NGAM rxn bounds. Mass strategies (per BiomassComponent.mass_strategy): "mw" plain MW from chemical formula (carbohydrate / ion / cofactor) "mw_minus_2h" MW − 2.016 g/mol per substrate (protein — charged tRNAs release two protons) "mw_minus_water" MW − 18.015 g/mol per substrate (RNA / DNA — polymerisation releases one water) "grams" stoichiometry already in g/gDW (lipid backbone) Tests: 19 new tests over a synthetic toy model that exercises every mass strategy, the H+ charge rebalance, scale_biomass with and without balance_out, set_gam on cofactor mets (and the NGAM bound path).

…iant) Detection-only counterpart to remove_duplicate_reactions. Returns duplicate groups instead of mutating the model. Ignores bounds / GPR / objective — only stoichiometry is compared, mirroring the typical curation use case ("find reactions that could be merged"). A new ``ignore_direction=True`` default (yeast-GEM convention) treats A→B and B→A as duplicates. Set False to require identical orientation. Used by yeast-GEM's modelTests port (Tier 3 / phase 5) to flag duplicate reactions during curation review.

Generic batch curation engine extracted from yeast-GEM's MATLAB curateMetsRxnsGenes (yeast-GEM phase 6). Adds or updates metabolites, reactions and genes from pandas DataFrames; a batch_curate_from_tsv convenience wrapper reads the equivalent TSVs. Schema (matches yeast-GEM's data/modelCuration/template/ layout): mets_df metNames, comps, formula, charge, inchi, metNotes + any number of MIRIAM-namespace columns genes_df genes, geneShortNames + MIRIAM columns rxns_df rxnNames, grRules, lb, ub, rev, subSystems, eccodes, rxnNotes, rxnReferences, rxnConfidenceScores + MIRIAM columns rxns_coeffs_df rxnNames, metNames, comps, coefficient (one row per (reaction, metabolite) pair) Match keys: metabolites — (name, compartment) tuple genes — gene id reactions — stoichiometric signature Existing entities get their annotations overwritten (warning emitted); new entities are added with fresh ids generated from the supplied ``met_id_prefix`` / ``rxn_id_prefix`` (defaults M_ / R_ per the BiGG convention; yeast-GEM passes s_ / r_). Width of the existing zero-padded suffix is preserved so s_0001 → s_0002, not s_2. "Everything after the core columns is MIRIAM" — the header of any extra column becomes the annotation namespace key. Matches MATLAB behaviour exactly so yeast-GEM's existing TSVs work unchanged, and projects with different MIRIAM column sets need no code change. CurationResult dataclass records what was added vs updated so callers can verify in tests / CI. Tests: 13 new (add/update mets, add/update genes, add/update rxns by stoichiometry, miriam auto-detect, id-width preservation, combined mets+rxns in one call, missing-metabolite error, batch_curate_from_tsv round trip).

Three things this fixes: 1. write_yaml_model dropped the !!omap tags entirely. _to_plain was flattening cobra's OrderedDict to plain dict, which causes ruamel to emit ordinary block mappings. RAVEN MATLAB's reader is a line-based parser keyed on !!omap and therefore could not load any file we wrote. _to_plain now returns OrderedDict so ruamel re-emits the !!omap tag. 2. eccodes was lost on round-trip — it wasn't in _RXN_FIELDS, so read_yaml_model didn't capture it into .notes and write_yaml_model couldn't lift it back. Added. 3. RAVEN MATLAB writes reaction notes as 'rxnNotes'; cobrapy and this writer use 'notes'. Added a read-time alias so existing yeast-GEM YAML files (which still say 'rxnNotes') load cleanly. Writes go out as 'notes' (cobrapy-canonical). Top-level layout now matches RAVEN MATLAB: metaData first, then metabolites / reactions / genes / compartments, then optional gecko_light + ec-rxns + ec-enzymes. id/name/version live inside metaData (RAVEN convention) — cobrapy reading these files still works, but cobra_model.id ends up None because cobrapy doesn't know about metaData. raven_python.read_yaml_model lifts both metaData.id/name/version onto model.id / model.name / model.notes['version'] so the rest of the codebase doesn't care which layout the file used. Empty-name genes are no longer emitted as — that's a cobrapy quirk that drifts yeast-GEM YAML files away from RAVEN MATLAB's output. Verified end-to-end: * cobra.io.load_yaml_model reads every file the new writer produces (yeast-GEM and a synthetic fixture). * RAVEN MATLAB readYAMLmodel reads every file the new writer produces. * Round-tripping yeast-GEM through raven_python preserves 2748/2748 metabolites, 4102/4102 reactions, 1143/1143 genes, 2411 eccodes, 3984 reaction deltaG, 2696 metabolite deltaG, 1788 SMILES, 1443 rxn-notes — no semantic drift. Tests ----- * tests/test_io_yaml_parity.py is new: covers every RAVEN extension, the rxnNotes legacy alias, the SMILES YAML-special character case, metaData-first layout, and cobra readability. * tests/test_io_yaml.py::test_output_is_cobra_readable adjusts for the metaData layout (cobra recovers mets/rxns/annotation but not model.id, by design).

PyYAML is not a project dependency; raven-python uses ruamel.yaml (already pulled in via cobra) everywhere else. The conditions module and its tests still imported PyYAML, which broke pytest collection on clean CI runners with 'No module named yaml'. Both apply.py and the test now use a YAML(typ='safe') instance from ruamel.yaml — same plain-dict semantics as PyYAML's safe_load / safe_dump, no new dependency.

Adds docs/reference/yaml_format.md as the canonical schema reference for the cross-toolchain YAML format (cobrapy / raven-python / RAVEN MATLAB). Covers the full document shape, per-entry field order, RAVEN extensions, the GECKO ec-* sections, the metaData provenance block, number / string / quoting rules, and the cross-reader interoperability matrix. Linked from docs/reference/index.md and the I/O guide. Reader fix: pre-shim RAVEN MATLAB writes emitted GECKO models with geckoLight: "true" inside the metaData block (not as a top-level gecko_light). The reader now lifts that legacy key out of metaData so model.ec.gecko_light is populated whichever placement the file used. Round-trip writes always use the new top-level form. Regression tests: test_pre_shim_format_loads — synthetic fixture covering every legacy quirk we know about (--- doc marker, plain metaData, geckoLight inside metaData, top-level metabolite smiles, rxnNotes reaction key, integer bounds, double-quoted strings). Each quirk has its own assertion + comment. test_pre_shim_yeast_gem_loads_if_available — sanity-loads the real yeast-GEM.yml (2748 mets, 4102 rxns, 1143 genes) and asserts the documented preserved-counts table from the format reference. Skipped on CI runners where the working copy isn't mounted.

No behaviour change on well-formed inputs. Highlights: - Packaging: derive __version__ from package metadata (was a stale hard-coded "0.0.1" that the docs site reported); pin ruff==0.15.15 in the dev extra and CI; fix two lint errors unpinned ruff started flagging. - Errors: solver/feasibility failures in run_init, run_ftinit, fill_tasks and random_sampling now raise cobra.exceptions.OptimizationError instead of bare RuntimeError (consistent with the rest of the package). - Consistency: single utils.parse.subsystem_to_str coerces reaction subsystem to cobra's canonical str across io.excel / comparison.compare / curation.batch / manipulation.add (fixes a crash on non-string items and the silent drop of multi-subsystem reactions); shared GPR score aggregators in utils.gpr used by init.score and init.genes; KEGG-download progress uses a module logger instead of print. - Robustness: zip path-traversal guard in binaries.py; penalty>0 check in connect_blocked_reactions; NaN-sample guard in random_sampling; all-zero ec coupling warning; optional verify= SHA256 re-check on data cache hits; non-finite z-score guard in reporter. Regression tests added for each.

edkerk added 9 commits May 31, 2026 01:03

Merge branch 'develop' into fix/cobra-aligned-hardening

f24b45c

edkerk merged commit f44d295 into develop Jun 8, 2026
5 checks passed

edkerk deleted the fix/cobra-aligned-hardening branch June 8, 2026 07:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cobra-aligned hardening pass from full code review#18

Cobra-aligned hardening pass from full code review#18
edkerk merged 9 commits into
developfrom
fix/cobra-aligned-hardening

edkerk commented Jun 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

edkerk commented Jun 7, 2026

What

Base / dependency

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant