Skip to content

Harden curation, EC-data and archive-handling modules#23

Merged
edkerk merged 1 commit into
developfrom
fix/harden-new-modules
Jun 10, 2026
Merged

Harden curation, EC-data and archive-handling modules#23
edkerk merged 1 commit into
developfrom
fix/harden-new-modules

Conversation

@edkerk

@edkerk edkerk commented Jun 9, 2026

Copy link
Copy Markdown
Member

Tier-2 of the raven-python review: a correctness/security audit of the modules that
post-dated the prior review — curation/batch.py, io/ec_data.py, and the
download/extract pair binaries.py / data.py. Four verified, targeted fixes; no
behaviour change on well-formed inputs.

Fixes

  • curation/batch.py — when building a new reaction, a list-valued subSystems
    (a reaction can belong to several) was rendered with str(list) — a bracketed Python
    repr — whereas the update path already used subsystem_to_str (;-joined). New
    reactions now use subsystem_to_str too, so the two paths agree.
  • io/ec_data.py_eccodes_to_yaml returned its raw input for the ≤1-EC case, so
    a stray trailing ; (or a lone separator) leaked into the written YAML. It now returns
    the cleaned ";".join(parts).
  • binaries.py_safe_extract_zip guarded .. / absolute paths but not symlink
    members, whose target the path check never validates. It now rejects symlink members
    (defence-in-depth; SHA256 + HTTPS already gate the archive).
  • binaries.py / data.py — the download urlopen calls had no timeout, so a
    stalled server could hang the process indefinitely. Both now pass a socket timeout.

Tests

One regression test per fix: list-valued subSystems join, _eccodes_to_yaml edge
cases, ZIP symlink rejection, and a urlopen-timeout spy.

Deliberately not changed

The audit also flagged duplicate rxnNames as a silent overwrite. On inspection,
by_signature is built once before the apply loop and not mutated during it, so that
path either creates distinct reactions or hits the existing "overwritten" warning — not a
silent loss. Left as-is.

Verification

pytest (622 passed), ruff check, and mypy (ad-hoc, on the changed sources) all
green. CI mypy enforcement arrives separately with #22.

Tier-2 audit of the post-review modules surfaced four targeted fixes:

- curation/batch.py: new reactions coerce a list-valued subSystems via
  subsystem_to_str (";"-joined) instead of str(list), matching the update path.
- io/ec_data.py: _eccodes_to_yaml strips stray separators in the single-EC
  case so a trailing ";" never leaks into the written YAML.
- binaries.py: _safe_extract_zip rejects symlink members, defence-in-depth
  alongside the existing path-traversal guard.
- binaries.py / data.py: archive and dataset downloads pass a socket timeout
  to urlopen so a stalled server cannot hang the process.

Adds regression tests for each fix.
@edkerk edkerk merged commit 686e0ed into develop Jun 10, 2026
5 checks passed
@edkerk edkerk deleted the fix/harden-new-modules branch June 10, 2026 07:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant