Harden curation, EC-data and archive-handling modules#23
Merged
Conversation
Tier-2 audit of the post-review modules surfaced four targeted fixes:
- curation/batch.py: new reactions coerce a list-valued subSystems via
subsystem_to_str (";"-joined) instead of str(list), matching the update path.
- io/ec_data.py: _eccodes_to_yaml strips stray separators in the single-EC
case so a trailing ";" never leaks into the written YAML.
- binaries.py: _safe_extract_zip rejects symlink members, defence-in-depth
alongside the existing path-traversal guard.
- binaries.py / data.py: archive and dataset downloads pass a socket timeout
to urlopen so a stalled server cannot hang the process.
Adds regression tests for each fix.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Tier-2 of the raven-python review: a correctness/security audit of the modules that
post-dated the prior review —
curation/batch.py,io/ec_data.py, and thedownload/extract pair
binaries.py/data.py. Four verified, targeted fixes; nobehaviour change on well-formed inputs.
Fixes
curation/batch.py— when building a new reaction, a list-valuedsubSystems(a reaction can belong to several) was rendered with
str(list)— a bracketed Pythonrepr — whereas the update path already used
subsystem_to_str(;-joined). Newreactions now use
subsystem_to_strtoo, so the two paths agree.io/ec_data.py—_eccodes_to_yamlreturned its raw input for the ≤1-EC case, soa stray trailing
;(or a lone separator) leaked into the written YAML. It now returnsthe cleaned
";".join(parts).binaries.py—_safe_extract_zipguarded../ absolute paths but not symlinkmembers, whose target the path check never validates. It now rejects symlink members
(defence-in-depth; SHA256 + HTTPS already gate the archive).
binaries.py/data.py— the downloadurlopencalls had no timeout, so astalled server could hang the process indefinitely. Both now pass a socket timeout.
Tests
One regression test per fix: list-valued
subSystemsjoin,_eccodes_to_yamledgecases, ZIP symlink rejection, and a
urlopen-timeout spy.Deliberately not changed
The audit also flagged duplicate
rxnNamesas a silent overwrite. On inspection,by_signatureis built once before the apply loop and not mutated during it, so thatpath either creates distinct reactions or hits the existing "overwritten" warning — not a
silent loss. Left as-is.
Verification
pytest(622 passed),ruff check, andmypy(ad-hoc, on the changed sources) allgreen. CI mypy enforcement arrives separately with #22.