Skip to content

Stream NETMHCSTABANDPAN per sample instead of waiting on full SV channel#247

Open
johnoooh wants to merge 2 commits into
developfrom
feature/sv-channel-padding
Open

Stream NETMHCSTABANDPAN per sample instead of waiting on full SV channel#247
johnoooh wants to merge 2 commits into
developfrom
feature/sv-channel-padding

Conversation

@johnoooh
Copy link
Copy Markdown
Collaborator

Problem

PR #246 added remainder: true to the SV-fasta join in
NETMHCSTABANDPAN.createNETMHCInput so samples without SV data wouldn't
be silently dropped. The fix was correct, but remainder: true buffers
each unmatched item until the right-side channel closes — which
doesn't happen until GENERATE_MUTATED_PEPTIDES finishes emitting,
which is gated on the slowest GENERATEMUTFASTA across the whole
cohort.

Visible symptom in neoantigenpipeline: no sample's netMHC tasks start
until every sample's GENERATEMUTFASTA has completed.

Fix

Move the "keep samples without SV" responsibility upstream into
GENERATE_MUTATED_PEPTIDES:

  • Branch ch_sv into with_sv (runs NEOSV as before) and without_sv
    (emits [meta, []] placeholders).
  • Mix both into sv_mut_fasta / sv_wt_fasta so each sample has exactly
    one entry on the SV outputs.

NETMHCSTABANDPAN.createNETMHCInput can then go back to a plain inner
.join(sv_fastas_channel, by:0). Every sample matches; items flow
per-sample as soon as their fastas are ready.

Verification

Ran neoantigenpipeline (-profile test,docker) with a 2-sample sheet
(one full-size MAF, one 1-variant MAF) before and after the fix and
compared task timestamps from .nextflow.log:

Run sample_tiny GENERATEMUTFASTA done First NETMHC submitted Gap
Baseline (remainder: true) 14:18:46.074 14:19:00.076 ~14.0 s
Fixed 13:43:43.426 13:43:43.549 ~0.12 s

The 14 s gap in baseline maps almost exactly to the runtime of the
other sample's NETMHC3 task — that's the cross-sample serialization
this PR removes.

The existing "empty SV channel" regression test from #246 still passes
— the subworkflow continues to tolerate an empty SV channel as a
defensive contract, even though callers are now expected to pre-pad.

Followup

Once merged, neoantigenpipeline needs the usual modules.json bump +
re-sync of the two subworkflow files (same shape as commit 257659f in
the pipeline that pulled in #246).

  • This comment contains a description of changes (with reason).
  • Check to see if a nf-core module, or subworkflow is available and usable for your pipeline.
  • Feature branch is named feature/<module_name> for modules, or feature/<subworkflow_name> for subworkflows. For modules, if there is a subcommand use: feature/<module_name>/<module_subcommand>.
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs.
  • Use nf-core data if possible for nf-tests. If not, use or add test data to mskcc-omics-workflows/test-datasets, following the repository guidelines, for nf-tests. Finally, if neither option is feasible, only add a stub nf-test.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label.
  • Use Jfrog if possible to fulfill software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules --git-remote https://github.com/mskcc-omics-workflows/modules.git -b <module_branch> test <MODULE> --profile docker
      • nf-core modules --git-remote https://github.com/mskcc-omics-workflows/modules.git -b <module_branch> test <MODULE> --profile singularity
      • nf-core modules --git-remote https://github.com/mskcc-omics-workflows/modules.git -b <module_branch> test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows --git-remote https://github.com/mskcc-omics-workflows/modules.git -b <subworkflow_branch> test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows --git-remote https://github.com/mskcc-omics-workflows/modules.git -b <subworkflow_branch> test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows --git-remote https://github.com/mskcc-omics-workflows/modules.git -b <subworkflow_branch> test <SUBWORKFLOW> --profile conda

PR #246 added remainder: true to the SV-fasta join in
NETMHCSTABANDPAN.createNETMHCInput so samples without SV data wouldn't
be silently dropped. The fix is correct, but remainder: true buffers
each unmatched item until the SV channel closes — which doesn't happen
until GENERATE_MUTATED_PEPTIDES finishes emitting (gated on the slowest
GENERATEMUTFASTA across the whole cohort). The visible symptom is that
no sample's netMHC tasks start until every sample's GENERATEMUTFASTA
has completed.

Move the "keep samples without SV" responsibility upstream into
GENERATE_MUTATED_PEPTIDES: branch ch_sv into with_sv (runs NEOSV) and
without_sv (emits [meta, []] placeholders), and mix both into
sv_mut_fasta/sv_wt_fasta. NETMHCSTABANDPAN can then use a plain inner
.join — every sample matches and items flow per-sample as soon as
their fastas are ready.

Verified end-to-end against neoantigenpipeline with a 2-sample run:
sample_tiny's first NETMHC submission lands ~120 ms after its own
GENERATEMUTFASTA completes, vs ~14 s previously (where it sat behind
the other sample's netMHC task). The existing "empty SV channel"
regression test from PR #246 still passes — the subworkflow continues
to tolerate an empty SV channel as a defensive contract.
@johnoooh johnoooh requested review from a team and nikhil as code owners May 27, 2026 17:54
The "empty SV channel" stub test from PR #246 was failing with NPE on
this branch: with remainder: true removed, channel.empty() produces zero
join matches and workflow.out.tsv is empty, so the snapshot assertion
indexed into null.

Under the new contract callers (e.g. GENERATE_MUTATED_PEPTIDES) pre-pad
sv_fastas with [meta, [], []] for samples without SV data — channel.empty()
is no longer a realistic input shape. Rename the test to reflect the new
contract and pass a padded placeholder; the assertion now covers all 4
emitted tsv outputs (MUT/WT × PAN/STAB).

Verified locally against an isolated copy of the test in stub mode:
PASSED with the expected 4 output filenames. Snapshot crafted to match
the structurally identical existing "netmhcstabandpan - fa,hla_str - tsv
- stub" test, since stub mode produces identical meta and filenames
regardless of SV input shape.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants