Stream NETMHCSTABANDPAN per sample instead of waiting on full SV channel #247
Open
johnoooh wants to merge 2 commits into
PR #246 added remainder: true to the SV-fasta join in NETMHCSTABANDPAN.createNETMHCInput so samples without SV data wouldn't be silently dropped. The fix is correct, but remainder: true buffers each unmatched item until the SV channel closes, which doesn't happen until GENERATE_MUTATED_PEPTIDES finishes emitting (gated on the slowest GENERATEMUTFASTA across the whole cohort). The visible symptom is that no sample's netMHC tasks start until every sample's GENERATEMUTFASTA has completed.

Move the "keep samples without SV" responsibility upstream into GENERATE_MUTATED_PEPTIDES: branch ch_sv into with_sv (runs NEOSV) and without_sv (emits [meta, []] placeholders), and mix both into sv_mut_fasta/sv_wt_fasta. NETMHCSTABANDPAN can then use a plain inner .join: every sample matches and items flow per-sample as soon as their fastas are ready.

Verified end-to-end against neoantigenpipeline with a 2-sample run: sample_tiny's first NETMHC submission lands ~120 ms after its own GENERATEMUTFASTA completes, vs ~14 s previously (where it sat behind the other sample's netMHC task). The existing "empty SV channel" regression test from PR #246 still passes: the subworkflow continues to tolerate an empty SV channel as a defensive contract.
The "empty SV channel" stub test from PR #246 was failing with NPE on this branch: with remainder: true removed, channel.empty() produces zero join matches and workflow.out.tsv is empty, so the snapshot assertion indexed into null. Under the new contract, callers (e.g. GENERATE_MUTATED_PEPTIDES) pre-pad sv_fastas with [meta, [], []] for samples without SV data, so channel.empty() is no longer a realistic input shape.

Rename the test to reflect the new contract and pass a padded placeholder; the assertion now covers all 4 emitted tsv outputs (MUT/WT × PAN/STAB).

Verified locally against an isolated copy of the test in stub mode: PASSED with the expected 4 output filenames. Snapshot crafted to match the structurally identical existing "netmhcstabandpan - fa,hla_str - tsv - stub" test, since stub mode produces identical meta and filenames regardless of SV input shape.
Problem
PR #246 added remainder: true to the SV-fasta join in NETMHCSTABANDPAN.createNETMHCInput so samples without SV data wouldn't be silently dropped. The fix was correct, but remainder: true buffers each unmatched item until the right-side channel closes, which doesn't happen until GENERATE_MUTATED_PEPTIDES finishes emitting, which in turn is gated on the slowest GENERATEMUTFASTA across the whole cohort.
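The buffering behaviour can be illustrated with a small toy model. The Python sketch below is not Nextflow code and does not reproduce Nextflow's operator internals; it only assumes, as stated here, that a keyed join with remainder enabled holds unmatched items until its inputs have closed. The function name streaming_join, the event tuples, the file names, and sample_big are hypothetical.

```python
def streaming_join(events, remainder=False):
    """Toy keyed join over a time-ordered list of events.

    Each event is ("left", key, value), ("right", key, value) or
    ("close", side). Returns (step, key, left_value, right_value) tuples,
    where step is the index of the event at which the item was emitted.
    """
    pending = {"left": {}, "right": {}}   # unmatched items, per side
    closed = set()
    emitted = []
    for step, event in enumerate(events):
        if event[0] == "close":
            closed.add(event[1])
            # Assumption: leftovers are flushed only once both inputs closed.
            if remainder and closed == {"left", "right"}:
                for key, value in pending["left"].items():
                    emitted.append((step, key, value, None))
                for key, value in pending["right"].items():
                    emitted.append((step, key, None, value))
                pending = {"left": {}, "right": {}}
            continue
        side, key, value = event
        other = "right" if side == "left" else "left"
        if key in pending[other]:
            # Both sides have this key: emit immediately.
            match = pending[other].pop(key)
            pair = (value, match) if side == "left" else (match, value)
            emitted.append((step, key, pair[0], pair[1]))
        else:
            pending[side][key] = value
    return emitted


# Neither sample has SV data, so the right (SV) side never provides a match.
events = [
    ("left", "sample_tiny", "tiny.fa"),   # step 0: small sample ready early
    ("left", "sample_big", "big.fa"),     # step 1: slow sample finishes
    ("close", "left"),                    # step 2
    ("close", "right"),                   # step 3: SV channel finally closes
]
with_remainder = streaming_join(events, remainder=True)
without_remainder = streaming_join(events, remainder=False)
```

In this model sample_tiny is ready at step 0 but is not emitted until step 3 when remainder is enabled, and it is dropped entirely when remainder is disabled. Those are the two behaviours this PR is trading off.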
Visible symptom in neoantigenpipeline: no sample's netMHC tasks start until every sample's GENERATEMUTFASTA has completed.
Fix
Move the "keep samples without SV" responsibility upstream into GENERATE_MUTATED_PEPTIDES:
- Branch ch_sv into with_sv (runs NEOSV as before) and without_sv (emits [meta, []] placeholders).
- Mix both into sv_mut_fasta / sv_wt_fasta so each sample has exactly one entry on the SV outputs.
NETMHCSTABANDPAN.createNETMHCInput can then go back to a plain inner .join(sv_fastas_channel, by:0). Every sample matches; items flow per-sample as soon as their fastas are ready.
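As a rough illustration of the branch, pad, and mix idea, here is a toy Python model (not Nextflow code). The helper names, the fake_neosv stand-in, the file names, sample_big, and the choice to branch on an empty SV input list are all assumptions made for the example; the real branching criterion lives in GENERATE_MUTATED_PEPTIDES, and the real meta is a map rather than a string.

```python
def branch_and_pad(ch_sv, run_neosv):
    """Return one (meta, sv_fastas) entry per sample.

    Samples with SV input go through run_neosv; samples without it get an
    empty-list placeholder so a later inner join still finds a match.
    """
    mixed = []
    for meta, sv_input in ch_sv:
        if sv_input:                      # "with_sv" branch (assumed criterion)
            mixed.append((meta, run_neosv(sv_input)))
        else:                             # "without_sv" branch: placeholder
            mixed.append((meta, []))
    return mixed


def inner_join(left, right):
    """Plain inner join on the first tuple element (similar to by: 0)."""
    right_by_key = dict(right)
    return [(key, value, right_by_key[key])
            for key, value in left if key in right_by_key]


def fake_neosv(sv_input):
    # Hypothetical stand-in for the NEOSV step: one fasta name per input file.
    return [name + ".sv.fa" for name in sv_input]


fastas = [("sample_tiny", "tiny.fa"), ("sample_big", "big.fa")]
ch_sv = [("sample_big", ["big.bedpe"]), ("sample_tiny", [])]

# With padding, every sample has a partner on the SV side.
joined = inner_join(fastas, branch_and_pad(ch_sv, fake_neosv))

# Without padding, the inner join drops the sample that has no SV data.
unpadded = inner_join(fastas, [(m, fake_neosv(sv)) for m, sv in ch_sv if sv])
```

Because the placeholder guarantees a match for every key, the join no longer needs remainder semantics, and nothing has to wait for a channel to close before being emitted.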
Verification
Ran neoantigenpipeline (-profile test,docker) with a 2-sample sheet (one full-size MAF, one 1-variant MAF) before and after the fix and compared task timestamps from .nextflow.log. In the baseline (remainder: true), sample_tiny's first NETMHC submission came about 14 s after its own GENERATEMUTFASTA completed; with this PR it lands about 120 ms after.
The 14 s gap in baseline maps almost exactly to the runtime of the other sample's NETMHC3 task, which is the cross-sample serialization this PR removes.
The existing "empty SV channel" regression test from #246 still passes
— the subworkflow continues to tolerate an empty SV channel as a
defensive contract, even though callers are now expected to pre-pad.
Followup
Once merged, neoantigenpipeline needs the usual modules.json bump + re-sync of the two subworkflow files (same shape as commit 257659f in the pipeline that pulled in #246).