Fix per-batch clip threshold in pearson_residuals HVG#4138
Open
Arshammik wants to merge 2 commits into
Open
Conversation
…esiduals `_highly_variable_pearson_residuals` reassigned the function-scoped `clip` parameter from `None` to `sqrt(n_batch_0)` on the first batch iteration, which made the `if clip is None:` check False on every subsequent batch. The first batch's clip was therefore reused for every other batch, even though Lause-Berens-Kobak (2021) specifies a per-batch threshold of `sqrt(n_cells_in_batch)`. For an unbalanced split — e.g. two batches of 5000 and 200 cells — the small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be clipped were not, inflating per-gene residual variance and shifting HVG ranking. The bug also made the result depend on alphabetical batch-label ordering (`np.unique` picks the first label), which is a reproducibility hazard. The fix introduces a loop-local `clip_batch` variable so the per-batch default is recomputed each iteration. The user-facing `clip` parameter is untouched. Two new tests in `tests/test_highly_variable_genes.py` cover the regression: one instruments `_calculate_res_sparse` to capture the per-batch clip value, the other verifies that swapping batch labels A<->B on identical data produces identical `residual_variances`.
for more information, see https://pre-commit.ci
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi,
In
_highly_variable_pearson_residuals, theclipparameter is reassigned inside the per-batch loop when it comes in asNone:if clip is None: clip = np.sqrt(n). After the first iteration,clipis no longerNone, so every subsequent batch silently reuses the first batch's threshold instead of computing its ownsqrt(n_batch).This matters because Lause, Berens & Kobak (2021) define the clip as
sqrt(N)over the cells entering the residual computation, and withbatch_keyeach batch is its own residual computation. For a 5000/200 split, the small batch getssqrt(5000) ≈ 70.7instead ofsqrt(200) ≈ 14.1, so residuals that should be clipped in the small batch are not, its per-gene variances are inflated, and HVG ranking shifts. There is also a quieter reproducibility consequence: becausenp.uniquesorts the labels, the threshold that leaks into all batches is whichever one is alphabetically first, so renaming "A"↔"B" on the same data changes the HVG output.The fix is a one-line scoping change — a loop-local
clip_batchthat is recomputed each iteration from the current batch'sn. The user-facingclipargument is untouched, so any caller that passed an explicit value still gets exactly that value applied to every batch.Two tests in
tests/test_highly_variable_genes.pycover this. The first monkeypatches_calculate_res_sparseto capture theclipvalue passed on each batch and asserts two distinct values appear for a 5000/200 split; the second swaps batch labels A↔B on identical data and assertsresidual_variancesis unchanged. Both fail onmainand pass on this branch, and the rest of the pearson-residual suite is green.rapids-singlecell ports this routine verbatim and inherits the same bug; I have a parallel fix queued there that will reference this PR number once assigned.
Thanks,
Arsham