Skip to content

Fix per-batch clip threshold in pearson_residuals HVG#4138

Open
Arshammik wants to merge 2 commits into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch
Open

Fix per-batch clip threshold in pearson_residuals HVG#4138
Arshammik wants to merge 2 commits into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch

Conversation

@Arshammik
Copy link
Copy Markdown

Hi,

In _highly_variable_pearson_residuals, the clip parameter is reassigned inside the per-batch loop when it comes in as None: if clip is None: clip = np.sqrt(n). After the first iteration, clip is no longer None, so every subsequent batch silently reuses the first batch's threshold instead of computing its own sqrt(n_batch).

This matters because Lause, Berens & Kobak (2021) define the clip as sqrt(N) over the cells entering the residual computation, and with batch_key each batch is its own residual computation. For a 5000/200 split, the small batch gets sqrt(5000) ≈ 70.7 instead of sqrt(200) ≈ 14.1, so residuals that should be clipped in the small batch are not, its per-gene variances are inflated, and HVG ranking shifts. There is also a quieter reproducibility consequence: because np.unique sorts the labels, the threshold that leaks into all batches is whichever one is alphabetically first, so renaming "A"↔"B" on the same data changes the HVG output.

The fix is a one-line scoping change — a loop-local clip_batch that is recomputed each iteration from the current batch's n. The user-facing clip argument is untouched, so any caller that passed an explicit value still gets exactly that value applied to every batch.

Two tests in tests/test_highly_variable_genes.py cover this. The first monkeypatches _calculate_res_sparse to capture the clip value passed on each batch and asserts two distinct values appear for a 5000/200 split; the second swaps batch labels A↔B on identical data and asserts residual_variances is unchanged. Both fail on main and pass on this branch, and the rest of the pearson-residual suite is green.

rapids-singlecell ports this routine verbatim and inherits the same bug; I have a parallel fix queued there that will reference this PR number once assigned.

Thanks,
Arsham

Arshammik and others added 2 commits May 23, 2026 13:25
…esiduals

`_highly_variable_pearson_residuals` reassigned the function-scoped `clip`
parameter from `None` to `sqrt(n_batch_0)` on the first batch iteration,
which made the `if clip is None:` check False on every subsequent batch.
The first batch's clip was therefore reused for every other batch, even
though Lause-Berens-Kobak (2021) specifies a per-batch threshold of
`sqrt(n_cells_in_batch)`.

For an unbalanced split — e.g. two batches of 5000 and 200 cells — the
small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its
own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be
clipped were not, inflating per-gene residual variance and shifting HVG
ranking. The bug also made the result depend on alphabetical batch-label
ordering (`np.unique` picks the first label), which is a reproducibility
hazard.

The fix introduces a loop-local `clip_batch` variable so the per-batch
default is recomputed each iteration. The user-facing `clip` parameter is
untouched.

Two new tests in `tests/test_highly_variable_genes.py` cover the
regression: one instruments `_calculate_res_sparse` to capture the
per-batch clip value, the other verifies that swapping batch labels A<->B
on identical data produces identical `residual_variances`.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 23, 2026

⚠️ JUnit XML file not found

The CLI was unable to find any JUnit XML files to upload.
For more help, visit our troubleshooting guide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant