perf(bb/msm): WebGPU MSM memory wins — pack chunks+signs, plan-ring, reduction alias, bufA stride, budget=180#23532
Draft
AztecBot wants to merge 2 commits into
…ing ping-pong, alias reduction buffers
The pair-tree halves per-window active-sum count at each level, so the odd-level outputs that land in bufA are roughly half the width of the even-level outputs that land in bufB. Previously both buffers were sized to the wider bufB stride. Splitting bufA's stride from bufB's (M1_A = batchWindows × wstride_oddOut, M1_B = batchWindows × wstride_evenOut) and pushing per-level (M_in, M_out) pairs through the planner / fused / carry / finalize / pad uniforms shrinks the active-sums footprint without changing the WGSL.

In isolation the picker would claw the savings back by collapsing to numBatches=1, so the lever-G budget also drops 248→180 MiB. The budget is the per-batch ceiling on the weakest mobile target; at logN=17 c=15 it keeps numBatches≥2 (9 windows/batch). No (n, c) we measure falls below 4 windows/batch.

logN=17 c=15 on macOS Sequoia / Chrome 148 (M2 reference desktop):
- baseline: 130.1 MiB, 66.3 ms, cross_ok
- + step 5 + budget=180: 120.2 MiB, 68.2 ms, cross_ok
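To make the stride arithmetic concrete, here is a minimal TypeScript sketch of how split strides could be derived and what they save. It is not the PR's host code: `pairTreeStrides`, `activeSumSlots`, the exact-halving model, and the sample numbers are all assumptions for illustration, and the real strides likely include padding, so the real saving is smaller than this idealized case.

```typescript
// Hypothetical sketch: names follow the commit message (M1_A, M1_B), but the
// functions and the exact-halving model are illustrative assumptions.

interface Strides {
  wstrideEvenOut: number; // widest per-window output of even levels (lands in bufB)
  wstrideOddOut: number;  // widest per-window output of odd levels (lands in bufA)
}

// Assumes level 0 emits `level0Width` sums per window and every later level
// emits ceil(previous / 2).
function pairTreeStrides(level0Width: number, levels: number): Strides {
  let width = level0Width;
  let wstrideEvenOut = 0;
  let wstrideOddOut = 0;
  for (let lv = 0; lv < levels; lv++) {
    if (lv % 2 === 0) {
      wstrideEvenOut = Math.max(wstrideEvenOut, width);
    } else {
      wstrideOddOut = Math.max(wstrideOddOut, width);
    }
    width = Math.ceil(width / 2);
  }
  return { wstrideEvenOut, wstrideOddOut };
}

// Slot counts for the two active-sum buffers, before and after the split.
function activeSumSlots(batchWindows: number, s: Strides) {
  const M1_A = batchWindows * s.wstrideOddOut;  // bufA
  const M1_B = batchWindows * s.wstrideEvenOut; // bufB
  const before = 2 * M1_B;                      // both buffers at the wider stride
  const after = M1_A + M1_B;                    // each buffer at its own stride
  return { M1_A, M1_B, before, after };
}

// Sample numbers are made up; only batchWindows=9 echoes the text above.
const strides = pairTreeStrides(1000, 4);
const slots = activeSumSlots(9, strides);
console.log(strides, slots);
```

Multiplying the slot counts by the per-sum byte size gives the footprint that the budget picker then compares against its ceiling.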
Four memory-reducing refactors on top of zw/msm-webgpu-experiments-v2. Together they take the GPU storage footprint at logN=17, c=15 from 149.3 MiB → 120.2 MiB (−29.1 MiB / −19%) without runtime regression on M2 (Apple Silicon). Cross-check against WASM Pippenger passes.

1. Pack chunksBuf + signsBuf into one u32

decompose_scalars_booth already produces a bucket (≤ 2^14 for c=15, fits in 15 bits) and a 1-bit neg for every (point, window) slot. Until now those went to two separate array<u32> storage buffers of size batchSlots × 4 each. Combined them into a single u32 per slot (bucket in bits 0-14, neg in bit 15), with the three downstream readers each pulling the field they need:

- transpose_count_tiled / transpose_scatter_tiled: let col = all_csr_col_idx[i] & 0x7fffu;
- csr_to_v2_active_sums (both INDEX_MODE and non-INDEX_MODE): let neg = (signs[...] >> 15u) & 1u;

Host drops signsBuf entirely and removes the signs binding from the decompose layout (4 entries instead of 5). The signs symbol still appears in csr_to_v2_active_sums (now bound to the packed chunksBuf since signsBuf = chunksBuf in the host).

Saves 4 × batchSlots bytes per MSM (≈ 6.3 MiB at N=2^18 c=15, batchWindows=6).

Important debug note: WGSL uniform-controlled shifts are miscompiled on at least Apple Silicon + Adreno
First attempt used bucket | (neg << c) where c = params.z (a u32 uniform with value 15). Cross-check produced wrong, non-deterministic results on both M2 and S25, despite the encoding being semantically identical. Changing to the constant bucket | (neg << 15u) (same value, same bit position) made both devices pass. That's a real toolchain issue worth filing upstream against Dawn / Tint; for now this PR sticks to constant shift amounts.

2. Drop the plan-ring ping-pong

chunkPlanRing / scatterPlanRing / carryPlanRing were each allocated as 2-buffer rings indexed by lv & 1. They are written by plannerB and read by the same level's fused / carry; each level's WebGPU compute pass ends (with the implicit pass barrier) before the next level's plannerB writes. No cross-level read/write race exists, so the ping-pong is unnecessary.

Collapsed each ring to a single buffer (chunkPlanRing.push(cp, cp); scatterPlanRing.push(sp, sp); carryPlanRing.push(yp, yp);) so existing [ring] indexing keeps working but the three duplicate allocations vanish.

Saves chunkPlanRing[1] + scatterPlanRing[1] + carryPlanRing[1] (≈ 10 MiB at N=2^18, less at smaller N).

Note: countsBufs[0/1] and offsetsBufs[0/1] are NOT collapsed: plannerA does an in-place read of countsBufs[inIdx] while writing countsBufs[outIdx], so collapsing those would race within the same dispatch.

3. Alias reduction-only buffers into batch-loop buffers

redBuf / isPresentBuf / reducePrefScratch are only live during reduction, which runs strictly after the outer batch loop completes. bufA / valIdxBuf / bufB are live during the batch loop but dead by the time reduction runs. Aliased the reduction buffers as offset-0 slices of the batch-loop buffers via { buffer, offset, size } bindings: same underlying GPU allocation, two non-overlapping logical lifetimes.

Sizes verified at prepare time:

- bufA.size >= 64·RED_M (= 17.8 MiB at N=2^18)
- valIdxBuf.size >= 4·RED_M (= 1.1 MiB)
- bufB.size >= NUM_WINDOWS·REDUCE_WG·MAXC·2·16 (= 8.9 MiB)

Saves 3 separate allocations (redBuf + isPresentBuf + reducePrefScratch) entirely.

4. Tighten bufA stride + lower MEM_BUDGET 248→180 MiB

The pair-tree halves per-window active-sum count at each level, so odd-level outputs (which land in bufA) are roughly half the width of even-level outputs (which land in bufB). The prior code sized both buffers to the larger wstride1, so bufA was effectively wasting ~25-40 % of its allocation.

Split the strides (M1_A = batchWindows × wstride_oddOut, M1_B = batchWindows × wstride_evenOut) and pushed per-level (M_in, M_out) pairs through the planner / fused / carry / finalize / pad uniforms (ba_fused_super_bench and ba_carry_copy_bench already accept distinct M_old / M_new; the change is host-only). padParams now has three variants, L0 (bufB-out), BA (bufB→bufA) and AB (bufA→bufB), selected by output parity.

In isolation the picker would claw the savings back by collapsing to numBatches=1, so MEM_BUDGET also drops 248 → 180 MiB. The budget is the per-batch ceiling for the weakest mobile target; at logN=17 c=15 it keeps numBatches ≥ 2 (9 windows/batch). No (n, c) we measure falls below 4 windows/batch.

Saves ~10 MiB at logN=17 c=15; larger at logN=18 (lever scales with bufA's true working set).
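As a host-side illustration of the packed slot layout from step 1, the TypeScript sketch below mirrors the two WGSL reader expressions (the `& 0x7fffu` mask and the `>> 15u` shift). The function names `packSlot`, `unpackBucket` and `unpackNeg` are hypothetical and do not come from the PR, where the packing itself happens inside decompose_scalars_booth on the GPU. The shift amount is kept as a compile-time constant, echoing the PR's workaround.

```typescript
// Hypothetical host-side mirror of the packed (bucket, neg) slot.
// Layout assumed from the PR text: bucket in bits 0-14, neg in bit 15.
const NEG_SHIFT = 15;       // constant shift, never read from a uniform
const BUCKET_MASK = 0x7fff; // low 15 bits

function packSlot(bucket: number, neg: number): number {
  if (!Number.isInteger(bucket) || bucket < 0 || bucket > BUCKET_MASK) {
    throw new RangeError(`bucket ${bucket} does not fit in 15 bits`);
  }
  if (neg !== 0 && neg !== 1) {
    throw new RangeError(`neg must be 0 or 1, got ${neg}`);
  }
  // >>> 0 keeps the result an unsigned 32-bit value.
  return (bucket | (neg << NEG_SHIFT)) >>> 0;
}

// Counterpart of: let col = all_csr_col_idx[i] & 0x7fffu;
function unpackBucket(packed: number): number {
  return packed & BUCKET_MASK;
}

// Counterpart of: let neg = (signs[...] >> 15u) & 1u;
function unpackNeg(packed: number): number {
  return (packed >>> NEG_SHIFT) & 1;
}

// One u32 per slot instead of two parallel u32 arrays.
const packedSlots = new Uint32Array([
  packSlot(0, 0),
  packSlot(16384, 1), // 2^14, the largest bucket the text quotes for c=15
  packSlot(5, 1),
]);
console.log(Array.from(packedSlots, (p) => [unpackBucket(p), unpackNeg(p)]));
```

A round-trip check like this is cheap to run against a CPU reference when changing the bit layout, since a wrong shift shows up as a mismatched sign long before the full MSM cross-check.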
Diagnostics also included
Small things that are useful to keep:
- __msm_mem_last window global + console.log at end of MsmV2.prepare reporting prepBuffers.length, totalBytes, numBatches, batchWindows, M1_A, M1_B, captured into the mem field of autorun=msm-cross-check JSONL output so per-step memory accounting is grep-able from the bench harness.
- coi=1%26autorun%3D... URL-unpacking helper in dev/msm-webgpu/main.ts so BrowserStack mobile sessions (which truncate at the first literal & in the URL) can pass autorun + logn params through the coi value.
- … dev/msm-webgpu/scripts/run-browserstack.mjs documenting the project policy: when validating a memory change on this branch, dispatch one S25 --n 17 BrowserStack job; M2 / Pixel cross-references just burn wall clock without adding signal.

Measured at logN=17, c=15 on fresh BrowserStack workers

[Results table not preserved; the only surviving fragment is the commit hash 0999593b2a6.]

S25 swing on step 3 (+5%) is within the BS-S25 per-run variance for this workload; final S25 number on step 4 (pack) lands back at parity. S25 validation on the bufA-stride commit is pending: two consecutive BS S25 workers came back with zero /progress events in 15 min while the M2 desktop worker on the same tunnel URL passed in ~3 min. Same BS-S25 routing flakiness Zac flagged earlier on this branch; the code path itself is exercised at numBatches=2 on M2 with cross_ok=true, so the logic is sound. Anyone with a real S25 can re-bench against the tip; the dev-page autorun captures the headline directly.

Out of scope (planned follow-ups)
These are real follow-up wins but were deferred:
- … prefScratchBuf (≈ 17 MiB at N=2^18). Mechanically straightforward, but at WGI=128 × S=8 the per-workgroup shared footprint hits the 32 KiB maxComputeWorkgroupStorageSize ceiling, forcing 1 workgroup per SM on Adreno and regressing S25 runtime by ≈ 23 %. Needs S=4 or a hybrid layout to bring shared usage under the budget.
- … ba_fused_super_bench so the intermediate level output never lands in global memory. Needs the planner topology to align thread fragments across the two fused levels (level L+1 pairings inside a thread's S/2 register outputs); non-trivial.

Created by claudebox · group: slackbot