[QDP] [feature] Pr4 kronecker fwt by aloha1357 · Pull Request #1391 · apache/mahout

aloha1357 · 2026-06-07T19:34:06Z

Related Issues

related #1385

Changes

Why

For large qubit counts ($N > 12$), the standard Fast Walsh-Hadamard Transform (FWT) becomes severely bound by global-memory bandwidth. The FWT requires $N$ stages (that is, $\log_2$ of the state-vector length $2^N$), and each stage makes an in-place pass over the entire $2^N$-element state vector, which causes cache thrashing and repeated DRAM round trips.
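To make the memory-access pattern concrete, here is a minimal CPU reference sketch of the standard unnormalized butterfly FWT. It is not the CUDA kernel from this PR; the name `fwht_inplace` is hypothetical and chosen only for illustration. It returns the number of stages so the one-pass-per-qubit behaviour is visible.

```python
def fwht_inplace(state):
    """Unnormalized in-place fast Walsh-Hadamard transform of a length-2**n list.

    Returns the number of butterfly stages executed (hypothetical helper,
    not part of the PR).
    """
    size = len(state)
    assert size > 0 and size & (size - 1) == 0, "length must be a power of two"
    stages = 0
    h = 1
    while h < size:
        # One stage: every element of the vector is read and written once.
        for start in range(0, size, 2 * h):
            for j in range(start, start + h):
                a, b = state[j], state[j + h]
                state[j], state[j + h] = a + b, a - b
        h *= 2
        stages += 1
    return stages


vec = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # 3 qubits, basis state |000>
num_stages = fwht_inplace(vec)
print(num_stages, vec)  # 3 stages, all amplitudes become 1.0
```

Each iteration of the outer `while` loop touches the full vector, which is the per-stage global-memory traffic the description refers to once the vector no longer fits in cache.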

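To overcome this, we mathematically restructure the FWT as a Kronecker product decomposition: $H_n = H_{n_1} \otimes H_{n_2}$ with $n = n_1 + n_2$ (for an even split, $H_n = H_{n/2} \otimes H_{n/2}$), where $H_n$ is the $2^n \times 2^n$ Hadamard matrix on $n$ qubits. This transforms the sparse, memory-bound butterfly operations into standard dense matrix multiplications (GEMM) using a blocked architecture.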
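The Kronecker identity can be checked numerically on small sizes. The sketch below is an illustration only, using hypothetical helpers `hadamard` and `kron` on plain Python lists with unnormalized Hadamard matrices; it is not code from this PR.

```python
def hadamard(n):
    """Unnormalized 2**n x 2**n Hadamard matrix: entry (k, i) is (-1)**popcount(k & i)."""
    dim = 1 << n
    return [
        [-1 if bin(k & i).count("1") % 2 else 1 for i in range(dim)]
        for k in range(dim)
    ]


def kron(a, b):
    """Kronecker product of two dense matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[r // rows_b][c // cols_b] * b[r % rows_b][c % cols_b]
         for c in range(len(a[0]) * cols_b)]
        for r in range(len(a) * rows_b)
    ]


n1, n2 = 2, 3  # any split with n1 + n2 = n works, not only n/2 and n/2
identity_holds = kron(hadamard(n1), hadamard(n2)) == hadamard(n1 + n2)
print(identity_holds)  # True
```

A consequence is that, with the state vector viewed as a row-major $2^{n_1} \times 2^{n_2}$ matrix $X$, applying $H_n$ to the vector is the same as computing $H_{n_1} X H_{n_2}$, which is what allows the transform to be phrased as two dense multiplications.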

While the implicit Hadamard engine is not yet introduced (coming in PR 5), this PR establishes the structural memory layout, allocation, and transpose logic necessary for the decomposition.

How

Kronecker Decomposition Logic: Updated launch_iqp_encode_tc to dynamically split the state vector into two dimensions ($n_1$ and $n_2$).
Intermediate Allocations: Added temporary memory allocations (d_temp_real, d_temp_imag) to store the transposed matrix blocks during the 4-step blocked algorithm.
Naive GEMM Placeholder: Introduced naive_implicit_hadamard_gemm_kernel as a structural fallback placeholder. It computes the Hadamard entries on the fly as $(-1)^{\text{popc}(k \,\&\, i)}$ and executes the block multiplication $Z = X \cdot H_{n_2}$.
Matrix Layout Transform: Leveraged the iqp_tc_batch_transpose_kernel (introduced in PR 2) to transpose the blocks between the two GEMM stages, achieving the mathematical equivalent of the $O(M \log M)$ FWT, with $M = 2^N$ amplitudes, through dense GEMMs.
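The steps listed above can be sketched as a CPU reference. This is a minimal illustration under stated assumptions, not the kernels in this PR: the four steps are assumed to be GEMM, transpose, GEMM, transpose back; all function names are hypothetical; and the data is real-valued (for a complex state the same pipeline could be applied to the real and imaginary parts separately, since the Hadamard matrix is real).

```python
def hadamard_entry(k, i):
    """Unnormalized Hadamard entry generated on the fly: (-1)**popcount(k & i)."""
    return -1 if bin(k & i).count("1") % 2 else 1


def gemm_implicit_hadamard(x, rows, cols):
    """Z = X . H, with X a row-major rows x cols matrix and H the cols x cols Hadamard."""
    z = [0] * (rows * cols)
    for r in range(rows):
        for i in range(cols):
            acc = 0
            for k in range(cols):
                acc += x[r * cols + k] * hadamard_entry(k, i)
            z[r * cols + i] = acc
    return z


def transpose(x, rows, cols):
    """Row-major rows x cols matrix -> row-major cols x rows matrix."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]


def kronecker_fwt(state, n1, n2):
    """Apply H_{n1+n2} to `state` via two dense multiplications and two transposes."""
    rows, cols = 1 << n1, 1 << n2
    z = gemm_implicit_hadamard(state, rows, cols)  # step 1: Z = X . H_{n2}
    zt = transpose(z, rows, cols)                  # step 2: Z^T (cols x rows)
    w = gemm_implicit_hadamard(zt, cols, rows)     # step 3: W = Z^T . H_{n1}
    return transpose(w, cols, rows)                # step 4: back to rows x cols


def dense_reference(state):
    """Direct O(M^2) multiplication by the full Hadamard matrix, for comparison."""
    size = len(state)
    return [
        sum(hadamard_entry(k, i) * state[i] for i in range(size))
        for k in range(size)
    ]


state = list(range(32))  # 5 qubits, integer amplitudes so the comparison is exact
result = kronecker_fwt(state, n1=2, n2=3)
matches = result == dense_reference(state)
print(matches)  # True
```

Step 3 relies on the Hadamard matrix being symmetric, so multiplying the transposed block on the right by $H_{n_1}$ is equivalent to multiplying the original block on the left.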

Checklist

Added or updated unit tests for all changes (Verified passing against existing CI test suite)
Added or updated documentation for all changes (Added explanatory inline comments for PR)

aloha1357 added 6 commits June 7, 2026 18:33

feat(qdp): optimize phase kernel divergence and hoist constant mem computation

41c0f33

style(qdp): add explanatory comments for phase and iqp kernel optimizations

ca90282

feat(qdp): introduce batch throughput optimization scaffolding for TC

c7351cb

feat(qdp): introduce batch throughput optimization scaffolding for TC

60fda91

feat(qdp): introduce shared memory fused FWT for small qubit counts

3060ab9

feat(qdp): restructure FWT into Kronecker decomposition blocked architecture

38ee656

aloha1357 requested review from 400Ping, guan404ming and ryankert01 as code owners June 7, 2026 19:34

aloha1357 wants to merge 6 commits into apache:main from aloha1357:pr4-kronecker-fwt