[QDP] [feature] Pr4 kronecker fwt #1391
Open
aloha1357 wants to merge 6 commits into
Related Issues
related #1385
Changes
Why
For large qubit counts ($N > 12$), the standard Fast Walsh-Hadamard Transform (FWT) becomes severely bound by global memory bandwidth. The FWT requires $N = \log_2(2^N)$ stages of in-place memory access, each sweeping the entire $2^N$-element state vector, which causes cache thrashing and repeated DRAM round trips.
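To make the access pattern concrete, here is a minimal pure-Python reference of the standard butterfly FWT. It is a sketch for illustration only, not the CUDA code in this PR: the name `fwt_inplace` is hypothetical, normalization is omitted, and the returned stage count exists only to show that every stage touches the whole vector.

```python
def fwt_inplace(state):
    """Unnormalized in-place fast Walsh-Hadamard transform.

    `state` is a list whose length is a power of two (2**N for N qubits).
    Returns the number of butterfly stages executed.
    """
    size = len(state)
    assert size > 0 and size & (size - 1) == 0, "length must be a power of two"

    stages = 0
    h = 1
    while h < size:
        # One stage: every element of the vector is read and written once,
        # with partners that sit h elements apart.
        for start in range(0, size, 2 * h):
            for j in range(start, start + h):
                a, b = state[j], state[j + h]
                state[j], state[j + h] = a + b, a - b
        h *= 2
        stages += 1
    return stages
```

On a GPU, each of these stages is a full pass over global memory once the vector no longer fits in cache, and the partner stride `h` grows with every stage.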
To overcome this, we mathematically restructure the FWT as a Kronecker product decomposition: $H_n = H_{n/2} \otimes H_{n/2}$, where $H_n$ denotes the Hadamard transform on $n$ qubits. This transforms the sparse, memory-bound butterfly operations into standard dense matrix multiplications (GEMM) using a blocked architecture.
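The decomposition can be checked on the host with a small pure-Python sketch. It assumes an even qubit count, a row-major reshape of the state vector into a square block, and a GEMM, transpose, GEMM, transpose ordering. All function names here are hypothetical, and the Hadamard entries are generated from the standard identity $(-1)^{\text{popc}(k \,\&\, i)}$ rather than stored.

```python
import math


def hadamard_entry(k, i):
    """Entry (k, i) of the unnormalized Hadamard matrix: (-1) ** popc(k & i)."""
    return -1 if bin(k & i).count("1") & 1 else 1


def implicit_hadamard_gemm(block):
    """Return H @ block for a square d x d block, generating H on the fly."""
    d = len(block)
    return [
        [sum(hadamard_entry(r, k) * block[k][c] for k in range(d)) for c in range(d)]
        for r in range(d)
    ]


def transpose(block):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*block)]


def kronecker_fwt(state):
    """Apply H_{n/2} (x) H_{n/2} to a length-2**n vector (n even, unnormalized)."""
    size = len(state)
    d = math.isqrt(size)
    assert d * d == size and d & (d - 1) == 0, "length must be 4**m"

    x = [state[r * d:(r + 1) * d] for r in range(d)]  # row-major d x d view
    y = implicit_hadamard_gemm(x)   # step 1: H X
    y = transpose(y)                # step 2: (H X)^T = X^T H
    y = implicit_hadamard_gemm(y)   # step 3: H X^T H
    y = transpose(y)                # step 4: (H X^T H)^T = H X H
    return [v for row in y for v in row]
```

The result matches a dense multiplication by the full Hadamard matrix because, for a row-major index $k = r \cdot d + c$, the bits of $r$ and $c$ do not overlap, so the full entry factors into the product of the two half-size entries.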
While the implicit Hadamard engine is not yet introduced (coming in PR 5), this PR establishes the structural memory layout, allocation, and transpose logic necessary for the decomposition.
How
- `launch_iqp_encode_tc`: dynamically splits the state vector into two dimensions, using (`d_temp_real`, `d_temp_imag`) to store the transposed matrix blocks during the 4-step blocked algorithm.
- `naive_implicit_hadamard_gemm_kernel`: a fallback structural placeholder. It calculates the Hadamard values on the fly ($\text{popc}(k \ \&\ i)$) and executes the block multiplication.
- `iqp_tc_batch_transpose_kernel` (introduced in PR 2): transposes the blocks between the two GEMM stages, achieving the mathematical equivalent of the
Checklist