[QDP] [feature] PR6 tensor core acceleration #1389
Open
aloha1357 wants to merge 8 commits into
…C GEMM on non-Hadamard logic (PR6). Tests: wsl cargo test passed (0 failures). PR6 comments added.
Related Issues
related #1385
Changes
Why
PR1–PR5 delivered the specialized ImplicitHadamardOzakiEngine (matrix-free, +/-1 perfect quantization, Kronecker-blocked FWT for IQP). However, the full AdaptiveGEMM research engine (AdaptiveOzakiEngine), which provides mixed-precision graded-ring GEMM (Ozaki + CRT over 7 primes, hybrid FP64/INT8 TC, Phase26 persistent kernels, general A @ B for arbitrary matrices), was only present in the final research snapshot and the standalone pybind module (adaptive_gemm_py).
To finalize the pipeline, we must hook up the general engine for "non-Hadamard logic", i.e., any case where the second operand is not the special structured Hadamard matrix. This enables future general Tensor Core accelerated linear algebra inside QDP (beyond pure IQP FWT) while reusing the same Ozaki INT8 TC machinery.
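For reviewers unfamiliar with the "CRT over 7 primes" idea, here is a minimal CPU-side sketch in Rust. It is not the kernel from AdaptiveOzaki.cu: the moduli below are hypothetical (the PR does not list the actual primes), and the function names are made up for illustration. It only shows the principle that an exact integer dot product can be recovered from small per-modulus residues, which is what lets the work run on narrow integer units.

```rust
// Hypothetical pairwise-coprime moduli; the engine's real primes are not stated in this PR.
const MODULI: [i128; 7] = [97, 101, 103, 107, 109, 113, 127];

/// Modular inverse of `a` modulo `m` via the extended Euclidean algorithm.
/// Assumes gcd(a, m) == 1.
fn mod_inv(a: i128, m: i128) -> i128 {
    let (mut old_r, mut r) = (a.rem_euclid(m), m);
    let (mut old_s, mut s) = (1i128, 0i128);
    while r != 0 {
        let q = old_r / r;
        (old_r, r) = (r, old_r - q * r);
        (old_s, s) = (s, old_s - q * s);
    }
    old_s.rem_euclid(m)
}

/// Dot product of two INT8 vectors reduced modulo `p` at every step,
/// so every intermediate value stays small.
fn dot_mod(a: &[i8], b: &[i8], p: i128) -> i128 {
    a.iter().zip(b).fold(0, |acc, (&x, &y)| {
        (acc + (x as i128).rem_euclid(p) * (y as i128).rem_euclid(p)) % p
    })
}

/// Chinese Remainder reconstruction, mapped to the signed range (-M/2, M/2].
fn crt_signed(residues: &[i128; 7]) -> i128 {
    let big_m: i128 = MODULI.iter().product();
    let mut x = 0i128;
    for (&r, &p) in residues.iter().zip(MODULI.iter()) {
        let m_i = big_m / p; // product of all other moduli
        x = (x + r * m_i % big_m * mod_inv(m_i % p, p)) % big_m;
    }
    if x > big_m / 2 { x - big_m } else { x }
}

/// Exact integer dot product recovered from the seven residues.
/// Valid as long as the true result lies within (-M/2, M/2].
fn exact_dot_via_crt(a: &[i8], b: &[i8]) -> i128 {
    let mut residues = [0i128; 7];
    for (slot, &p) in residues.iter_mut().zip(MODULI.iter()) {
        *slot = dot_mod(a, b, p);
    }
    crt_signed(&residues)
}
```

Each residue computation is independent of the others, which is why this kind of scheme maps naturally onto many small integer GEMMs followed by one reconstruction step.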
How
- Ported AdaptiveOzaki.cu (full hybrid/persistent/general GEMM implementation) from pr-final-version into the clean PR chain.
- Updated qdp/qdp-kernels/build.rs to compile AdaptiveOzaki.cu alongside the Implicit path.
- Added the launch_adaptive_ozaki_gemm C FFI entry point (wraps AdaptiveOzakiEngine::execute with the default Phase26Hybrid config), plus a matching declaration and no-cuda stub in lib.rs.
- Added // PR6: inline English comments on all changed sites.
- Tests: wsl -e bash -ic 'export PATH=/usr/local/cuda/bin:$PATH && cd .../qdp && cargo test --workspace --exclude qdp-python --lib' passes with 0 failures (builds the new CUDA symbols successfully).
The new public kernel symbol launch_adaptive_ozaki_gemm can now be called from Rust (qdp-core) or exposed upward, providing the general non-Hadamard TC path that complements launch_iqp_encode_tc (Hadamard-specialized).
Checklist
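To make the "declaration plus no-cuda stub" pattern from the How section concrete, here is a rough Rust sketch. The parameter list, the `cuda` feature name, the error code, and the `status_to_result` helper are all assumptions for illustration; the actual signature in lib.rs is not shown in this description and may differ.

```rust
use std::os::raw::{c_int, c_void};

/// Hypothetical status code the stub returns when built without CUDA.
pub const ERR_CUDA_UNAVAILABLE: c_int = 999;

// With CUDA enabled, the symbol would be resolved from the compiled .cu object.
// The argument list here is an assumed shape (device pointers, dimensions, stream).
#[cfg(feature = "cuda")]
extern "C" {
    pub fn launch_adaptive_ozaki_gemm(
        a_d: *const f64,
        b_d: *const f64,
        c_d: *mut f64,
        m: c_int,
        n: c_int,
        k: c_int,
        stream: *mut c_void,
    ) -> c_int;
}

// Without CUDA, a stub with the same signature keeps callers compiling
// and reports the missing backend through the return code.
#[cfg(not(feature = "cuda"))]
pub unsafe extern "C" fn launch_adaptive_ozaki_gemm(
    _a_d: *const f64,
    _b_d: *const f64,
    _c_d: *mut f64,
    _m: c_int,
    _n: c_int,
    _k: c_int,
    _stream: *mut c_void,
) -> c_int {
    ERR_CUDA_UNAVAILABLE
}

/// Hypothetical helper: turn the raw status code into a Result for Rust callers.
pub fn status_to_result(code: c_int) -> Result<(), String> {
    if code == 0 {
        Ok(())
    } else {
        Err(format!("launch_adaptive_ozaki_gemm failed with status {code}"))
    }
}
```

Keeping the stub's signature identical to the real declaration means qdp-core call sites do not need their own cfg gates; they only have to handle a non-zero status.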