
[QDP] [feature] Pr6 tensor core acceleration#1389

Open
aloha1357 wants to merge 8 commits into apache:main from aloha1357:pr6-tensor-core-acceleration

@aloha1357

Related Issues

Related to #1385

Changes

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Why

PR1–PR5 delivered the specialized ImplicitHadamardOzakiEngine (matrix-free, perfect ±1 quantization, Kronecker-blocked FWT for IQP). The full AdaptiveGEMM research engine (AdaptiveOzakiEngine), however, existed only in the final research snapshot and in the standalone pybind module (adaptive_gemm_py). That engine provides mixed-precision graded-ring GEMM: Ozaki + CRT over 7 primes, hybrid FP64/INT8 TC, Phase26 persistent kernels, and general A @ B for arbitrary matrices.
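To make the "matrix-free" Hadamard path concrete, here is a minimal CPU sketch, assuming FWT refers to the fast Walsh-Hadamard transform in natural (Sylvester) ordering. It is not the ImplicitHadamardOzakiEngine code; the function names are hypothetical and chosen only for illustration. It shows that the butterfly transform gives the same result as multiplying by the explicit ±1 matrix, without ever building that matrix.

```rust
/// In-place, unnormalized fast Walsh-Hadamard transform (natural ordering).
/// The slice length must be a power of two.
fn fwht_in_place(data: &mut [f64]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut half = 1;
    while half < n {
        // Each pass combines pairs that are `half` apart (one butterfly stage).
        for block in (0..n).step_by(2 * half) {
            for i in block..block + half {
                let (x, y) = (data[i], data[i + half]);
                data[i] = x + y;
                data[i + half] = x - y;
            }
        }
        half *= 2;
    }
}

/// Entry (i, j) of the Sylvester Hadamard matrix: (-1)^popcount(i & j).
fn hadamard_entry(i: usize, j: usize) -> f64 {
    if (i & j).count_ones() % 2 == 0 { 1.0 } else { -1.0 }
}

/// Reference O(n^2) product with the explicit Hadamard matrix.
fn hadamard_matvec_explicit(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    (0..n)
        .map(|i| (0..n).map(|j| hadamard_entry(i, j) * x[j]).sum::<f64>())
        .collect()
}
```

For the input [1, 2, 3, 4], both fwht_in_place and hadamard_matvec_explicit yield [10, -2, -4, 0]. The butterfly version needs O(n log n) additions and no stored matrix, which is the structure a general A @ B engine cannot rely on.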

To finalize the pipeline, we must hook up the general engine for "non-Hadamard logic", i.e., any case where the second operand is not the specially structured Hadamard matrix. This enables future general Tensor Core accelerated linear algebra inside QDP (beyond pure IQP FWT) while reusing the same Ozaki INT8 TC machinery.
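The "CRT over 7 primes" idea behind the general path can be sketched on the CPU. The example below is an illustration only, not the AdaptiveOzakiEngine implementation: the seven primes, the function names, and the integer-only inputs are assumptions, and the Ozaki splitting of floating-point operands into integer slices is omitted. It multiplies two integer matrices modulo each small prime (the kind of narrow arithmetic an INT8 Tensor Core path could serve), then recombines the residues into the exact product with the Chinese Remainder Theorem.

```rust
/// Illustrative moduli only: seven primes below 256.
/// The primes used by the real engine are not stated in this PR.
const PRIMES: [i128; 7] = [211, 223, 227, 229, 233, 239, 241];

type IntMatrix = Vec<Vec<i64>>;

/// Modular exponentiation by repeated squaring.
fn mod_pow(base: i128, mut exp: i128, modulus: i128) -> i128 {
    let mut acc: i128 = 1;
    let mut b = base.rem_euclid(modulus);
    while exp > 0 {
        if exp % 2 == 1 {
            acc = acc * b % modulus;
        }
        b = b * b % modulus;
        exp /= 2;
    }
    acc
}

/// C = A * B with every operand and partial sum reduced modulo p.
fn matmul_mod(a: &IntMatrix, b: &IntMatrix, p: i128) -> Vec<Vec<i128>> {
    let (m, k, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0i128; n]; m];
    for i in 0..m {
        for j in 0..n {
            for l in 0..k {
                let x = (a[i][l] as i128).rem_euclid(p);
                let y = (b[l][j] as i128).rem_euclid(p);
                c[i][j] = (c[i][j] + x * y) % p;
            }
        }
    }
    c
}

/// Incremental CRT: rebuild the signed integer whose residues are given.
/// Exact only while the true value lies in (-M/2, M/2], M = product of PRIMES.
fn crt_signed(residues: &[i128]) -> i128 {
    let (mut x, mut m) = (0i128, 1i128);
    for (&r, &p) in residues.iter().zip(PRIMES.iter()) {
        // Fermat inverse of m modulo the prime p.
        let inv = mod_pow(m, p - 2, p);
        let t = (r - x).rem_euclid(p) * inv % p;
        x += m * t;
        m *= p;
    }
    if x > m / 2 { x - m } else { x }
}

/// Exact integer GEMM assembled from the seven modular products.
fn matmul_crt(a: &IntMatrix, b: &IntMatrix) -> IntMatrix {
    let per_prime: Vec<Vec<Vec<i128>>> =
        PRIMES.iter().map(|&p| matmul_mod(a, b, p)).collect();
    (0..a.len())
        .map(|i| {
            (0..b[0].len())
                .map(|j| {
                    let residues: Vec<i128> =
                        per_prime.iter().map(|c| c[i][j]).collect();
                    crt_signed(&residues) as i64
                })
                .collect()
        })
        .collect()
}

/// Plain i64 reference product for comparison.
fn matmul_direct(a: &IntMatrix, b: &IntMatrix) -> IntMatrix {
    let (m, k, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0i64; n]; m];
    for i in 0..m {
        for j in 0..n {
            for l in 0..k {
                c[i][j] += a[i][l] * b[l][j];
            }
        }
    }
    c
}
```

With A = [[1, -2], [3, 4]] and B = [[5, 6], [-7, 8]], matmul_crt returns [[19, -10], [-13, 50]], matching matmul_direct. The product of these seven primes is about 3.3e16, so the reconstruction stays exact only while every output entry is below roughly half of that in magnitude.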

How

  • Transplanted AdaptiveOzaki.cu (the full hybrid/persistent/general GEMM implementation), recovered via Git archaeology from pr-final-version, into the clean PR chain.
  • Updated qdp/qdp-kernels/build.rs to compile AdaptiveOzaki.cu alongside the Implicit path.
  • Added launch_adaptive_ozaki_gemm C FFI entry point (wraps AdaptiveOzakiEngine::execute with default Phase26Hybrid config) + matching declaration and no-cuda stub in lib.rs.
  • Added inline English comments, prefixed with // PR6:, at every changed site.
  • Verified end-to-end: wsl -e bash -ic 'export PATH=/usr/local/cuda/bin:$PATH && cd .../qdp && cargo test --workspace --exclude qdp-python --lib' passes with 0 failures (builds the new CUDA symbols successfully).

The new public kernel symbol launch_adaptive_ozaki_gemm can now be called from Rust (qdp-core) or exposed upward, providing the general non-Hadamard TC path that complements launch_iqp_encode_tc (Hadamard-specialized).
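The declaration-plus-stub pattern described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in lib.rs: the parameter list of launch_adaptive_ozaki_gemm, the cuda feature name, the STATUS_NO_CUDA value, and the adaptive_gemm_raw wrapper are all hypothetical. Only the symbol name comes from this PR.

```rust
use std::os::raw::c_void;

/// Hypothetical status code returned by the stub when CUDA is not compiled in.
pub const STATUS_NO_CUDA: i32 = 999;

// CUDA build: the symbol is resolved at link time against the object compiled
// from AdaptiveOzaki.cu. The parameter list here is an assumption.
#[cfg(feature = "cuda")]
extern "C" {
    pub fn launch_adaptive_ozaki_gemm(
        a: *const f64,
        b: *const f64,
        c: *mut f64,
        m: i32,
        n: i32,
        k: i32,
        stream: *mut c_void,
    ) -> i32;
}

// No-CUDA build: a stub with the same signature, so callers still compile
// and link, and simply receive an error status at run time.
#[cfg(not(feature = "cuda"))]
pub unsafe extern "C" fn launch_adaptive_ozaki_gemm(
    _a: *const f64,
    _b: *const f64,
    _c: *mut f64,
    _m: i32,
    _n: i32,
    _k: i32,
    _stream: *mut c_void,
) -> i32 {
    STATUS_NO_CUDA
}

#[derive(Debug, PartialEq)]
pub enum GemmError {
    /// A dimension was zero or did not fit in an i32.
    BadShape,
    /// The kernel launcher returned a nonzero status.
    Status(i32),
}

/// Convert a dimension to the i32 the C side expects, rejecting 0 and overflow.
fn to_dim(d: usize) -> Result<i32, GemmError> {
    if d == 0 || d > i32::MAX as usize {
        Err(GemmError::BadShape)
    } else {
        Ok(d as i32)
    }
}

/// Hypothetical thin wrapper: validates the shape, then maps the C status
/// code to a Result.
///
/// # Safety
/// In a CUDA build, `a`, `b`, and `c` must be valid device pointers to
/// m*k, k*n, and m*n elements respectively.
pub unsafe fn adaptive_gemm_raw(
    a: *const f64,
    b: *const f64,
    c: *mut f64,
    m: usize,
    n: usize,
    k: usize,
) -> Result<(), GemmError> {
    let (m, n, k) = (to_dim(m)?, to_dim(n)?, to_dim(k)?);
    // A null stream is assumed here to mean "default stream".
    let status =
        unsafe { launch_adaptive_ozaki_gemm(a, b, c, m, n, k, std::ptr::null_mut()) };
    if status == 0 { Ok(()) } else { Err(GemmError::Status(status)) }
}
```

Without the cuda feature, calling adaptive_gemm_raw with any nonzero shape returns Err(GemmError::Status(STATUS_NO_CUDA)). This is why a --lib test run can exercise the Rust wrappers on a machine with no GPU.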

Checklist

  • Added or updated unit tests for all changes (verified against the existing CI test suite: the --lib tests exercise the build and the Rust wrappers; GPU execution of the new path is covered by the existing bench harness)
  • Added or updated documentation for all changes (added explanatory inline comments for this PR, plus the PR06 doc in internal-dev-notes)
