perf(qdp): inline norm calculation in phase kernel loop for 5% speedup #1380
Open
botszhuang wants to merge 1 commit into
Contributor
Pull request overview
This PR optimizes the CUDA phase_encode_kernel in qdp/qdp-kernels/src/phase.cu by folding the normalization-factor computation into the existing per-qubit loop that accumulates the phase, aiming to reduce overhead and improve kernel throughput.
Changes:
- Inline `norm` accumulation (`norm *= M_SQRT1_2`) inside the `bit` loop in `phase_encode_kernel`.
- Remove the separate `phase_norm(num_qubits)` call for the non-batch kernel path.
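The change listed above can be illustrated with a small host-side C++ sketch. This is not the code from `phase.cu`: the function names `phase_amplitude_inlined` and `phase_amplitude_pow`, the constant `kInvSqrt2`, and the use of `std::vector` and `std::polar` are assumptions made for illustration only. It shows the per-index work with the norm folded into the same loop that accumulates the phase, next to a `pow()`-based variant for comparison.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// 1/sqrt(2), spelled out because M_SQRT1_2 is not guaranteed by standard C++.
constexpr double kInvSqrt2 = 0.70710678118654752440;

// Hypothetical host-side mirror of the per-index work: the norm is
// accumulated in the same loop that accumulates the phase.
std::complex<double> phase_amplitude_inlined(const std::vector<double>& phases,
                                             std::size_t idx) {
    double phi = 0.0;
    double norm = 1.0;
    for (std::size_t bit = 0; bit < phases.size(); ++bit) {
        // phi(idx) = sum_k phases[k] * b_k, with b_k = (idx >> k) & 1
        if ((idx >> bit) & 1U) {
            phi += phases[bit];
        }
        // One factor of 1/sqrt(2) per qubit, so norm ends at 2^(-n/2).
        norm *= kInvSqrt2;
    }
    return std::polar(norm, phi);  // norm * exp(i * phi)
}

// Reference variant: same phase loop, norm computed separately with pow().
std::complex<double> phase_amplitude_pow(const std::vector<double>& phases,
                                         std::size_t idx) {
    double phi = 0.0;
    for (std::size_t bit = 0; bit < phases.size(); ++bit) {
        if ((idx >> bit) & 1U) {
            phi += phases[bit];
        }
    }
    const double norm =
        std::pow(kInvSqrt2, static_cast<double>(phases.size()));
    return std::polar(norm, phi);
}
```

Both variants should agree up to floating-point rounding, since repeated multiplication and `pow()` may differ in the last few bits. Whether the inlined form is faster on a given GPU is a separate question that only a benchmark of the real kernel can answer.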
// φ(idx) = Σ_k phases[k] * b_k, b_k = (idx >> k) & 1
double phi = 0.0;
double norm = 1.0;
Comment on lines +61 to +62
norm *= M_SQRT1_2;
Member
Need to fix pre-commit and Copilot's comments.
Related Issues
Relates to #1227
Changes
Why
This PR aims to strengthen the CUDA kernel implementation of `phase.cu` by optimizing the norm calculation. Through PTX analysis and empirical benchmarks, I found that combining the norm multiplication into the existing loop yields a better performance improvement than using `pow()` or solely focusing on branch divergence.
How
Optimization Details & Insights: norm Calculation Optimization (The Real Winner 🚀)
- The initial hypothesis was that `pow(M_SQRT1_2, num_qubits)` via the SFU would speed up the normalization factor.
- In practice, `pow()` introduced overhead and slowed down the execution.
- Inlining `norm *= M_SQRT1_2;` directly into the loop alongside the phase accumulation achieved the best results. This allows the compiler to pipeline the instructions effectively.
Benchmark Results
Checklist