Skip to content

perf(qdp): inline norm calculation in phase kernel loop for 5% speedup#1380

Open
botszhuang wants to merge 1 commit into
apache:mainfrom
botszhuang:perf/optimize-phase-kernel
Open

perf(qdp): inline norm calculation in phase kernel loop for 5% speedup#1380
botszhuang wants to merge 1 commit into
apache:mainfrom
botszhuang:perf/optimize-phase-kernel

Conversation

@botszhuang

Copy link
Copy Markdown

Related Issues

Relates to #1227

Changes

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Why

This PR aims to strengthen the CUDA kernel implementation of phase.cu by optimizing the norm calculation.

Through PTX analysis and empirical benchmarks, I found that combining the norm multiplication into the existing loop yields a better performance improvement than using pow() or solely focusing on branch divergence.

How

Optimization Details & Insights: norm Calculation Optimization (The Real Winner 🚀)

  • Hypothesis: Using pow(M_SQRT1_2, num_qubits) via SFU would speed up the normalization factor.
  • Benchmark Reality: pow() actually introduced overhead and slowed down the execution.
  • The Better Solution: Embedding the norm scaling (norm *= M_SQRT1_2;) directly into the loop alongside the phase accumulation achieved the best results. This allows the compiler to pipelining the instructions effectively.

Benchmark Results

  • Environment: RunPod GPU instance (CUDA 12.8, NVVM 7.0.1)
  • Configuration: Grid size: 2048, Block size: 512
Implementation Execution Time (ms) Speedup
Original 0.4434 ms Baseline
Inline Norm in Loop (This PR) 0.3995 ms ~5% Performance Gain

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the CUDA phase_encode_kernel in qdp/qdp-kernels/src/phase.cu by folding the normalization-factor computation into the existing per-qubit loop that accumulates the phase, aiming to reduce overhead and improve kernel throughput.

Changes:

  • Inline norm accumulation (norm *= M_SQRT1_2) inside the bit loop in phase_encode_kernel.
  • Remove the separate phase_norm(num_qubits) call for the non-batch kernel path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


// φ(idx) = Σ_k phases[k] * b_k, b_k = (idx >> k) & 1
double phi = 0.0;
double norm = 1.0 ;
Comment on lines +61 to +62

norm *= M_SQRT1_2;

@ryankert01 ryankert01 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm

@ryankert01

Copy link
Copy Markdown
Member

Need to fix precommit and copilot's comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants