perf(qdp): inline norm calculation in phase kernel loop for 5% speedup by botszhuang · Pull Request #1380 · apache/mahout

botszhuang · 2026-06-06T03:00:48Z

Related Issues

Relates to #1227

Changes

Why

This PR aims to speed up the CUDA kernel in phase.cu by optimizing the norm calculation.

Through PTX analysis and empirical benchmarks, I found that folding the norm multiplication into the existing loop improves performance more than using pow() or focusing solely on branch divergence.

How

Optimization Details & Insights: norm Calculation Optimization (The Real Winner 🚀)

Hypothesis: Computing the normalization factor with pow(M_SQRT1_2, num_qubits) via the SFU (special function unit) would be faster.
Benchmark Reality: pow() actually introduced overhead and slowed down the execution.
The Better Solution: Embedding the norm scaling (norm *= M_SQRT1_2;) directly into the loop alongside the phase accumulation achieved the best results. This allows the compiler to pipeline the instructions effectively.
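The inlined scaling described above can be sketched on the host in plain C++. This is a minimal reference under stated assumptions, not the kernel from phase.cu: the names `PhaseNorm`, `phase_and_norm_inline`, `norm_pow`, and `kInvSqrt2` are hypothetical, and the loop shape is inferred from the kernel comment φ(idx) = Σ_k phases[k] * b_k and from the `norm *= M_SQRT1_2;` line in the diff.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration; not a name from phase.cu.
struct PhaseNorm {
    double phi;   // accumulated phase for one basis index
    double norm;  // (1/sqrt(2))^num_qubits
};

// 1/sqrt(2), spelled out because M_SQRT1_2 is not guaranteed by standard C++.
constexpr double kInvSqrt2 = 0.70710678118654752440;

// Assumed loop shape: one pass over the qubits accumulates the phase and
// scales the norm in the same iteration, as the PR describes.
PhaseNorm phase_and_norm_inline(const std::vector<double>& phases, std::size_t idx) {
    // Shifting by the full bit width or more would be undefined behavior.
    assert(phases.size() <= sizeof(std::size_t) * 8);
    double phi = 0.0;
    double norm = 1.0;
    for (std::size_t k = 0; k < phases.size(); ++k) {
        // b_k = (idx >> k) & 1, applied as a multiply to stay branch-free.
        phi += phases[k] * static_cast<double>((idx >> k) & 1u);
        norm *= kInvSqrt2;
    }
    return {phi, norm};
}

// The pow()-based alternative from the hypothesis, kept for comparison.
double norm_pow(std::size_t num_qubits) {
    return std::pow(kInvSqrt2, static_cast<double>(num_qubits));
}
```

Both routes should agree on the normalization factor up to floating-point rounding; the difference the PR measures is in execution time on the GPU, which this host-side sketch does not reproduce.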

Benchmark Results

Environment: RunPod GPU instance (CUDA 12.8, NVVM 7.0.1)
Configuration: Grid size: 2048, Block size: 512

Implementation	Execution Time (ms)	Speedup
Original	0.4434 ms	Baseline
Inline Norm in Loop (This PR)	0.3995 ms	~5% Performance Gain

Checklist

Added or updated unit tests for all changes
Added or updated documentation for all changes

Copilot

Pull request overview

This PR optimizes the CUDA phase_encode_kernel in qdp/qdp-kernels/src/phase.cu by folding the normalization-factor computation into the existing per-qubit loop that accumulates the phase, aiming to reduce overhead and improve kernel throughput.

Changes:

Inline norm accumulation (norm *= M_SQRT1_2) inside the bit loop in phase_encode_kernel.
Remove the separate phase_norm(num_qubits) call for the non-batch kernel path.



    // φ(idx) = Σ_k phases[k] * b_k,  b_k = (idx >> k) & 1
    double phi = 0.0;
+    double norm = 1.0 ;


+
+        norm *= M_SQRT1_2;


ryankert01

overall lgtm

ryankert01 · 2026-06-07T05:30:02Z

Need to fix the pre-commit failures and address Copilot's comments.

perf(qdp): inline norm calculation in phase kernel loop for 5% speedup

5a38c15

botszhuang requested review from 400Ping and ryankert01 as code owners June 6, 2026 03:00

ryankert01 requested a review from Copilot June 6, 2026 03:04

Copilot started reviewing on behalf of ryankert01 June 6, 2026 03:04

Copilot AI reviewed Jun 6, 2026


Comment thread qdp/qdp-kernels/src/phase.cu

// φ(idx) = Σ_k phases[k] * b_k, b_k = (idx >> k) & 1

double phi = 0.0;

double norm = 1.0 ;

Comment thread qdp/qdp-kernels/src/phase.cu

Comment on lines +61 to +62

norm *= M_SQRT1_2;

ryankert01 reviewed Jun 6, 2026

botszhuang wants to merge 1 commit into apache:main from botszhuang:perf/optimize-phase-kernel
