Ctrls performance by JPRichings · Pull Request #739 · QuEST-Kit/QuEST

JPRichings · 2026-04-28T20:57:32Z

Initial pass on performance changed to remove thrust device_vector that is used to move ctrls to device that is causing a performance impact due to the thrust device_vector moving data from host to device on construction.

Updates version references and fixes doxygen parsing of CMake configured header broken in QuEST-Kit#616

JPRichings · 2026-04-29T07:25:58Z

Todo:

Confirm no horrible race condition is introduced by cudamemcpyToSymbol
Set the ctrls buffer size programmatically or to some sensible limit, say 50? (at least 64KB of constant memory on device so no worries giving ourselves some extra room
Apply fix to all kernels in QuEST

Assumptions in this code:

Single operation on quantum register at a time which allows the assumption that ctrls can be overwritten before each subsequent gate application.

otbrown · 2026-05-03T16:44:03Z

Alloc size 64 qubits -- add as macro somewhere if not done already MAX_QUREG_SIZE

TysonRayJones · 2026-05-03T16:45:39Z

-const int NUM_THREADS_PER_BLOCK = 128;
+const int NUM_THREADS_PER_BLOCK =128;
+
+__device__ __constant__ int ctrl_device[30]; 


TODO: use a MAX_NUM_QUBITS = 64 or something constant in constants.hpp

TysonRayJones · 2026-05-03T18:28:13Z

Reminder of other stuff from meeting:

storing targets also in constant mem
retain passing ptrs to kernels; dispatcher passes constant mem ptr

JPRichings · 2026-05-05T20:32:54Z

Way easier to do this instead:

Beginning with CUDA 12.1, you can now pass up to 32,764 bytes as kernel parameters on NVIDIA Volta and above

#define TOTAL_PARAMS (8000) // ints

typedef struct {
    int param[TOTAL_PARAMS];
} param_large_t;

__global__ void kernelLargeParam(__grid_constant__ const param_large_t p,...) {
    // access all parameters from p
}

int main() {
    param_large_t p_large;
    kernelLargeParam<<<GRIDDIM,BLOCKDIM>>>(p_large,...);
    cudaDeviceSynchronize();
}

Note that in both preceding examples, kernel parameters are annotated with the grid_constant qualifier to indicate they are read-only.

reference: https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/

I think this solves our concerns about multi-qureg operations both accessing a common ctrls cache in future.

(this is also much better than the variadic kernel idea I had earlier today)

Other benefits:

Performance: sweet sweet ~~200ns~~ more like a few micro seconds now ive checked my profiling data (could be another factor of two here) performance save possible by removing cudaMemcpyToSymbol call.
Removes any risk of race condition

JPRichings · 2026-05-05T21:39:12Z

Profiling to confirm but here is an extra factor of 2 over #729 (comment):

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 1 with probability 0.5
Qubit 1 measured in state 1 with probability 0.5
Qubit 2 measured in state 1 with probability 0.5
Qubit 3 measured in state 0 with probability 0.5
Qubit 4 measured in state 0 with probability 0.5
Qubit 5 measured in state 0 with probability 0.5
Qubit 6 measured in state 1 with probability 0.5
Qubit 7 measured in state 0 with probability 0.5
Qubit 8 measured in state 1 with probability 0.5
Qubit 9 measured in state 1 with probability 0.5
Qubit 10 measured in state 0 with probability 0.5
Qubit 11 measured in state 0 with probability 0.5
Qubit 12 measured in state 0 with probability 0.5
Qubit 13 measured in state 1 with probability 0.5
Qubit 14 measured in state 1 with probability 0.5
Qubit 15 measured in state 1 with probability 0.5
Qubit 16 measured in state 0 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 1 with probability 0.5
Qubit 19 measured in state 0 with probability 0.5

Final state:
|11100010110001110010>
QFT run time: 0.00141512s
Total run time: 2.36306s

JPRichings · 2026-05-05T21:47:02Z

profile to confirm no data movement outside of kernel launch at all:

…memory copy

TysonRayJones · 2026-05-06T02:02:01Z

Is it possible to pass multiple, distinct arguments to the one function that way? E.g. so that the target qubits of kernel_statevec_anyCtrlAnyTargDiagMatr_sub are also __grid_constant__? It'd be gross (but not the end of the world) if each kernel must manually isolate the targets array from a single ctrlsAndTargs array (although sneakily, kernel_statevec_anyCtrlOneTargDenseMatr_subA already performs such a trick).

The pattern of gpu_subroutines.cpp "secretly" passing __device__ __constant__ to unchanged kernels which still receive a ctrls ptr (and separately, a targs ptr) is still more pleasant to me, because then only gpu_subroutines.cpp needs to know about the optimisation being done, at copy-time. But a 2x speedup is nothing to sneeze at (if I interpreted that right)!

Orthogonally, do we have consternations about bumping min CUDA to 12.1 (released 2023)?

JPRichings · 2026-05-06T07:33:39Z

I think it is possible to have multiple arguments passed this way. All this is (and its what I tried to do first with this optimisation) capture by value in the kernel launch a c style array. The __grid_const__ is just to make sure its treated as read only on the device.

I think this is much cleaner than the data movement to const memory as instead of pointers to device constant memory we just pass the array directly. Only change to the original kernel (before all this optimisation stuff) is then we need to access a data member of a struct. There is no other change inside the kernel and we aren't accessing a global device variable out of the blue mid kernel. In gpu_subroutines.cpp if I can get util_getSorted to return directly to the struct we are using for hiding the c array then there will be no change to that code apart from a function returning to a new small struct opposed to a vector<int>.

We probably don't need to to change cuda versions unless we think we are passing over more than 4096 bytes in the kernel. I don't think we are anywhere near this but no objection to moving to CUDA 12.1.

JPRichings · 2026-05-06T07:36:50Z

Just to hammer home the performance improvement here. The red is the explicit cudamemcpytosymbol is the previous ctrls buffer which is now eliminated in this version.

JPRichings · 2026-05-06T07:39:09Z

Finally usual caveats that correctness checking needs to take place with a full run of the test suite!

…ance

…rmance

TysonRayJones · 2026-05-10T00:01:18Z

Oh I see! And this is for ~12-15 qubits? Pretty neat to reduce the CUDA overhead generally and lower the "GPU is worthwhile" threshold! 🎉

There might be a chance for a more systematic change here! The...

struct QubitList_t {
    int indices[64];
    int length;
};

struct could actually live in a higher level than just the GPU backend, in order to avoid STL copy overheads. See #720. Whatcha think??

JPRichings · 2026-05-10T10:27:26Z

The results above are for 20 qubits but here are some other results for grace-hopper:

10 qubits

QFT run time: 0.00058519s
Total run time: 2.33269s

12 qubits

QFT run time: 0.000601382s
Total run time: 2.35085s

14 quits:

QFT run time: 0.000762312s
Total run time: 2.35207s

16 qubits:

QFT run time: 0.000954826s
Total run time: 2.34424s

18 qubits:

QFT run time: 0.00115166s
Total run time: 2.36775s

Yes I agree we should move the QubitList_t struct higher and think about versions of some of the sorting routine to return directly to it so we avoid the calls to std::copy I have introduced in gpu_subroutines.cpp which I have only done for now to make sure I can run the tests with this as a wider change and verify that I have not introduced a correctness issue.

JPRichings · 2026-05-10T10:45:57Z

268/269 tests passing on Grace-hopper. Failed at the last hurdle:

ctest --rerun-failed
Test project /work/jriching/QuEST/QuEST/build_qft_perf_test
    Start 269: density evolution
1/1 Test #269: density evolution ................***Failed    4.11 sec

Looking at LatestTest.Log:

269/269 Testing: density evolution
269/269 Test: density evolution
Command: "/work/jriching/QuEST/QuEST/build_qft_perf_test/tests/tests" "density evolution"
Directory: /work/jriching/QuEST/QuEST/build_qft_perf_test/tests
"density evolution" start time: May 10 11:07 UTC
Output:
----------------------------------------------------------

QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     0
  GPU-accelerated: 1
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       1

Testing configuration:
  test all deployments:  1
  num qubits in qureg:   6
  max num qubit perms:   0
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  CPU
  CPU + OMP
  GPU
  GPU + OMP

Filters: "density evolution"
Randomness seeded to: 3308388244


A fatal internal QuEST error occurred. A CUDA (or cuQuantum) API function ("cudaDeviceSynchronize()", called by "gpu_sync()" at line 423 of file /work/jriching/QuEST/QuEST/quest/src/gpu/gpu_config.cpp) unexpectedly failed with error message: "an illegal memory access was encountered".  Please report this to the QuEST developers. QuEST will now exit...
<end of output>
Test time =   4.12 sec
----------------------------------------------------------
Test Failed.
"density evolution" end time: May 10 11:07 UTC
"density evolution" time elapsed: 00:00:04
----------------------------------------------------------

I have also noticed that there is a pattern in some of the very slow running (> 700 sec) tests:

applyMultiControlled* > 700 sec
applyMultiStateControlled* > 800 sec

Additionally I need to look into gpu_thrust.cuh as many tests that run for >150 sec have some reference to devints used in preparation of the call to the thrust kernel. Need to be more careful here as we aren't launching a kernel explicitly this might be a sensible way to move data across to the device.

JPRichings · 2026-05-10T11:04:20Z

Bit verbose but this will be useful for following up on performance:

QuEST/QuEST/buildt/Testing/Temporary/CTestCostData.txt

calcExpecPauliStr 1 111.374
calcExpecPauliStrSum 1 23.62
calcExpecNonHermitianPauliStrSum 1 23.6486
calcProbOfBasisState 1 6.78975
calcProbOfQubitOutcome 1 3.995
calcProbOfMultiQubitOutcome 1 111.467
calcProbsOfAllMultiQubitOutcomes 1 196.123
calcTotalProb 1 8.74531
calcPurity 1 8.95273
calcPartialTrace 1 72.9458
calcReducedDensityMatrix 1 83.3372
calcInnerProduct 1 9.857
calcFidelity 1 7.67669
calcDistance 1 9.8056
calcExpecFullStateDiagMatr 1 5.54069
calcExpecNonHermitianFullStateDiagMatr 1 5.57989
calcExpecFullStateDiagMatrPower 1 7.80309
calcExpecNonHermitianFullStateDiagMatrPower 1 5.54473
createKrausMap 1 3.34362
destroyKrausMap 1 3.31724
syncKrausMap 1 3.32191
setKrausMap 1 3.33788
setInlineKrausMap 1 3.32143
createInlineKrausMap 1 3.31377
createSuperOp 1 3.34811
syncSuperOp 1 3.32742
destroySuperOp 1 3.32706
setSuperOp 1 3.3244
setInlineSuperOp 1 3.33166
createInlineSuperOp 1 3.32602
setInputErrorHandler 1 3.32443
setMaxNumReportedSigFigs 1 3.33791
setNumReportedNewlines 1 3.31851
setSeeds 1 3.70708
setSeedsToDefault 1 3.33279
getSeeds 1 3.31845
getNumSeeds 1 3.32073
setValidationOn 1 3.32466
setValidationOff 1 3.32849
setValidationEpsilon 1 3.34232
getValidationEpsilon 1 3.31746
setValidationEpsilonToDefault 1 3.33799
getGpuCacheSize 1 3.31944
mixDephasing 1 3.73546
mixDepolarising 1 3.79199
mixDamping 1 3.71387
mixPaulis 1 3.80087
mixTwoQubitDephasing 1 5.78137
mixTwoQubitDepolarising 1 8.37151
mixKrausMap 1 162.109
mixSuperOp 1 47.6485
mixQureg 1 14.2456
initQuESTEnv 1 3.34704
initCustomQuESTEnv 1 3.32871
finalizeQuESTEnv 1 3.31579
syncQuESTEnv 1 3.33123
isQuESTEnvInit 1 3.3398
getQuESTEnv 1 3.34385
initBlankState 1 3.38173
initZeroState 1 3.38048
initPlusState 1 3.38689
initClassicalState 1 6.89791
initDebugState 1 3.38512
initRandomPureState 1 3.33594
initRandomMixedState 1 3.37718
initArbitraryPureState 1 3.38164
setQuregAmps 1 3.47167
setDensityQuregFlatAmps 1 106.265
setDensityQuregAmps 1 106.94
setQuregToRenormalized 1 4.37365
setQuregToPauliStrSum 1 36.7767
setQuregToWeightedSum 1 5.6768
setQuregToMixture 1 5.67355
getCompMatr1 1 3.34783
getCompMatr2 1 3.32858
getDiagMatr1 1 3.31559
getDiagMatr2 1 3.32089
getInlineCompMatr1 1 3.35783
getInlineCompMatr2 1 3.32345
getInlineDiagMatr1 1 3.32309
getInlineDiagMatr2 1 3.32536
createCompMatr 1 3.31991
createDiagMatr 1 3.33885
createFullStateDiagMatr 1 3.3267
createCustomFullStateDiagMatr 1 3.32743
destroyCompMatr 1 3.32808
destroyDiagMatr 1 3.31117
destroyFullStateDiagMatr 1 3.33926
syncCompMatr 1 3.31526
syncDiagMatr 1 3.3278
syncFullStateDiagMatr 1 3.34792
setCompMatr 1 3.32386
setDiagMatr 1 3.34495
setInlineCompMatr 1 3.32217
setInlineDiagMatr 1 3.33285
createInlineCompMatr 1 3.34076
createInlineDiagMatr 1 3.32878
applyPauliStr 1 149.474
applyControlledPauliStr 1 146.788
applyMultiControlledPauliStr 1 738.002
applyMultiStateControlledPauliStr 1 850.248
applyPauliGadget 1 149.776
applyControlledPauliGadget 1 149.601
applyMultiControlledPauliGadget 1 737.999
applyMultiStateControlledPauliGadget 1 850.479
applyCompMatr1 1 3.6647
applyControlledCompMatr1 1 5.31939
applyMultiControlledCompMatr1 1 148.922
applyMultiStateControlledCompMatr1 1 191.736
applyCompMatr2 1 5.31837
applyControlledCompMatr2 1 11.7938
applyMultiControlledCompMatr2 1 149.321
applyMultiStateControlledCompMatr2 1 179.492
applyCompMatr 1 164.008
applyControlledCompMatr 1 152.858
applyMultiControlledCompMatr 1 755.555
applyMultiStateControlledCompMatr 1 865.366
applyDiagMatr1 1 3.67318
applyControlledDiagMatr1 1 5.31798
applyMultiControlledDiagMatr1 1 148.511
applyMultiStateControlledDiagMatr1 1 190.596
applyDiagMatr2 1 5.3313
applyControlledDiagMatr2 1 11.7908
applyMultiControlledDiagMatr2 1 148.573
applyMultiStateControlledDiagMatr2 1 179.98
applyDiagMatr 1 150.188
applyControlledDiagMatr 1 148.879
applyMultiControlledDiagMatr 1 739.293
applyMultiStateControlledDiagMatr 1 851.13
applyDiagMatrPower 1 154.502
applyControlledDiagMatrPower 1 151.267
applyMultiControlledDiagMatrPower 1 745.335
applyMultiStateControlledDiagMatrPower 1 853.222
applyHadamard 1 3.67582
applyControlledHadamard 1 5.30016
applyMultiControlledHadamard 1 148.973
applyMultiStateControlledHadamard 1 190.854
applyPauliX 1 3.67056
applyControlledPauliX 1 5.29498
applyMultiControlledPauliX 1 149.194
applyMultiStateControlledPauliX 1 192.415
applyPauliY 1 3.66142
applyControlledPauliY 1 5.29976
applyMultiControlledPauliY 1 149.317
applyMultiStateControlledPauliY 1 190.198
applyPauliZ 1 3.67885
applyControlledPauliZ 1 5.30673
applyMultiControlledPauliZ 1 148.816
applyMultiStateControlledPauliZ 1 190.268
applyT 1 3.6669
applyControlledT 1 5.34152
applyMultiControlledT 1 148.989
applyMultiStateControlledT 1 190.991
applyS 1 3.65763
applyControlledS 1 5.28771
applyMultiControlledS 1 149.02
applyMultiStateControlledS 1 190.209
applySwap 1 5.29885
applyControlledSwap 1 11.7378
applyMultiControlledSwap 1 148.57
applyMultiStateControlledSwap 1 179.61
applySqrtSwap 1 5.29736
applyControlledSqrtSwap 1 11.7838
applyMultiControlledSqrtSwap 1 148.241
applyMultiStateControlledSqrtSwap 1 179.783
applyRotateX 1 3.66782
applyControlledRotateX 1 5.30752
applyMultiControlledRotateX 1 149.678
applyMultiStateControlledRotateX 1 190.092
applyRotateY 1 3.67177
applyControlledRotateY 1 5.31969
applyMultiControlledRotateY 1 149.699
applyMultiStateControlledRotateY 1 190.291
applyRotateZ 1 3.68264
applyControlledRotateZ 1 5.3241
applyMultiControlledRotateZ 1 149.333
applyMultiStateControlledRotateZ 1 191.383
applyRotateAroundAxis 1 3.66109
applyControlledRotateAroundAxis 1 5.3217
applyMultiControlledRotateAroundAxis 1 148.817
applyMultiStateControlledRotateAroundAxis 1 190.488
applyMultiQubitNot 1 149.64
applyControlledMultiQubitNot 1 147.36
applyMultiControlledMultiQubitNot 1 741.002
applyMultiStateControlledMultiQubitNot 1 844.728
applyPhaseGadget 1 150.134
applyControlledPhaseGadget 1 150.011
applyMultiControlledPhaseGadget 1 734.513
applyMultiStateControlledPhaseGadget 1 845.181
applyPhaseFlip 1 3.66469
applyTwoQubitPhaseFlip 1 5.32346
applyPhaseShift 1 3.6686
applyTwoQubitPhaseShift 1 5.2961
applyMultiQubitPhaseFlip 1 150.797
applyMultiQubitPhaseShift 1 148.513
applyQuantumFourierTransform 1 212.901
applyFullQuantumFourierTransform 1 4.39363
applyQubitProjector 1 11.0974
applyMultiQubitProjector 1 117.39
applyForcedQubitMeasurement 1 17.8114
applyForcedMultiQubitMeasurement 1 224.574
applyMultiQubitMeasurement 1 224.563
applyMultiQubitMeasurementAndGetProb 1 224.743
applyQubitMeasurement 1 10.5807
applyQubitMeasurementAndGetProb 1 10.6015
applyFullStateDiagMatr 1 5.67109
applyFullStateDiagMatrPower 1 8.01737
applyNonUnitaryPauliGadget 1 118.27
leftapplySwap 1 5.29122
leftapplyPauliX 1 3.6467
leftapplyPauliY 1 3.66651
leftapplyPauliZ 1 3.65803
leftapplyPauliStr 1 147.699
leftapplyPauliGadget 1 148.527
leftapplyCompMatr1 1 3.67249
leftapplyCompMatr2 1 5.27853
leftapplyDiagMatr1 1 3.67793
leftapplyDiagMatr2 1 5.27731
rightapplySwap 1 5.1439
rightapplyPauliX 1 3.63003
rightapplyPauliY 1 3.63552
rightapplyPauliZ 1 3.65388
rightapplyPauliStr 1 130.827
rightapplyPauliGadget 1 130.796
rightapplyCompMatr1 1 3.64559
rightapplyCompMatr2 1 5.12272
rightapplyDiagMatr1 1 3.64558
rightapplyDiagMatr2 1 5.12309
leftapplyCompMatr 1 160.32
leftapplyDiagMatr 1 147.98
leftapplyDiagMatrPower 1 149.051
leftapplyMultiQubitNot 1 148.03
leftapplyPhaseGadget 1 147.817
rightapplyCompMatr 1 137.316
rightapplyDiagMatr 1 131.113
rightapplyDiagMatrPower 1 132.7
rightapplyMultiQubitNot 1 130.699
rightapplyPhaseGadget 1 131.18
leftapplyFullStateDiagMatr 1 5.62564
rightapplyFullStateDiagMatr 1 5.61791
leftapplyFullStateDiagMatrPower 1 5.60299
rightapplyFullStateDiagMatrPower 1 5.62341
leftapplyQubitProjector 1 11.0756
rightapplyQubitProjector 1 10.6165
leftapplyMultiQubitProjector 1 114.89
rightapplyMultiQubitProjector 1 114.963
leftapplyPauliStrSum 1 3.50447
rightapplyPauliStrSum 1 3.51169
getPauliStr 1 3.37964
getInlinePauliStr 1 3.32521
createPauliStrSum 1 3.32533
createInlinePauliStrSum 1 3.33886
createPauliStrSumFromFile 1 3.32114
createPauliStrSumFromReversedFile 1 3.3202
destroyPauliStrSum 1 3.32624
createQureg 1 3.32918
createDensityQureg 1 3.3562
createForcedQureg 1 3.32444
createForcedDensityQureg 1 3.3414
createCustomQureg 1 3.32286
createCloneQureg 1 3.42995
destroyQureg 1 3.31656
getQuregAmp 1 3.32783
getDensityQuregAmp 1 3.33482
getQuregAmps 1 3.32589
getDensityQuregAmps 1 3.37533
getQcomp 1 3.31289
complex 0 1
density 0 0
density 0 0
density evolution 0 0
density evolution 0 0

Test configuration (I think I forgot to reduce the complexity):

QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     0
  GPU-accelerated: 1
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       1

Testing configuration:
  test all deployments:  1
  num qubits in qureg:   6
  max num qubit perms:   0
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  CPU
  CPU + OMP
  GPU
  GPU + OMP

JPRichings · 2026-05-10T12:26:54Z

Coming back to my comment on gpu_thrust.cuh performance:

This needs to be thought about in more detail as its going to need some very careful thought about when data should be moved to GPU in functor_insertBits and what is expected by thrust in thrust::make_transform_iterator.

I think given that this seems like a distinct change I will raise a separate pull request for it.

JPRichings · 2026-05-12T09:35:28Z

After pulling devel in and rerunning tests:

273/274 Test #273: complex arithmetic ............................   Passed    3.35 sec
        Start 274: density evolution
274/274 Test #274: density evolution .............................***Failed    4.14 sec

99% tests passed, 1 tests failed out of 274

Total Test time (real) = 10804.53 sec

The following tests FAILED:
        274 - density evolution (Failed)
Errors while running CTest

274/274 Testing: density evolution
274/274 Test: density evolution
Command: "/work/jriching/QuEST/QuEST_ctrls/build/tests/tests" "density evolution"
Directory: /work/jriching/QuEST/QuEST_ctrls/build/tests
"density evolution" start time: May 12 00:27 UTC
Output:
----------------------------------------------------------

QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     0
  GPU-accelerated: 1
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       1

Testing configuration:
  test all deployments:  0
  num qubits in qureg:   6
  max num qubit perms:   0
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  GPU + OMP

Filters: "density evolution"
Randomness seeded to: 1767441457


A fatal internal QuEST error occurred. A CUDA (or cuQuantum) API function ("cudaDeviceSynchronize()", called by "gpu_sync()" at line 423 of file /work/jriching/QuEST/QuEST_ctrls/quest/src/gpu/gpu_config.cpp) unexpectedly failed with error message: "an illegal memory access was encountered".  Please report this to the QuEST developers. QuEST will now exit...
<end of output>
Test time =   4.14 sec
----------------------------------------------------------
Test Failed.
"density evolution" end time: May 12 00:27 UTC
"density evolution" time elapsed: 00:00:04
----------------------------------------------------------

End testing: May 12 00:27 UTC

This needs further investigation cudastanatize run over the test should hopefully help to track this one down.

JPRichings · 2026-05-12T09:43:48Z

Some of the test failures here on github are due to rocm not recognising __grid_constant__ which could be very annoying...

But does look like this is supported in rocm docs:

https://rocm.docs.amd.com/projects/llvm-project/en/docs-7.2.3/LLVM/clang/html/AttributeReference.html#grid-constant

Further investigation required.

Note: This doesn't scupper us it just means we might need to provide a custom type which changes for cuda and hip compiles. The use of __grid_constant__ is nice as it effectively declares data constant on the device on capture on kernel launch but the performance improvement is mainly due to capture by value in the kernel launch.

JPRichings · 2026-05-14T14:25:13Z

Capture of simple struct in kernel launch tested on AMD so we might loose the benefit of __grid_constant__ to make the device data read only but we will still remove the overhead of implicity or explicit copy of data to the device outside of the kernel launch

TysonRayJones · 2026-05-17T19:04:13Z

Capture of simple struct in kernel launch tested on AMD so we might loose the benefit of __grid_constant__ to make the device data read only but we will still remove the overhead of implicity or explicit copy of data to the device outside of the kernel launch

Aw shoot! Does removing __grid_constant__ still achieve correct behaviour? Could do a regrettable but defensible macro (similar to what inliner.hpp does), like:

#if defined(__NVCC__)

    #define _NO_COPY_OPT __grid_constant__

#elif defined(__HIP__)

    #define _NO_COPY_OPT

#endif

Btw the reference you link indicates:

The __grid_constant__ attribute can be applied to a const-qualified kernel function argument and allows compiler to take the address of that argument without making a copy. The argument applies to sm_70 or newer GPUs, during compilation with CUDA-11.7(PTX 7.7) or newer, and is ignored otherwise.

Use of __grid_constant__ appears to require we bump the minimum compute capability. Right now, I'm fairly sure the minimum is like 3.0 (or whatever is imposed by Thrust)! My main QuEST development was on a Quadro P6000, so a CC of 6.1 is tattooed into my brain ehehe. Do we object to a bump?? I'm not personally too offended because the min CC appearing on the NVIDIA site is now 7.5, but it is kind of cool to try to support old GPUs - especially if one believes consumer-end GPUs are becoming harder to obtain. Could we retain support for CC < 7.0 using a similar macro trick as above?

JPRichings · 2026-05-17T19:32:40Z

I agree macro guarding this will probably be the solution I need to implement this change in one of the kernels and retest to guarantee the performance holds and then confirm this works for Nvidia and AMD.

I agree keeping the GPU capability support as old as possible is best to support a wider set of users so I will explore adding this into the solution so we don't have to bump the compute capability if we don't need to.

This is to circumvent the std::vector performance overheads visible in few-qubit simulation (responsible for a performance regression from v3; see #720), and also so that qubit lists can be passed directly to CUDA kernels without conversion (as explored in #739).

TysonRayJones · 2026-05-17T21:50:17Z

Heads up that SmallList has reached devel

TysonRayJones · 2026-05-24T20:12:07Z

    return number;
 }

+INLINE qindex insertBits(qindex number, const int* bitIndices, int numIndices, int bitValue) {


Btw I incidentally made these same bitwise changes in #750, but where alle existing bitwise functions receive const pointers, as was necessary for SmallView in #754.

TysonRayJones · 2026-05-24T20:13:11Z

+    int ctrl_device[64];
+} ctrl_device_t;
+
+struct QubitList_t {


QubitList_t should be made entirely redundant by SmallList (which we can still rename)

TysonRayJones · 2026-05-29T01:10:14Z

Is there a blocker to this beyond further performance investigation and/or updating given my upstream changes? I'm happy to make a PR to this migrating it over to List64, and handle the aforementioned macros.

(Are you seeing a performance difference between ctrl_device_t and QubitList_t? I can switch all over to ConstList64 where possible)

JPRichings · 2026-05-29T08:51:58Z

No blocker technical just human time constraints I will get to this on Monday and have agreed a strategy with Oliver for merging this with everything else.

otbrown · 2026-06-01T15:05:54Z

Just noting here that we defer this to v4.3.1.

(Sidenote: semver says that the patch index is only for changes that fix "incorrect behaviour", but the minor version is only a MAY change if there are significant changes to the private behaviour, so where does that leave a performance patch? I suppose one could declare being much slower than it needs to be on GPU a bug, and therefore it's a bug fix...)

TysonRayJones · 2026-06-03T05:06:24Z

Hehe a patch index bump is inoffensive to me - they're "free" in my mental model ¯\(ツ)/¯

otbrown · 2026-06-03T10:22:02Z

I think a post Unitary Hack 4.3.1 makes sense then (assuming we've already got 4.3.0 out by then!)

* Fix applyMultiStateControlledSqrtSwap argument list (#738) Remove numControls argument from applyMultiStateControlledSqrtSwap overloaded definition taking std::vector<int> (cherry picked from commit 9c20792) Co-authored-by: D-Exposito <dexposito@cesga.es> * Trotterisation test update (#728) * tests/unit/trotterisation.cpp: updated to use REQUIRE_AGREE and cached statevecs and densmats, and both permutePaulis options * tests/utils/compare.hpp/cpp: added setters for test epsilon * tests/unit/trotterisation.cpp: adjusted test epsilon for quad precision imaginary time evolution tests * tests/unit/trotterisation.cpp: moved unitary time evo test to REQUIRE_AGREE * tests/utils/cache.hpp/cpp: added additional utilities for creating and destroying temp caches (which I guess makes them not caches?) with a set number of qubits * tests/unit/trotterisation.cpp: updated unitary time evo test to test across deployments * tests/unit/trotterisation.cpp: reduced number of qubits and increased number of steps to admit the possibility of testing density matrices too * tests/unit/trotterisation.cpp: added density matrix tests * reduce test precision to lazily pass CPU clang quad-precision * skip Trotter tests in paid CI * changing varname convention * renaming cache funcs --------- Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com> * added Daniel Patino to authorlist * CMake warn when non-release build (#742) --------- Co-authored-by: Oliver Thomson Brown <otbrown@users.noreply.github.com> * Stop Trotter funcs mutating PauliStrSum (#740) Formerly, the Trotter functions (such as applyTrotterizedPauliStrSumGadget()), when passed permutePaulis=true, would randomly permutate the order of the passed PauliStrSum, mutating it and affecting the outputs of subsequent functions like reportPauliStrSum(). The function also contained superfluous memory allocs/copies equal in size to the PauliStrSum. Now, the PauliStrSum is never mutated, and an internally allocated ordering list keeps track of the randomised permutation. We also updated the doc, renamed permutePaulis to permuteTerms, and improved validation. Note that 'permuteTerms' had not yet reached main/release, so these changes do not need to be documented in the v4.3 release notes. * Created custom backend complex types (#729) Created cpu_qcomp and gpu_qcomp (from a shared base_qcomp) to avoid std::complex arithmetic operators in hot loops which caused performance issues. Removed all prior compiler flags and related scaffolding attempting to mitigate the performance issue. Also gave MSVC build the params `/Zc:preprocessor -Xcompiler=/Zc:preprocessor /bigobj` as needed for compilation of the unit tests on my windows machines. * Replace vector<int> with SmallList (a stack array) (#743) This is to circumvent the std::vector performance overheads visible in few-qubit simulation (responsible for a performance regression from v3; see #720), and also so that qubit lists can be passed directly to CUDA kernels without conversion (as explored in #739). * Added few-qubit optimisations (#750) Optimisations include: - Adopted SmallView (const SmallList&) to avoid superfluous SmallList copies - Made internally created matrices static - Change accelerator dynamic function vectors to static arrays - Exit all validators early when validation is disabled Additional cleanup includes: - Tidied accelerator macros (replaced param-specific macros like "numCtrls" and "numTargs" with "param") - Fill ctrlStates vectors with default before localiser - Renamed getBitsFromInteger to setToBitsOfInteger - Adopted const in bitwise.hpp to better express intent Note that the naming of SmallList and SmallView will be subsequently changed to List64 and ConstList64 * Renamed debug API functions to contain "QuEST" (#752) * Renamed environment variables to begin with"QUEST" (#755) * Renamed CMake vars and preprocessors (#756) such that they all begin with QUEST, but some have additional changes * Renamed Small(List|View) to (Const)List64 (#757) * Defer Catch2 test discovery so that we can compile MPI tests on systems which cannot actually run with MPI, because they are missing an MPI or UCX library file, as is witnessed in the CI (when compiling with MPICH). It's generally irksome too to trigger an execution of the test binary (which itself initialises QuEST) during build when on a HPC platform with distinct submit and compute nodes * Enable user to take ownership of MPI (#722) * Added ENABLE_SUBCOMM build option * Moved from MPI_COMM_WORLD to mpiQuestComm * Decided passing *MPI_Comm was probably overly cautious, and updated function name to comm_getMpiComm * environment.cpp: added methods to reset rank and numNodes, and reporting for subcomm compiled * comm_config.hpp/cpp: added comm_setMpiComm * CMakeLists.txt: PUBLIC MPI::MPI_CXX turned out to be unhelpful, even for SubComm, because of course it enforces CXX * Added new custom QuESTEnv initialiser which allow user to positively declare that they take ownership of MPI * validation.cpp: updated comm_end call * comm_config.hpp: added config.h include so COMPILE_MPI is actually defined * subcommunicator.h/cpp: implemented QuESTEnv initialiser with custom MPI_Comm * CMake: added subcommunicator.cpp * comm_config.hpp: added missing config.h include... * comm_config.cpp: explicitly initialise mpiCommQuest to MPI_COMM_NULL, updated setComm for init only workflow * quest.h: added subcommunicator header * CMake: added MPI to application binaries when SUBCOMM is enabled * comm_routines.cpp: post Irecv before Isend which probably won't do anything but it makes MPI library implementers less nervous * tests: added new env test for initCustomMpiQuESTEnv * Added error throws to comm_config to cover new scenarios of badness with user owned MPI * subcommunicator.cpp: updated var names to match QuEST style * tests/unit/initialisations.cpp: slightly modified setQuregAmps test to avoid unexpected test failure due to range checking when compild in Debug configuration * Updated validation in comm_setMpiComm Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es> * userOwnsMpi int->bool * comm_config.cpp: corrected call to MPI_Comm_free * subcommunicator.cpp: userOwnsMpi int->bool * subcommunicator.cpp: added comm_isInit guard around comm_setMpiComm * environment.cpp: USER_OWNS_MPI -> userOwnsMpi * comm_init: fixed case where useDistrib = 0 and userOwnsMpi = true * comm_init: moved (recently) misplaced MPI_Init * AUTHORS.txt: added iarejula-bsc * Added placeholder docstrings to new initialisers * docs/cmake.md: added ENABLE_SUBCOMM to list of QuEST CMake vars * Newly added COMPILE_MPI -> QUEST_COMPILE_MPI * ENABLE_SUBCOMM -> QUEST_ENABLE_SUBCOMM * CMake: corrected OpenMP and subcommunicator pre-processor definitions --------- Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es> * Add flush and sync around prints (#763) to reduce the likelihood of users printing from non-root nodes interrupting QuEST root output. This is not bullet-proof; we sync the active communicator rather than MPI_COMM_WORLD so the user-controlled non-participating processes may still be printing. Furthermore, even if all processes participate, some may have outstanding non-root prints that are not aggregated to the user screen by the time MPI_Barrier finishes. But these syncs greatly reduce the change of corruption, and are effectively free! * Add GPU-aware MPICH detection This enables CRAY MPICH platforms to leverage GPU-awareness, greatly accelerating distributed GPU simulation Co-authored-by: JPRichings <james.richings@ed.ac.uk> * Cleanup custom MPI flow (#762) Important changes: - permit user initialisation of MPI when QuEST is not distributed - changed QuESTEnv fields bool from int (e.g. isMultithreaded) - add user-input validation for custom MPI calls - disambiguated comm_config.cpp concepts of "MPI is initialised" (comm_isMpiInit) from "QuEST communication is active" (comm_isActive) - refactored comm_config.cpp flow, especially related to pre-quest-init flow (during validation) - added Oliver's custom-MPI examples (from #712) - moved new API functions to experimental.h - tweaked reportQuESTEnv output grouping * Added user-control of GPU num threads per block (#736) Added: - QUEST_DEFAULT_NUM_GPU_THREADS_PER_BLOCK CMake option - QUEST_DEFAULT_NUM_GPU_THREADS_PER_BLOCK environment variable - setQuESTNumGpuThreadsPerBlock() API function - getQuESTNumGpuThreadsPerBlock() API function - set_num_gpu_threads examples in examples/extended --------- Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com> * Fix compiler warnings (#770) Beware this included removing the superfluous `numControls` argument from the C++only `std::vector` overload of `applyMultiStateControlledCompMatr2`, which is technically a teeny tiny API break ¯\_(ツ)_/¯ * tests/unit/debug.cpp: updated setQuESTSeeds validation tests to include new validation (#771) Updated number of seeds test to use a valid pointer and added a separate NULL pointer test. * Fix Windows CI test_free.yml: added Release config to ctest commands (#773) --------- Co-authored-by: D-Exposito <dexposito@cesga.es> Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com> Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es> Co-authored-by: JPRichings <james.richings@ed.ac.uk>

JPRichings · 2026-06-03T12:08:57Z

I agree we push this to 4.3.1 so we get 4.3 out the door.

TysonRayJones and others added 5 commits May 28, 2025 03:06

v4.1 (QuEST-Kit#622)

2d1a6fb

post v4.1

4d44ec8

Updates version references and fixes doxygen parsing of CMake configured header broken in QuEST-Kit#616

v4.2 (QuEST-Kit#692)

9d7618d

update to remove implicit thrust vector host to device copy

0acd9b9

remove thrust devic_vector

d7bf558

JPRichings mentioned this pull request May 1, 2026

Weaning off std::complex arithmetic #729

Merged

TysonRayJones reviewed May 3, 2026

View reviewed changes

alternative approach where capture array in struct to remove explict …

324e5be

…memory copy

JPRichings added 12 commits May 9, 2026 20:31

identify opportutnities for performance improvement

e04dc1f

kernel_statevec_anyCtrlSwap_subA performance change

abf0387

update kernel_statevec_anyCtrlSwap_subB for performance

a3cd184

update kernel_statevec_anyCtrlSwap_subC for performance

5dd472d

update kernel_statevec_anyCtrlOneTargDenseMatr_subB for performance

a23b231

update kernel_statevec_anyCtrlTwoTargDenseMatr_sub for performance

6ea9d76

update kernel_statevec_anyCtrlFewTargDenseMatr for performance

5ef2d27

update kernel_statevector_anyCtrlPauliTensorOrGadget_subA for perform…

a353450

…ance

update kernel_statevector_anyCtrlPauliTensorOrGadget_subA for perform…

1f57274

…ance

update kernel_statevector_anyCtrlPauliTensorOrGadget_subB for perform…

6604823

…ance

update kernel_densmatr_partialTrace_sub for performance

ca95173

update kernel_statevec_calcProbsOfAllMultiQubitOutcomes_sub for perfo…

ec71af6

…rmance

TysonRayJones mentioned this pull request May 10, 2026

Replace vector<int> with stack-list #743

Merged

This was referenced May 10, 2026

Performance of Thrust kernels #744

Open

Performance of applyMultiControlled and applyMultiStateControlled #745

Open

TysonRayJones mentioned this pull request May 11, 2026

Profile and optimise-away small GPU allocations #749

Open

JPRichings and others added 3 commits May 11, 2026 16:26

Merge branch 'QuEST-Kit:main' into ctrls_performance

c728427

Merge branch 'devel' into ctrls_performance

10f0e6b

fix merge issues

54ce669

JPRichings mentioned this pull request May 18, 2026

Replace vector<int> with SmallList (a stack array) (#743) JPRichings/QuEST#1

Open

TysonRayJones reviewed May 24, 2026

View reviewed changes

otbrown mentioned this pull request May 28, 2026

v4.3 release #726

Draft

17 tasks

Conversation

JPRichings commented Apr 28, 2026

Uh oh!

JPRichings commented Apr 29, 2026

Uh oh!

otbrown commented May 3, 2026

Uh oh!

TysonRayJones May 3, 2026

Choose a reason for hiding this comment

Uh oh!

TysonRayJones commented May 3, 2026

Uh oh!

JPRichings commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TysonRayJones commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 6, 2026

Uh oh!

TysonRayJones commented May 10, 2026

Uh oh!

JPRichings commented May 10, 2026

Uh oh!

JPRichings commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TysonRayJones commented May 17, 2026

Uh oh!

JPRichings commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TysonRayJones commented May 17, 2026

Uh oh!

TysonRayJones May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TysonRayJones May 24, 2026

Choose a reason for hiding this comment

Uh oh!

TysonRayJones commented May 29, 2026

Uh oh!

JPRichings commented May 29, 2026

Uh oh!

otbrown commented Jun 1, 2026

Uh oh!

TysonRayJones commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

otbrown commented Jun 3, 2026

Uh oh!

JPRichings commented Jun 3, 2026

JPRichings commented May 5, 2026 •

edited

Loading

JPRichings commented May 5, 2026 •

edited

Loading

JPRichings commented May 5, 2026 •

edited

Loading

TysonRayJones commented May 6, 2026 •

edited

Loading

JPRichings commented May 6, 2026 •

edited

Loading

JPRichings commented May 6, 2026 •

edited

Loading

JPRichings commented May 10, 2026 •

edited

Loading

JPRichings commented May 10, 2026 •

edited

Loading

JPRichings commented May 10, 2026 •

edited

Loading

JPRichings commented May 12, 2026 •

edited

Loading

JPRichings commented May 12, 2026 •

edited

Loading

JPRichings commented May 14, 2026 •

edited

Loading

JPRichings commented May 17, 2026 •

edited

Loading

TysonRayJones May 24, 2026 •

edited

Loading

TysonRayJones commented Jun 3, 2026 •

edited

Loading