Skip to content

Weaning off std::complex arithmetic#729

Merged
TysonRayJones merged 21 commits into
develfrom
custom-complex-types
May 9, 2026
Merged

Weaning off std::complex arithmetic#729
TysonRayJones merged 21 commits into
develfrom
custom-complex-types

Conversation

@TysonRayJones
Copy link
Copy Markdown
Member

@TysonRayJones TysonRayJones commented Apr 17, 2026

Experiment with fully custom cpu_qcomp and gpu_qcomp types to avoid...

  • compiler-specific flags to overcome qcomp = std::complex performance pitfalls
  • HIP-specific functions to overcome cu_qcomp correctness pitfalls

@TysonRayJones TysonRayJones marked this pull request as draft April 17, 2026 19:54
just to isolate this big diff - structural changes to gpu_qcomp are coming
and changed &arr[ind] to arr + ind, for visual clarity
but there's so much boilerplate overlap with cpu_qcomp, I wonder if we should unify the two! We can retain separate cpu_qcomp and gpu_qcomp types (for clarity) through a typedef
compiler stack-overflows when OpenMP is enabled - possibly due to thread-private instantiation of this 2D array?
to debug MSVC + OpenMP compilation failure in CI
story of my life bruddah
nice one chatgpt
@TysonRayJones
Copy link
Copy Markdown
Member Author

@JPRichings @otbrown Here's a first draft of the proposed qcomp changes (ignore that it provokes a stack-overflow bug in multithreaded MSVC, grr). One only really needs to look at cpu_types.hpp and gpu_types.cuh, and at an example of how cpu_qcomp is used within cpu_subroutines.cpp, like here.

You'll see most of the cpu_types.hpp and gpu_types.cuh boilerplate is identical, merely expressing the arithmetic operators of (c|g)pu_qcomp via their qreal components. I propose further simplification by defining a new complex.hpp (or similarly named, perhaps basetypes.hpp?) in core/ (rather than in cpu/ or gpu/) which defines all the operator overloads of a templated type, and (c|g)pu_qcomp gets defined in a typedef of that "ancestor type" in cpu/ or gpu/. Then adding a new operator overload to all backends is trivial (just add one function to complex.hpp), as is adding a backend specific function (define it exclusively within cpu_types.hpp or gpu_types.hpp). Noting this comment here in case you object to the unification (so I can see what to roll-back if I find time to implement beforehand - which is alas unlikely!)

@otbrown
Copy link
Copy Markdown
Collaborator

otbrown commented Apr 20, 2026

Thanks Tyson!

I have 5(!) hours of meetings tomorrow, so unlikely to get to this then, but will make time later in the week!

@otbrown
Copy link
Copy Markdown
Collaborator

otbrown commented Apr 21, 2026

Hi Tyson, one of my meetings got cancelled 🥳

Moving away from the standard library entirely is philosphically upsetting, but clearly has practical benefits!

I propose further simplification by defining a new complex.hpp

I strongly support this proposal. Structurally it would be nice to have the common interface in core to then be specialised/overwritten in cpu/gpu. I am still a bit nervous about needing to maintain a set of maths functions but if we can also contain them in one place in core (relying on the concrete implementations of arithmetic primitives elsewhere if needed), that would be good.

 * TODO:
 * OLD UNPACKERS
 * which I am hestitant to switch to the CPU-style until I better
 * understand why the explicit gpu_qcomp instantiation is necessary
 * (iirc static HIP structs have a different alignment than qcomp?!)

If it's helpful I can volunteer @eessmann as tribute a second pair of eyes and hands to dig into this? We now have a task in QATCH (phase 2 of the Quantum Software Lab) which is broadly "help maintain QuEST (and other scalable emulators)", that Erich is assigned to.

@TysonRayJones
Copy link
Copy Markdown
Member Author

Moving away from the standard library entirely is philosphically upsetting, but clearly has practical benefits!

Yea it's outrageous but at least it won't keep us up at night about esoteric performance pitfalls! (We'll trade that for alignment nightmares)

I strongly support this proposal. Structurally it would be nice to have the common interface in core to then be specialised/overwritten in cpu/gpu. I am still a bit nervous about needing to maintain a set of maths functions but if we can also contain them in one place in core (relying on the concrete implementations of arithmetic primitives elsewhere if needed), that would be good.

Made common definitions in the previous commit. We can still discretionarily fallback to the std::complex operators within the CPU custom maths, paying the NaN performance penalties, when convenient (as pow in cpu_qcomp.hpp currently does).

If it's helpful I can volunteer @eessmann as tribute a second pair of eyes and hands to dig into this? We now have a task in QATCH (phase 2 of the Quantum Software Lab) which is broadly "help maintain QuEST (and other scalable emulators)", that Erich is assigned to.

That'd be super helpful! I preserved the method (used in relation to that comment) in the GPU backend, so I'm hoping things just work on HIP. I don't have an AMD machine handy to test on, so it would be terrific if Erich can give it a whirl! I can also re-setup the (paid) GPU Github Action runners if billing has been sorted out.

Note the new cpu_qcomp type might introduce a similar issue (misalignment of static arrays); currently, the qcomp[2] and qcomp[4] within a DiagMatr1 and DiagMatr2 are being converted/decayed to cpu_qcomp pointers. An explicit copy might be necessary, like in the GPU case, but I've so far been unable to trigger the alignment problem.

@TysonRayJones
Copy link
Copy Markdown
Member Author

You can see the custom backend-specific maths/overloads here and here - so far, it's just pow!

@eessmann
Copy link
Copy Markdown
Contributor

Heya @TysonRayJones,
Happy to help, I have access to HIP on Dirac so I give it a test on MI300s. Out of curiosity, why not go with thrust complex for our gpu complex data type?

@TysonRayJones
Copy link
Copy Markdown
Member Author

Happy to help, I have access to HIP on Dirac so I give it a test on MI300s.

Wew that'd be brilliant!

Out of curiosity, why not go with thrust complex for our gpu complex data type?

It's a good question; previously, because using thrust complex inside CUDA kernels has some pitfalls and (iirc) didn't help solve the operator overload issues already seen of cuComplex (where HIP/rocThrust defines arithmetic operator overloads but CUDA/Thrust doesn't). It was my original intention to use thrust complex for QuEST 4.0, but was forced to switch to cuComplex (although I've forgotten precisely why).

With this PR however, there is a more compelling reason not to use thrust-complex; we must already define a custom complex type for the CPU backend to avoid compiler-specific performance pitfalls. Since we have to define its arithmetic overloads, we may as well re-use this (base) type for the GPU backend too, for better symmetry. We should now never have any issues related to operator overloads on particular platforms like HIP (but I do live in fear of alignment issues, as alluded to!)

so it's the first thing a developer sees, reducing likelihood they inefficiently use qcomp overloads
@otbrown
Copy link
Copy Markdown
Collaborator

otbrown commented Apr 23, 2026

We can still discretionarily fallback to the std::complex operators within the CPU custom maths, paying the NaN performance penalties, when convenient

Since Erich's identified the relevant fortran-rules flag for all the main compilers, there should be no penalty! Can always just add the flag to the specific relevant source file(s) since it won't be needed elsewhere. Of course if we can't identify the compiler we can just omit.

I can also re-setup the (paid) GPU Github Action runners if billing has been sorted out.

Sadly, we don't have a budget we could use for that, but one of the three of us (Me, Erich, @JPRichings) can be the CI runner 😉

@JPRichings
Copy link
Copy Markdown
Contributor

JPRichings commented Apr 24, 2026

+ state:

V3:

void statevec_initPlusState(Qureg qureg)

V4.2:

void initPlusState(Qureg qureg) {

void localiser_statevec_initUniformState(Qureg qureg, qcomp amp) {

void accel_statevec_initUniformState_sub(Qureg qureg, qcomp amp) {

void gpu_statevec_initUniformState_sub(Qureg qureg, qcomp amp) {

void thrust_statevec_initUniformState(Qureg qureg, cu_qcomp amp) {

Given that the performance is comparable here and the kernel is very simple and is just setting values in the state vector probably points to how some of the changes to gpu kernels in v4. I assume there is a common pattern for gate application across the gates and something there is adding additional overhead that wasn't present in v3?

I need to check but I guess many of the gates end up here:

__global__ void kernel_statevec_anyCtrlAnyTargDiagMatr_sub(

@JPRichings
Copy link
Copy Markdown
Contributor

JPRichings commented Apr 24, 2026

Really dumb guess based on eye balling the code is that:

INLINE qindex insertBitsWithMaskedValues(qindex number, int* bitInds, int numBits, qindex mask) {

is common in most of the statevector kernels and is called by each thread.

hmmm for loop in GPU kernel...

INLINE qindex insertBits(qindex number, int* bitIndices, int numIndices, int bitValue) {

@JPRichings
Copy link
Copy Markdown
Contributor

I also notice that this masking logic is also present in the CPU side changes as noted in the examples here: #638

with insertBitsWithMaskedValues in a parallel region in the v4 CPU side code.

The performance discrepancy in #720 also seems present in the same range of qubits < 20...

Profiling required!

@JPRichings
Copy link
Copy Markdown
Contributor

hmmm...

@JPRichings
Copy link
Copy Markdown
Contributor

JPRichings commented Apr 24, 2026

Ok so after a fair amount of digging I think we are moving data from host to device just before every kernel call that has ctrls.

Until this line,

devints deviceCtrls = util_getSorted(ctrls);

I think Ctrls are host side data but on the return of util_getSorted they are placed in a devints type which is a thrust device_vector this causes a host to device copy of data just before the call to the kernel function.

using devints = thrust::device_vector<int>;

int* getPtr(devints& qubits) {

    return thrust::raw_pointer_cast(qubits.data());
}

getPtr is called in the call to the kernel function:

getPtr(deviceCtrls), ctrls.size(), ctrlStateMask, getPtr(deviceTargs), targs.size(),

Comparing this to V3.7 is the ctrl mask was captured by value in the kernel launch as a long long int so transferred in kernel call to device (I think):

void statevec_multiControlledTwoQubitUnitary(Qureg qureg, long long int ctrlMask, int q1, int q2, ComplexMatrix4 u)

and lived in constant memory which will probably have less of a performance impact than relying on thrust to move the data on creation of the vector.

@JPRichings
Copy link
Copy Markdown
Contributor

JPRichings commented Apr 24, 2026

report1.nsys-rep.txt
nsight systems profile of a qft circuit quest V4:

image

cuda malloc: ~ 230 microsecs
cuda memcpyasync: ~ 8 microsec
Kernel call: ~ 9 microsec

Run on grace-hopper node so mem copy from host to device will be really quick and much faster than in above results.

qft output:

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 1 with probability 0.5
Qubit 1 measured in state 1 with probability 0.5
Qubit 2 measured in state 1 with probability 0.5
Qubit 3 measured in state 1 with probability 0.5
Qubit 4 measured in state 1 with probability 0.5
Qubit 5 measured in state 1 with probability 0.5
Qubit 6 measured in state 1 with probability 0.5
Qubit 7 measured in state 0 with probability 0.5
Qubit 8 measured in state 1 with probability 0.5
Qubit 9 measured in state 1 with probability 0.5
Qubit 10 measured in state 1 with probability 0.5
Qubit 11 measured in state 1 with probability 0.5
Qubit 12 measured in state 1 with probability 0.5
Qubit 13 measured in state 1 with probability 0.5
Qubit 14 measured in state 1 with probability 0.5
Qubit 15 measured in state 1 with probability 0.5
Qubit 16 measured in state 1 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 1 with probability 0.5
Qubit 19 measured in state 1 with probability 0.5

Final state:
|11111110111111111011>
QFT run time: 0.0691252s
Total run time: 0.386635s
Generating '/tmp/nsys-report-50e2.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:

@TysonRayJones
Copy link
Copy Markdown
Member Author

Apologies for delay - everything's on fire, I'll likely not have time to return until next weekend.
Your analysis looks right to me!

I think Ctrls are host side data but on the return of util_getSorted they are placed in a devints type which is a thrust device_vector this causes a host to device copy of data just before the call to the kernel function.

Absolutely right - a common Thrust paradigm but a gross pitfall in my opinion. I hate the implicit staging! Eager to change this if this is such an opportunity.

cuda malloc: ~ 230 microsecs
cuda memcpyasync: ~ 8 microsec
Kernel call: ~ 9 microsec

Ahh nicely done! The logic for handling controls differs between v3 and v4; the latter is asymptotically superior.
In v3, we enumerate every local amplitude and skip those failing the control conditions (permits the ctrls to be passed as a simple bitmask).
In v4, we enumerate strictly those amplitudes which satisfy the control condition. This leads to a factor 1/2^num_controls fewer iterations (a big asymptotic saving!) but so far requires the controls are kept in a sorted list (for bitwise insertion). You've pointed out re-allocating the GPU memory to stage that list is extremely non-negligible, eep!

It'd be amazing if we can perform that "enumerate only control-satisfying amps" logic using a control bitmask, possibly using these primitives?? Otherwise, we can easily maintain a persistent negligibly-sized device workspace, like we already do here.

@TysonRayJones
Copy link
Copy Markdown
Member Author

Noting though that we stil see a performance regression when there are no control qubits - could it be that a superfluous zero-size allocation is happening in that scenario? 😅 Certainly there's no algorithmic overhead because of the NumCtrls templating (I thiiiiiink)

@JPRichings
Copy link
Copy Markdown
Contributor

I think I have a fix where we move the Ctrls over to device into constant memory using cudaMemcpyToSymbol. Just profiling now. Looks fast could be wrong.

@JPRichings
Copy link
Copy Markdown
Contributor

I caution this with I could have made a horrible error in how I am doing the data transfer and messed up the correctness but based on the attached profile the performance is much improved:

image

The call to memcopy to symbol only takes 200ns!

The kernel now takes: ~ 2-3 micro seconds (probably as accessing constant read only memory is much quicker)

There is now no allocation!

Profile: report_symbol.nsys-rep.txt

Great caution until I confirm correctness on monday!

Given there is another large cuda malloc call (~250 microsec) still in the profile I think there is more performance to be had in this.

Hardware target same grace-hopper node as before.

qft output:

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 0 with probability 0.5
Qubit 1 measured in state 0 with probability 0.5
Qubit 2 measured in state 0 with probability 0.5
Qubit 3 measured in state 1 with probability 0.5
Qubit 4 measured in state 1 with probability 0.5
Qubit 5 measured in state 0 with probability 0.5
Qubit 6 measured in state 0 with probability 0.5
Qubit 7 measured in state 1 with probability 0.5
Qubit 8 measured in state 0 with probability 0.5
Qubit 9 measured in state 1 with probability 0.5
Qubit 10 measured in state 1 with probability 0.5
Qubit 11 measured in state 0 with probability 0.5
Qubit 12 measured in state 1 with probability 0.5
Qubit 13 measured in state 0 with probability 0.5
Qubit 14 measured in state 0 with probability 0.5
Qubit 15 measured in state 1 with probability 0.5
Qubit 16 measured in state 0 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 0 with probability 0.5
Qubit 19 measured in state 0 with probability 0.5

Final state:
|00011001011010010000>
QFT run time: 0.00877581s
Total run time: 0.318082s
Generating '/tmp/nsys-report-e8a0.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:

@JPRichings
Copy link
Copy Markdown
Contributor

Same issue in:

void gpu_statevec_anyCtrlOneTargDenseMatr_subA(Qureg qureg, vector<int> ctrls, vector<int> ctrlStates, int targ, CompMatr1 matr) {

22 instances of devints in gpu_subroutines.cpp my guess is that each one will be causing the same issue.

@JPRichings
Copy link
Copy Markdown
Contributor

JPRichings commented Apr 25, 2026

Further improvement if 1 additional kernel updated: kernel_statevec_anyCtrlOneTargDenseMatr_subA

Same qft executable same grace-hopper hardware.

Profile:
report_symbol2.nsys-rep.txt

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 1 with probability 0.5
Qubit 1 measured in state 1 with probability 0.5
Qubit 2 measured in state 1 with probability 0.5
Qubit 3 measured in state 0 with probability 0.5
Qubit 4 measured in state 0 with probability 0.5
Qubit 5 measured in state 1 with probability 0.5
Qubit 6 measured in state 0 with probability 0.5
Qubit 7 measured in state 1 with probability 0.5
Qubit 8 measured in state 0 with probability 0.5
Qubit 9 measured in state 0 with probability 0.5
Qubit 10 measured in state 0 with probability 0.5
Qubit 11 measured in state 0 with probability 0.5
Qubit 12 measured in state 0 with probability 0.5
Qubit 13 measured in state 0 with probability 0.5
Qubit 14 measured in state 1 with probability 0.5
Qubit 15 measured in state 0 with probability 0.5
Qubit 16 measured in state 1 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 1 with probability 0.5
Qubit 19 measured in state 0 with probability 0.5

Final state:
|11100101000000101010>
QFT run time: 0.00287859s
Total run time: 0.318814s

@JPRichings
Copy link
Copy Markdown
Contributor

Note to check gpu_thrust.cuh as

devints devQubits = qubits;

qreal thrust_statevec_calcProbOfMultiQubitOutcome_sub(Qureg qureg, vector<int> qubits, vector<int> outcomes) {

Causing applyQubitMeasurementAndGetProb to be impacted.

@TysonRayJones
Copy link
Copy Markdown
Member Author

Wew super neat to have this addressed, nice one! 🎉

For this PR, I'll proceed under the assumption there's no CUDA performance impact of gpu_qcomp. Still need to...

  • check this is true with HIP also (I'd bet fine!)
  • check cpu_qcomp has no performance impact (i.e. do all my tests again for CPU, with and without multithreading).

The pain of these manual cross-platform performance regression checks begs for some automated benchmarker - but it's of course extremely unsexy work. Maybe a good unitaryHack bounty, or a chore for someone interested in experimenting with AI sloppers ¯\(ツ)

@otbrown
Copy link
Copy Markdown
Collaborator

otbrown commented Apr 30, 2026

begs for some automated benchmarker

We've been pondering this a bit -- it might be something we can set up through EIDF. I've also been considering reviewing all tests to confirm they're MPI-safe and then running MPI tests through EIDF as well.

All very much post-v4.3 activity though I think!

@JPRichings
Copy link
Copy Markdown
Contributor

This is the pull request that corrected the performance issues on nvidia not tested on AMD #739

@TysonRayJones
Copy link
Copy Markdown
Member Author

Note to self: can probably verify cpu_qcomp performance using Meisam's scripts in #720

@JPRichings JPRichings mentioned this pull request May 5, 2026
@TysonRayJones
Copy link
Copy Markdown
Member Author

Serial MacOS ARM tests confirm this achieves the same performance as the -Ofast flags

@TysonRayJones
Copy link
Copy Markdown
Member Author

TysonRayJones commented May 7, 2026

After disabling the relevant v4 middle-end overheads (see issue #720), v4 multithreaded performance (8 threads on a Mac M4 Pro) matches v3. Note it outperforms v3 in the small end because of the v4 autodeployer which auto-disables multithreaded process of <8-qubit statevectors 🎉

BenchTime_QFT_H

All my concerns with this (c|g)pu_qcomp refactor have been addressed, except AMD/HIP correctness/performance. Should really check my tomfoolery with stack-memory casting is okay! I might be able to get my hands on an AMD eGPU quickly though 🙏

@TysonRayJones
Copy link
Copy Markdown
Member Author

Confirmed it all runs fine on HIP! 🎉 Doing final checks of stack-alloc memory casting (which may have funky alignment pitfalls)

@TysonRayJones TysonRayJones marked this pull request as ready for review May 9, 2026 17:26
@TysonRayJones TysonRayJones merged commit 1fbaec4 into devel May 9, 2026
144 of 145 checks passed
@TysonRayJones TysonRayJones deleted the custom-complex-types branch May 9, 2026 17:32
@otbrown otbrown mentioned this pull request Jun 3, 2026
otbrown added a commit that referenced this pull request Jun 3, 2026
* Fix applyMultiStateControlledSqrtSwap argument list (#738)

Remove numControls argument from applyMultiStateControlledSqrtSwap overloaded definition taking std::vector<int>

(cherry picked from commit 9c20792)

Co-authored-by: D-Exposito <dexposito@cesga.es>

* Trotterisation test update (#728)

* tests/unit/trotterisation.cpp: updated to use REQUIRE_AGREE and cached statevecs and densmats, and both permutePaulis options

* tests/utils/compare.hpp/cpp: added setters for test epsilon

* tests/unit/trotterisation.cpp: adjusted test epsilon for quad precision imaginary time evolution tests

* tests/unit/trotterisation.cpp: moved unitary time evo test to REQUIRE_AGREE

* tests/utils/cache.hpp/cpp: added additional utilities for creating and destroying temp caches (which I guess makes them not caches?) with a set number of qubits

* tests/unit/trotterisation.cpp: updated unitary time evo test to test across deployments

* tests/unit/trotterisation.cpp: reduced number of qubits and increased number of steps to admit the possibility of testing density matrices too

* tests/unit/trotterisation.cpp: added density matrix tests

* reduce test precision

to lazily pass CPU clang quad-precision

* skip Trotter tests in paid CI

* changing varname convention

* renaming cache funcs

---------

Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com>
Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com>

* added Daniel Patino to authorlist

* CMake warn when non-release build (#742)


---------

Co-authored-by: Oliver Thomson Brown <otbrown@users.noreply.github.com>

* Stop Trotter funcs mutating PauliStrSum (#740)

Formerly, the Trotter functions (such as applyTrotterizedPauliStrSumGadget()), when passed permutePaulis=true, would randomly permutate the order of the passed PauliStrSum, mutating it and affecting the outputs of subsequent functions like reportPauliStrSum(). The function also contained superfluous memory allocs/copies equal in size to the PauliStrSum.

Now, the PauliStrSum is never mutated, and an internally allocated ordering list keeps track of the randomised permutation. We also updated the doc, renamed permutePaulis to permuteTerms, and improved validation. Note that 'permuteTerms' had not yet reached main/release, so these changes do not need to be documented in the v4.3 release notes.

* Created custom backend complex types (#729)

Created cpu_qcomp and gpu_qcomp (from a shared base_qcomp) to avoid std::complex arithmetic operators in hot loops which caused performance issues. Removed all prior compiler flags and related scaffolding attempting to mitigate the performance issue.

Also gave MSVC build the params `/Zc:preprocessor -Xcompiler=/Zc:preprocessor /bigobj` as needed for compilation of the unit tests on my windows machines.

* Replace vector<int> with SmallList (a stack array) (#743)

This is to circumvent the std::vector performance overheads visible in few-qubit simulation (responsible for a performance regression from v3; see #720), and also so that qubit lists can be passed directly to CUDA kernels without conversion (as explored in #739).

* Added few-qubit optimisations (#750)

Optimisations include:
- Adopted SmallView (const SmallList&) to avoid superfluous SmallList copies
- Made internally created matrices static
- Change accelerator dynamic function vectors to static arrays
- Exit all validators early when validation is disabled

Additional cleanup includes:
- Tidied accelerator macros (replaced param-specific macros like "numCtrls" and "numTargs" with "param")
- Fill ctrlStates vectors with default before localiser
- Renamed getBitsFromInteger to setToBitsOfInteger
- Adopted const in bitwise.hpp to better express intent

Note that the naming of SmallList and SmallView will be subsequently changed to List64 and ConstList64

* Renamed debug API functions to contain "QuEST" (#752)

* Renamed environment variables to begin with"QUEST" (#755)

* Renamed CMake vars and preprocessors (#756)

such that they all begin with QUEST, but some have additional changes

* Renamed Small(List|View) to (Const)List64 (#757)

* Defer Catch2 test discovery

so that we can compile MPI tests on systems which cannot actually run with MPI, because they are missing an MPI or UCX library file, as is witnessed in the CI (when compiling with MPICH). It's generally irksome too to trigger an execution of the test binary (which itself initialises QuEST) during build when on a HPC platform with distinct submit and compute nodes

* Enable user to take ownership of MPI (#722)

* Added ENABLE_SUBCOMM build option

* Moved from MPI_COMM_WORLD to mpiQuestComm

* Decided passing *MPI_Comm was probably overly cautious, and updated function name to comm_getMpiComm

* environment.cpp: added methods to reset rank and numNodes, and reporting for subcomm compiled

* comm_config.hpp/cpp: added comm_setMpiComm

* CMakeLists.txt: PUBLIC MPI::MPI_CXX turned out to be unhelpful, even for SubComm, because of course it enforces CXX

* Added new custom QuESTEnv initialiser which allow user to positively declare that they take ownership of MPI

* validation.cpp: updated comm_end call

* comm_config.hpp: added config.h include so COMPILE_MPI is actually defined

* subcommunicator.h/cpp: implemented QuESTEnv initialiser with custom MPI_Comm

* CMake: added subcommunicator.cpp

* comm_config.hpp: added missing config.h include...

* comm_config.cpp: explicitly initialise mpiCommQuest to MPI_COMM_NULL, updated setComm for init only workflow

* quest.h: added subcommunicator header

* CMake: added MPI to application binaries when SUBCOMM is enabled

* comm_routines.cpp: post Irecv before Isend which probably won't do anything but it makes MPI library implementers less nervous

* tests: added new env test for initCustomMpiQuESTEnv

* Added error throws to comm_config to cover new scenarios of badness with user owned MPI

* subcommunicator.cpp: updated var names to match QuEST style

* tests/unit/initialisations.cpp: slightly modified setQuregAmps test to avoid unexpected test failure due to range checking when compild in Debug configuration

* Updated validation in comm_setMpiComm

Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es>

* userOwnsMpi int->bool

* comm_config.cpp: corrected call to MPI_Comm_free

* subcommunicator.cpp: userOwnsMpi int->bool

* subcommunicator.cpp: added comm_isInit guard around comm_setMpiComm

* environment.cpp: USER_OWNS_MPI -> userOwnsMpi

* comm_init: fixed case where useDistrib = 0 and userOwnsMpi = true

* comm_init: moved (recently) misplaced MPI_Init

* AUTHORS.txt: added iarejula-bsc

* Added placeholder docstrings to new initialisers

* docs/cmake.md: added ENABLE_SUBCOMM to list of QuEST CMake vars

* Newly added COMPILE_MPI -> QUEST_COMPILE_MPI

* ENABLE_SUBCOMM -> QUEST_ENABLE_SUBCOMM

* CMake: corrected OpenMP and subcommunicator pre-processor definitions

---------

Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com>
Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es>

* Add flush and sync around prints (#763)

to reduce the likelihood of users printing from non-root nodes interrupting QuEST root output. This is not bullet-proof; we sync the active communicator rather than MPI_COMM_WORLD so the user-controlled non-participating processes may still be printing. Furthermore, even if all processes participate, some may have outstanding non-root prints that are not aggregated to the user screen by the time MPI_Barrier finishes. But these syncs greatly reduce the change of corruption, and are effectively free!

* Add GPU-aware MPICH detection

This enables CRAY MPICH platforms to leverage GPU-awareness, greatly accelerating distributed GPU simulation

Co-authored-by: JPRichings <james.richings@ed.ac.uk>

* Cleanup custom MPI flow (#762)

Important changes:
- permit user initialisation of MPI when QuEST is not distributed
- changed QuESTEnv fields bool from int (e.g. isMultithreaded)
- add user-input validation for custom MPI calls
- disambiguated comm_config.cpp concepts of "MPI is initialised" (comm_isMpiInit) from "QuEST communication is active" (comm_isActive)
- refactored comm_config.cpp flow, especially related to pre-quest-init flow (during validation)
- added Oliver's custom-MPI examples (from #712)
- moved new API functions to experimental.h
- tweaked reportQuESTEnv output grouping

* Added user-control of GPU num threads per block (#736)

Added:
- QUEST_DEFAULT_NUM_GPU_THREADS_PER_BLOCK CMake option
- QUEST_DEFAULT_NUM_GPU_THREADS_PER_BLOCK environment variable
- setQuESTNumGpuThreadsPerBlock() API function
- getQuESTNumGpuThreadsPerBlock() API function
- set_num_gpu_threads examples in examples/extended

---------

Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com>
Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com>

* Fix compiler warnings (#770)

Beware this included removing the superfluous `numControls` argument from the C++only `std::vector` overload of `applyMultiStateControlledCompMatr2`, which is technically a teeny tiny API break ¯\_(ツ)_/¯

* tests/unit/debug.cpp: updated setQuESTSeeds validation tests to include new validation (#771)

Updated number of seeds test to use a valid pointer and added a separate NULL pointer test.

* Fix Windows CI

test_free.yml: added Release config to ctest commands (#773)

---------

Co-authored-by: D-Exposito <dexposito@cesga.es>
Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com>
Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com>
Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es>
Co-authored-by: JPRichings <james.richings@ed.ac.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants