Ctrls performance#739
Conversation
Updates version references and fixes doxygen parsing of CMake configured header broken in QuEST-Kit#616
|
Todo:
Assumptions in this code:
|
|
Alloc size 64 qubits -- add as macro somewhere if not done already |
| const int NUM_THREADS_PER_BLOCK = 128; | ||
| const int NUM_THREADS_PER_BLOCK =128; | ||
|
|
||
| __device__ __constant__ int ctrl_device[30]; |
There was a problem hiding this comment.
TODO: use a MAX_NUM_QUBITS = 64 or something constant in constants.hpp
|
Reminder of other stuff from meeting:
|
|
Way easier to do this instead: Beginning with CUDA 12.1, you can now pass up to 32,764 bytes as kernel parameters on NVIDIA Volta and above Note that in both preceding examples, kernel parameters are annotated with the grid_constant qualifier to indicate they are read-only. reference: https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/ I think this solves our concerns about multi-qureg operations both accessing a common ctrls cache in future. (this is also much better than the variadic kernel idea I had earlier today) Other benefits:
|
|
Profiling to confirm but here is an extra factor of 2 over #729 (comment): |
|
Is it possible to pass multiple, distinct arguments to the one function that way? E.g. so that the The pattern of Orthogonally, do we have consternations about bumping min CUDA to 12.1 (released 2023)? |
|
I think it is possible to have multiple arguments passed this way. All this is (and its what I tried to do first with this optimisation) capture by value in the kernel launch a c style array. The I think this is much cleaner than the data movement to const memory as instead of pointers to device constant memory we just pass the array directly. Only change to the original kernel (before all this optimisation stuff) is then we need to access a data member of a struct. There is no other change inside the kernel and we aren't accessing a global device variable out of the blue mid kernel. In We probably don't need to to change cuda versions unless we think we are passing over more than 4096 bytes in the kernel. I don't think we are anywhere near this but no objection to moving to CUDA 12.1. |
|
Finally usual caveats that correctness checking needs to take place with a full run of the test suite! |
|
Oh I see! And this is for ~12-15 qubits? Pretty neat to reduce the CUDA overhead generally and lower the "GPU is worthwhile" threshold! 🎉 There might be a chance for a more systematic change here! The... struct could actually live in a higher level than just the GPU backend, in order to avoid STL copy overheads. See #720. Whatcha think?? |
|
The results above are for 20 qubits but here are some other results for grace-hopper: 10 qubits 12 qubits 14 quits: 16 qubits: 18 qubits: Yes I agree we should move the QubitList_t struct higher and think about versions of some of the sorting routine to return directly to it so we avoid the calls to |
|
268/269 tests passing on Grace-hopper. Failed at the last hurdle: Looking at I have also noticed that there is a pattern in some of the very slow running (> 700 sec) tests: Additionally I need to look into |
|
Bit verbose but this will be useful for following up on performance: QuEST/QuEST/buildt/Testing/Temporary/CTestCostData.txt Test configuration (I think I forgot to reduce the complexity): |
|
Coming back to my comment on This needs to be thought about in more detail as its going to need some very careful thought about when data should be moved to GPU in I think given that this seems like a distinct change I will raise a separate pull request for it. |
|
After pulling devel in and rerunning tests: This needs further investigation |
|
Some of the test failures here on github are due to rocm not recognising But does look like this is supported in rocm docs: Further investigation required. Note: This doesn't scupper us it just means we might need to provide a custom type which changes for cuda and hip compiles. The use of |
|
Capture of simple struct in kernel launch tested on AMD so we might loose the benefit of |
Aw shoot! Does removing #if defined(__NVCC__)
#define _NO_COPY_OPT __grid_constant__
#elif defined(__HIP__)
#define _NO_COPY_OPT
#endifBtw the reference you link indicates:
Use of |
|
I agree macro guarding this will probably be the solution I need to implement this change in one of the kernels and retest to guarantee the performance holds and then confirm this works for Nvidia and AMD. I agree keeping the GPU capability support as old as possible is best to support a wider set of users so I will explore adding this into the solution so we don't have to bump the compute capability if we don't need to. |
|
Heads up that SmallList has reached |
| return number; | ||
| } | ||
|
|
||
| INLINE qindex insertBits(qindex number, const int* bitIndices, int numIndices, int bitValue) { |
| int ctrl_device[64]; | ||
| } ctrl_device_t; | ||
|
|
||
| struct QubitList_t { |
There was a problem hiding this comment.
QubitList_t should be made entirely redundant by SmallList (which we can still rename)
|
Is there a blocker to this beyond further performance investigation and/or updating given my upstream changes? I'm happy to make a PR to this migrating it over to (Are you seeing a performance difference between |
|
No blocker technical just human time constraints I will get to this on Monday and have agreed a strategy with Oliver for merging this with everything else. |
|
Just noting here that we defer this to v4.3.1. (Sidenote: semver says that the patch index is only for changes that fix "incorrect behaviour", but the minor version is only a MAY change if there are significant changes to the private behaviour, so where does that leave a performance patch? I suppose one could declare being much slower than it needs to be on GPU a bug, and therefore it's a bug fix...) |
|
Hehe a patch index bump is inoffensive to me - they're "free" in my mental model |
|
I think a post Unitary Hack 4.3.1 makes sense then (assuming we've already got 4.3.0 out by then!) |
* Fix applyMultiStateControlledSqrtSwap argument list (#738) Remove numControls argument from applyMultiStateControlledSqrtSwap overloaded definition taking std::vector<int> (cherry picked from commit 9c20792) Co-authored-by: D-Exposito <dexposito@cesga.es> * Trotterisation test update (#728) * tests/unit/trotterisation.cpp: updated to use REQUIRE_AGREE and cached statevecs and densmats, and both permutePaulis options * tests/utils/compare.hpp/cpp: added setters for test epsilon * tests/unit/trotterisation.cpp: adjusted test epsilon for quad precision imaginary time evolution tests * tests/unit/trotterisation.cpp: moved unitary time evo test to REQUIRE_AGREE * tests/utils/cache.hpp/cpp: added additional utilities for creating and destroying temp caches (which I guess makes them not caches?) with a set number of qubits * tests/unit/trotterisation.cpp: updated unitary time evo test to test across deployments * tests/unit/trotterisation.cpp: reduced number of qubits and increased number of steps to admit the possibility of testing density matrices too * tests/unit/trotterisation.cpp: added density matrix tests * reduce test precision to lazily pass CPU clang quad-precision * skip Trotter tests in paid CI * changing varname convention * renaming cache funcs --------- Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com> * added Daniel Patino to authorlist * CMake warn when non-release build (#742) --------- Co-authored-by: Oliver Thomson Brown <otbrown@users.noreply.github.com> * Stop Trotter funcs mutating PauliStrSum (#740) Formerly, the Trotter functions (such as applyTrotterizedPauliStrSumGadget()), when passed permutePaulis=true, would randomly permutate the order of the passed PauliStrSum, mutating it and affecting the outputs of subsequent functions like reportPauliStrSum(). The function also contained superfluous memory allocs/copies equal in size to the PauliStrSum. Now, the PauliStrSum is never mutated, and an internally allocated ordering list keeps track of the randomised permutation. We also updated the doc, renamed permutePaulis to permuteTerms, and improved validation. Note that 'permuteTerms' had not yet reached main/release, so these changes do not need to be documented in the v4.3 release notes. * Created custom backend complex types (#729) Created cpu_qcomp and gpu_qcomp (from a shared base_qcomp) to avoid std::complex arithmetic operators in hot loops which caused performance issues. Removed all prior compiler flags and related scaffolding attempting to mitigate the performance issue. Also gave MSVC build the params `/Zc:preprocessor -Xcompiler=/Zc:preprocessor /bigobj` as needed for compilation of the unit tests on my windows machines. * Replace vector<int> with SmallList (a stack array) (#743) This is to circumvent the std::vector performance overheads visible in few-qubit simulation (responsible for a performance regression from v3; see #720), and also so that qubit lists can be passed directly to CUDA kernels without conversion (as explored in #739). * Added few-qubit optimisations (#750) Optimisations include: - Adopted SmallView (const SmallList&) to avoid superfluous SmallList copies - Made internally created matrices static - Change accelerator dynamic function vectors to static arrays - Exit all validators early when validation is disabled Additional cleanup includes: - Tidied accelerator macros (replaced param-specific macros like "numCtrls" and "numTargs" with "param") - Fill ctrlStates vectors with default before localiser - Renamed getBitsFromInteger to setToBitsOfInteger - Adopted const in bitwise.hpp to better express intent Note that the naming of SmallList and SmallView will be subsequently changed to List64 and ConstList64 * Renamed debug API functions to contain "QuEST" (#752) * Renamed environment variables to begin with"QUEST" (#755) * Renamed CMake vars and preprocessors (#756) such that they all begin with QUEST, but some have additional changes * Renamed Small(List|View) to (Const)List64 (#757) * Defer Catch2 test discovery so that we can compile MPI tests on systems which cannot actually run with MPI, because they are missing an MPI or UCX library file, as is witnessed in the CI (when compiling with MPICH). It's generally irksome too to trigger an execution of the test binary (which itself initialises QuEST) during build when on a HPC platform with distinct submit and compute nodes * Enable user to take ownership of MPI (#722) * Added ENABLE_SUBCOMM build option * Moved from MPI_COMM_WORLD to mpiQuestComm * Decided passing *MPI_Comm was probably overly cautious, and updated function name to comm_getMpiComm * environment.cpp: added methods to reset rank and numNodes, and reporting for subcomm compiled * comm_config.hpp/cpp: added comm_setMpiComm * CMakeLists.txt: PUBLIC MPI::MPI_CXX turned out to be unhelpful, even for SubComm, because of course it enforces CXX * Added new custom QuESTEnv initialiser which allow user to positively declare that they take ownership of MPI * validation.cpp: updated comm_end call * comm_config.hpp: added config.h include so COMPILE_MPI is actually defined * subcommunicator.h/cpp: implemented QuESTEnv initialiser with custom MPI_Comm * CMake: added subcommunicator.cpp * comm_config.hpp: added missing config.h include... * comm_config.cpp: explicitly initialise mpiCommQuest to MPI_COMM_NULL, updated setComm for init only workflow * quest.h: added subcommunicator header * CMake: added MPI to application binaries when SUBCOMM is enabled * comm_routines.cpp: post Irecv before Isend which probably won't do anything but it makes MPI library implementers less nervous * tests: added new env test for initCustomMpiQuESTEnv * Added error throws to comm_config to cover new scenarios of badness with user owned MPI * subcommunicator.cpp: updated var names to match QuEST style * tests/unit/initialisations.cpp: slightly modified setQuregAmps test to avoid unexpected test failure due to range checking when compild in Debug configuration * Updated validation in comm_setMpiComm Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es> * userOwnsMpi int->bool * comm_config.cpp: corrected call to MPI_Comm_free * subcommunicator.cpp: userOwnsMpi int->bool * subcommunicator.cpp: added comm_isInit guard around comm_setMpiComm * environment.cpp: USER_OWNS_MPI -> userOwnsMpi * comm_init: fixed case where useDistrib = 0 and userOwnsMpi = true * comm_init: moved (recently) misplaced MPI_Init * AUTHORS.txt: added iarejula-bsc * Added placeholder docstrings to new initialisers * docs/cmake.md: added ENABLE_SUBCOMM to list of QuEST CMake vars * Newly added COMPILE_MPI -> QUEST_COMPILE_MPI * ENABLE_SUBCOMM -> QUEST_ENABLE_SUBCOMM * CMake: corrected OpenMP and subcommunicator pre-processor definitions --------- Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es> * Add flush and sync around prints (#763) to reduce the likelihood of users printing from non-root nodes interrupting QuEST root output. This is not bullet-proof; we sync the active communicator rather than MPI_COMM_WORLD so the user-controlled non-participating processes may still be printing. Furthermore, even if all processes participate, some may have outstanding non-root prints that are not aggregated to the user screen by the time MPI_Barrier finishes. But these syncs greatly reduce the change of corruption, and are effectively free! * Add GPU-aware MPICH detection This enables CRAY MPICH platforms to leverage GPU-awareness, greatly accelerating distributed GPU simulation Co-authored-by: JPRichings <james.richings@ed.ac.uk> * Cleanup custom MPI flow (#762) Important changes: - permit user initialisation of MPI when QuEST is not distributed - changed QuESTEnv fields bool from int (e.g. isMultithreaded) - add user-input validation for custom MPI calls - disambiguated comm_config.cpp concepts of "MPI is initialised" (comm_isMpiInit) from "QuEST communication is active" (comm_isActive) - refactored comm_config.cpp flow, especially related to pre-quest-init flow (during validation) - added Oliver's custom-MPI examples (from #712) - moved new API functions to experimental.h - tweaked reportQuESTEnv output grouping * Added user-control of GPU num threads per block (#736) Added: - QUEST_DEFAULT_NUM_GPU_THREADS_PER_BLOCK CMake option - QUEST_DEFAULT_NUM_GPU_THREADS_PER_BLOCK environment variable - setQuESTNumGpuThreadsPerBlock() API function - getQuESTNumGpuThreadsPerBlock() API function - set_num_gpu_threads examples in examples/extended --------- Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com> * Fix compiler warnings (#770) Beware this included removing the superfluous `numControls` argument from the C++only `std::vector` overload of `applyMultiStateControlledCompMatr2`, which is technically a teeny tiny API break ¯\_(ツ)_/¯ * tests/unit/debug.cpp: updated setQuESTSeeds validation tests to include new validation (#771) Updated number of seeds test to use a valid pointer and added a separate NULL pointer test. * Fix Windows CI test_free.yml: added Release config to ctest commands (#773) --------- Co-authored-by: D-Exposito <dexposito@cesga.es> Co-authored-by: Oliver Thomson Brown <8394906+otbrown@users.noreply.github.com> Co-authored-by: Tyson Jones <tyson.jones.input@gmail.com> Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es> Co-authored-by: JPRichings <james.richings@ed.ac.uk>
|
I agree we push this to 4.3.1 so we get 4.3 out the door. |


Initial pass on performance changed to remove thrust device_vector that is used to move ctrls to device that is causing a performance impact due to the thrust device_vector moving data from host to device on construction.