On CPU, the rank<2> parallelization isn't optimal because the compiler doesn't optimize the code for iterations that stay within a single line.
A 10% speedup can be obtained using the strategy below, but it needs to be validated on GPU first.
Additionally, the code currently uses many Kokkos::fence() calls. These are likely unnecessary once the GPU port is finished, because the Kokkos kernels already wait for each other and no host synchronization is needed between them.
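Kokkos::parallel_for(
    // One team per black circle; Kokkos::AUTO lets Kokkos pick the team size.
    Kokkos::TeamPolicy<Kokkos::DefaultHostExecutionSpace>(num_black_circles, Kokkos::AUTO),
    // The member type must come from the same policy type that is used above.
    KOKKOS_LAMBDA(const Kokkos::TeamPolicy<Kokkos::DefaultHostExecutionSpace>::member_type& team) {
        const int circle_task = team.league_rank();
        // Each team handles every second circle, starting at start_black_circles.
        const int i_r = start_black_circles + circle_task * 2;
        // The threads of the team share the theta loop of this one circle.
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team, grid.ntheta()),
            [&](const int i_theta) {
                nodeApplyAscOrthoCircleTake(i_r, i_theta, ...);
            });
    });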
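To make the index mapping of the outer loop easier to check, here is a minimal sketch in plain C++ (no Kokkos, so it compiles on its own). The helper name black_circle_indices is hypothetical and not part of the code base; it only reproduces the formula i_r = start_black_circles + circle_task * 2 from the snippet above and lists which circles the teams would visit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not in the original code): returns the i_r value
// that each league rank (circle_task) would compute in the team kernel.
std::vector<int> black_circle_indices(int start_black_circles, int num_black_circles)
{
    assert(num_black_circles >= 0);
    std::vector<int> indices;
    indices.reserve(static_cast<std::size_t>(num_black_circles));
    for (int circle_task = 0; circle_task < num_black_circles; ++circle_task) {
        // Same mapping as in the kernel: every second circle from the start offset.
        indices.push_back(start_black_circles + circle_task * 2);
    }
    return indices;
}
```

For example, with start_black_circles = 1 and num_black_circles = 4, the teams would handle the circles 1, 3, 5 and 7, so no two teams ever work on the same circle.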