Investigate Kokkos::TeamThreadRange instead of Rank<2>

On CPU, the Rank<2> parallelization isn't optimal, because the compiler doesn't optimize the code for staying within a single line.
A 10% speedup can be obtained with the TeamThreadRange strategy below, but it needs to be validated on GPU first.

Additionally, the code currently uses many Kokkos::fence() calls. These are likely unnecessary once the GPU port is finished: Kokkos kernels submitted to the same execution space instance already wait for each other, so a host sync isn't needed between them.

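```
// One team per black circle; the threads of a team share the theta loop.
// The member type is taken from the same policy type that is launched, so it
// also matches when the default execution space is not the host.
using team_policy = Kokkos::TeamPolicy<Kokkos::DefaultHostExecutionSpace>;

Kokkos::parallel_for(
    team_policy(num_black_circles, Kokkos::AUTO),
    KOKKOS_LAMBDA(const team_policy::member_type& team) {
        // Black circles are every second radial index.
        const int circle_task = team.league_rank();
        const int i_r = start_black_circles + circle_task * 2;
        // Inner loop stays on one circle and runs over all theta indices.
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team, grid.ntheta()),
            [&](const int i_theta) {
                nodeApplyAscOrthoCircleTake(i_r, i_theta, ...);
            });
    });
```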


Investigate Kokkos::TeamThreadRange instead of Rank<2> #283
