Investigate Kokkos::TeamThreadRange instead of Rank<2> #283

@julianlitz

Description

On CPU, the Rank<2> parallelization isn't optimal, because the compiler doesn't optimize the code for staying within a single line (a fixed i_r while i_theta varies).
A 10% speedup can be obtained using the strategy below, but it needs to be validated on GPU first.

Additionally, the code currently uses many Kokkos::fence() calls. These are likely unnecessary once the GPU port is finished, since the Kokkos kernel calls already wait for each other and a host sync isn't needed.


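// Proposed alternative to the Rank<2> policy: one team per black circle,
// with the threads of that team sharing the theta loop of the circle.
// member_type is taken from the same policy type that is launched, so both agree.
using HostTeamPolicy = Kokkos::TeamPolicy<Kokkos::DefaultHostExecutionSpace>;
Kokkos::parallel_for(
    HostTeamPolicy(num_black_circles, Kokkos::AUTO),
    KOKKOS_LAMBDA(const HostTeamPolicy::member_type& team) {
        const int circle_task = team.league_rank();
        const int i_r = start_black_circles + circle_task * 2;
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team, grid.ntheta()),
            [&](const int i_theta) {
                nodeApplyAscOrthoCircleTake(i_r, i_theta, ...);
            });
    });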
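
Before benchmarking on GPU, it may help to confirm that the team-based decomposition touches exactly the same nodes as the current one. The sketch below is plain C++ without Kokkos and only models the index mapping. It assumes the existing Rank<2> range runs over circle_task in [0, num_black_circles) and i_theta in [0, ntheta), which the issue does not state. The function names nodes_rank2 and nodes_team are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <utility>

using Node = std::pair<int, int>; // (i_r, i_theta)

// Stand-in for the Rank<2> iteration (assumed shape): a flat 2D index
// space that is split back into (circle_task, i_theta).
std::set<Node> nodes_rank2(int start_black_circles, int num_black_circles, int ntheta)
{
    assert(num_black_circles >= 0 && ntheta >= 0);
    std::set<Node> nodes;
    const int total = num_black_circles * ntheta;
    for (int idx = 0; idx < total; ++idx) {
        const int circle_task = idx / ntheta;
        const int i_theta = idx % ntheta;
        nodes.insert({start_black_circles + circle_task * 2, i_theta});
    }
    return nodes;
}

// Stand-in for the team strategy: the outer loop plays the role of
// team.league_rank(), the inner loop the role of TeamThreadRange.
std::set<Node> nodes_team(int start_black_circles, int num_black_circles, int ntheta)
{
    assert(num_black_circles >= 0 && ntheta >= 0);
    std::set<Node> nodes;
    for (int league_rank = 0; league_rank < num_black_circles; ++league_rank) {
        const int i_r = start_black_circles + league_rank * 2;
        for (int i_theta = 0; i_theta < ntheta; ++i_theta) {
            nodes.insert({i_r, i_theta}); // i_r stays fixed along the line
        }
    }
    return nodes;
}
```

Comparing the two sets for a few grid sizes, for example nodes_rank2(1, 3, 8) == nodes_team(1, 3, 8), checks only coverage. It says nothing about performance or about how Kokkos schedules the work on either backend.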

Metadata

Assignees: No one assigned
Labels: question (Further information is requested)