Skip to content

feat(examples): add L3 ring allreduce skeleton (chunked RS+AG)#972

Closed
georgebisbas wants to merge 2 commits into
hw-native-sys:mainfrom
georgebisbas:feat/l3-ring-allreduce-skeleton
Closed

feat(examples): add L3 ring allreduce skeleton (chunked RS+AG)#972
georgebisbas wants to merge 2 commits into
hw-native-sys:mainfrom
georgebisbas:feat/l3-ring-allreduce-skeleton

Conversation

@georgebisbas
Copy link
Copy Markdown
Contributor

@georgebisbas georgebisbas commented Jun 2, 2026

Adds a new L3 example allreduce_ring_distributed/ implementing chunked ring allreduce (P−1 reduce-scatter + P−1 allgather rounds) over the HCCL communication window.

Verification status

Phase Status
A — L3 driver NPU P=2 PASS @ 8fb6316e (a2a3, devices 5–6; temporary mesh-binary fallback)
B/C — Ring RS+AG kernel Kernel + allreduce_ring_common.hpp wired; main.py compiles ring sources; NPU verify pending

What changed

New: examples/workers/l3/allreduce_ring_distributed/

Component Description
kernels/aiv/allreduce_ring_kernel.cpp Stage-in → (P−1) RS ring steps → (P−1) AG ring steps → stage-out
kernels/aiv/allreduce_ring_common.hpp Chunk copy, per-round barrier, left-neighbour recv helpers
kernels/orchestration/allreduce_ring_orch.cpp Same 3-tensor + 2-scalar arg layout as mesh allreduce
main.py Worker/orchestration glue; ring scratch sizing; same golden as mesh
test_allreduce.py P=2 + P=4 pytest (split for a5 2-NPU CI, same pattern as mesh)

Edit: examples/workers/l3/README.md — documents the new example.

Scratch layout (per rank, in window):

  • P chunk slots (partitioned input / working buffers)
  • 1 exchange publish slot (one chunk, visible to left neighbour)
  • Signal tail: 2(P−1) × kMaxSupportedRanks int32 slots (one row per round)

Golden: identical to mesh — output[i] = sum_r (i + r×100) for all ranks (256-element float32 vectors; ALLREDUCE_COUNT must divide nranks).

NPU re-verify (after ring kernel wired)

python examples/workers/l3/allreduce_distributed/main.py -p a2a3 -d 5-6   # baseline
python examples/workers/l3/allreduce_ring_distributed/main.py -p a2a3 -d 5-6
python examples/workers/l3/allreduce_ring_distributed/main.py -p a2a3sim -d 0-3

New allreduce_ring_distributed example: P-1 reduce-scatter and P-1
allgather ring rounds over HCCL window exchange slots, with per-round
notify/wait barriers. Same golden as mesh allreduce_distributed.
P=2 and P=4 pytest; default CLI device 0-3.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 61f4dbb5-3fcd-4e15-8ce1-981bba080cc6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a distributed ring allreduce implementation under examples/workers/l3/, consisting of a chunked reduce-scatter and allgather kernel, orchestration code, a Python driver, and tests. Feedback on the changes highlights critical issues in the kernel: signal slots in the scratch buffer must be explicitly zero-initialized to prevent barrier failures, and additional barriers are required in both the reduce-scatter and allgather loops to resolve Write-After-Read (WAR) hazards. Consequently, the SIGNAL_SLOTS allocation in the Python driver needs to be doubled to prevent out-of-bounds indexing. Additionally, the all-to-all barrier can be optimized to a ring-specific neighbor synchronization to reduce overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +133 to +134
__gm__ int32_t *signal_base =
reinterpret_cast<__gm__ int32_t *>(scratch + static_cast<size_t>((nranks + 1) * chunk_elems));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Following the general rule:

Do not assume that allocated shared memory or device memory is zero-initialized. Always explicitly initialize all fields (such as thread/core counts and mapping arrays) to prevent garbage values from causing segmentation faults or undefined behavior.

The signal slots in the scratch buffer are not guaranteed to be zero-initialized. If they contain positive garbage values, TWAIT with WaitCmp::GE to 1 will immediately succeed, completely breaking the barrier synchronization. We must explicitly initialize the local signal slots to 0 at the start of the kernel.

    __gm__ int32_t *signal_base =
        reinterpret_cast<__gm__ int32_t *>(scratch + static_cast<size_t>((nranks + 1) * chunk_elems));

    // Explicitly initialize local signal slots to 0 to prevent garbage values from causing undefined behavior in barriers.
    const int total_signal_slots = 4 * (nranks - 1) * kMaxSupportedRanks;
    for (int i = 0; i < total_signal_slots; ++i) {
        signal_base[i] = 0;
    }
    pipe_barrier(PIPE_ALL);
References
  1. Do not assume that allocated shared memory or device memory is zero-initialized. Always explicitly initialize all fields (such as thread/core counts and mapping arrays) to prevent garbage values from causing segmentation faults or undefined behavior.

Comment on lines +177 to +179
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
pipe_barrier(PIPE_ALL);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a critical Write-After-Read (WAR) hazard (anti-dependency) across ranks in the reduce-scatter loop.

In step $S$, Rank $R$ reads from Rank $R-1$'s exchange slot. However, if Rank $R-1$ is faster, it can proceed to step $S+1$ and overwrite its exchange slot before Rank $R$ has finished reading it. The single RoundBarrier at the start of the step only ensures that the write from the previous step is complete, but does not prevent a fast writer from overwriting the buffer before a slow reader finishes.

To resolve this, we must add a second barrier at the end of each step to ensure all reads are complete before any rank proceeds to overwrite its exchange slot in the next step.

        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        pipe_barrier(PIPE_ALL);
        RoundBarrier(commCtx, signal_base, my_rank, nranks, round++);

Comment on lines +200 to +202
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
pipe_barrier(PIPE_ALL);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similarly to the reduce-scatter loop, there is a Write-After-Read (WAR) hazard in the allgather loop. A fast rank can proceed to the next step and overwrite its exchange slot before its right neighbor has finished reading it.

Adding a second barrier at the end of each step ensures that all ranks have finished reading the current step's exchange slots before any rank proceeds to overwrite them.

        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        pipe_barrier(PIPE_ALL);
        RoundBarrier(commCtx, signal_base, my_rank, nranks, round++);

K_MAX_SUPPORTED_RANKS = 16
CHUNK_MAX = ALLREDUCE_COUNT // 2 # largest chunk (P=2)
SCRATCH_FLOAT_ELEMS = (K_MAX_SUPPORTED_RANKS + 1) * CHUNK_MAX
SIGNAL_SLOTS = 2 * (K_MAX_SUPPORTED_RANKS - 1) * K_MAX_SUPPORTED_RANKS
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since we need to introduce a second barrier per step in both the reduce-scatter and allgather loops to prevent WAR hazards, the total number of barrier rounds per execution doubles from $2(P-1)$ to $4(P-1)$.

We must double the allocated SIGNAL_SLOTS accordingly to prevent out-of-bounds signal indexing in the scratch buffer.

Suggested change
SIGNAL_SLOTS = 2 * (K_MAX_SUPPORTED_RANKS - 1) * K_MAX_SUPPORTED_RANKS
SIGNAL_SLOTS = 4 * (K_MAX_SUPPORTED_RANKS - 1) * K_MAX_SUPPORTED_RANKS

Comment on lines +65 to +84
AICORE inline void RoundBarrier(__gm__ CommContext *ctx, __gm__ int32_t *signal_base, int my_rank, int nranks,
int round) {
__gm__ int32_t *round_signals = signal_base + round * kMaxSupportedRanks;
for (int peer = 0; peer < nranks; ++peer) {
if (peer == my_rank) {
continue;
}
__gm__ int32_t *remote_signal = CommRemotePtr(ctx, round_signals + my_rank, peer);
pto::comm::Signal sig(remote_signal);
pto::comm::TNOTIFY(sig, (int32_t)1, pto::comm::NotifyOp::AtomicAdd);
}
for (int peer = 0; peer < nranks; ++peer) {
if (peer == my_rank) {
continue;
}
pto::comm::Signal sig(round_signals + peer);
pto::comm::TWAIT(sig, (int32_t)1, pto::comm::WaitCmp::GE);
}
pipe_barrier(PIPE_ALL);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In a ring topology, each rank only interacts with its immediate left and right neighbors (reading from left, writing to right).

Using an all-to-all barrier (RoundBarrier with all peers) introduces unnecessary synchronization overhead of $O(P)$ notifications and waits per rank. We can optimize this to $O(1)$ by only notifying the right neighbor and waiting for the left neighbor.

AICORE inline void RoundBarrier(__gm__ CommContext *ctx, __gm__ int32_t *signal_base, int my_rank, int nranks,
                                int round) {
    __gm__ int32_t *round_signals = signal_base + round * kMaxSupportedRanks;
    int right = (my_rank + 1) % nranks;
    int left = (my_rank - 1 + nranks) % nranks;

    // Notify right neighbor
    __gm__ int32_t *remote_signal = CommRemotePtr(ctx, round_signals + my_rank, right);
    pto::comm::Signal sig_remote(remote_signal);
    pto::comm::TNOTIFY(sig_remote, (int32_t)1, pto::comm::NotifyOp::AtomicAdd);

    // Wait for left neighbor
    pto::comm::Signal sig_local(round_signals + left);
    pto::comm::TWAIT(sig_local, (int32_t)1, pto::comm::WaitCmp::GE);

    pipe_barrier(PIPE_ALL);
}

@georgebisbas georgebisbas marked this pull request as draft June 2, 2026 09:54
@georgebisbas georgebisbas force-pushed the feat/l3-ring-allreduce-skeleton branch from 50aff1b to 991c9db Compare June 2, 2026 13:11
- Use one signal row with cumulative TWAIT generations instead of
  per-round rows or mid-barrier slot resets that race with TNOTIFY.
- Add stage-in device barrier and zero-init signals once per run.
- Match mesh local-write, barrier, then remote-read for exchange chunks.
- Allocate a single kMaxSupportedRanks int32 tail in main.py.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant