feat(examples): add L3 ring allreduce skeleton (chunked RS+AG) by georgebisbas · Pull Request #972 · hw-native-sys/simpler

georgebisbas · 2026-06-02T09:46:08Z

Adds a new L3 example allreduce_ring_distributed/ implementing chunked ring allreduce (P−1 reduce-scatter + P−1 allgather rounds) over the HCCL communication window.

Verification status

Phase	Status
A — L3 driver	NPU P=2 PASS @ `8fb6316e` (a2a3, devices 5–6; temporary mesh-binary fallback)
B/C — Ring RS+AG kernel	Kernel + `allreduce_ring_common.hpp` wired; `main.py` compiles ring sources; NPU verify pending

What changed

New: examples/workers/l3/allreduce_ring_distributed/

Component	Description
`kernels/aiv/allreduce_ring_kernel.cpp`	Stage-in → (P−1) RS ring steps → (P−1) AG ring steps → stage-out
`kernels/aiv/allreduce_ring_common.hpp`	Chunk copy, per-round barrier, left-neighbour recv helpers
`kernels/orchestration/allreduce_ring_orch.cpp`	Same 3-tensor + 2-scalar arg layout as mesh allreduce
`main.py`	Worker/orchestration glue; ring scratch sizing; same golden as mesh
`test_allreduce.py`	P=2 + P=4 pytest (split for a5 2-NPU CI, same pattern as mesh)

Edit: examples/workers/l3/README.md — documents the new example.

Scratch layout (per rank, in window):

P chunk slots (partitioned input / working buffers)
1 exchange publish slot (one chunk, visible to left neighbour)
Signal tail: 2(P−1) × kMaxSupportedRanks int32 slots (one row per round)

Golden: identical to mesh — output[i] = sum_r (i + r×100) for all ranks (256-element float32 vectors; ALLREDUCE_COUNT must divide nranks).

NPU re-verify (after ring kernel wired)

python examples/workers/l3/allreduce_distributed/main.py -p a2a3 -d 5-6   # baseline
python examples/workers/l3/allreduce_ring_distributed/main.py -p a2a3 -d 5-6
python examples/workers/l3/allreduce_ring_distributed/main.py -p a2a3sim -d 0-3

New allreduce_ring_distributed example: P-1 reduce-scatter and P-1 allgather ring rounds over HCCL window exchange slots, with per-round notify/wait barriers. Same golden as mesh allreduce_distributed. P=2 and P=4 pytest; default CLI device 0-3.

coderabbitai · 2026-06-02T09:46:16Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 61f4dbb5-3fcd-4e15-8ce1-981bba080cc6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a distributed ring allreduce implementation under examples/workers/l3/, consisting of a chunked reduce-scatter and allgather kernel, orchestration code, a Python driver, and tests. Feedback on the changes highlights critical issues in the kernel: signal slots in the scratch buffer must be explicitly zero-initialized to prevent barrier failures, and additional barriers are required in both the reduce-scatter and allgather loops to resolve Write-After-Read (WAR) hazards. Consequently, the SIGNAL_SLOTS allocation in the Python driver needs to be doubled to prevent out-of-bounds indexing. Additionally, the all-to-all barrier can be optimized to a ring-specific neighbor synchronization to reduce overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-02T09:48:55Z

+    __gm__ int32_t *signal_base =
+        reinterpret_cast<__gm__ int32_t *>(scratch + static_cast<size_t>((nranks + 1) * chunk_elems));


Following the general rule:

Do not assume that allocated shared memory or device memory is zero-initialized. Always explicitly initialize all fields (such as thread/core counts and mapping arrays) to prevent garbage values from causing segmentation faults or undefined behavior.

The signal slots in the scratch buffer are not guaranteed to be zero-initialized. If they contain positive garbage values, TWAIT with WaitCmp::GE to 1 will immediately succeed, completely breaking the barrier synchronization. We must explicitly initialize the local signal slots to 0 at the start of the kernel.

__gm__ int32_t *signal_base = reinterpret_cast<__gm__ int32_t *>(scratch + static_cast<size_t>((nranks + 1) * chunk_elems)); // Explicitly initialize local signal slots to 0 to prevent garbage values from causing undefined behavior in barriers. const int total_signal_slots = 4 * (nranks - 1) * kMaxSupportedRanks; for (int i = 0; i < total_signal_slots; ++i) { signal_base[i] = 0; } pipe_barrier(PIPE_ALL);

References

Do not assume that allocated shared memory or device memory is zero-initialized. Always explicitly initialize all fields (such as thread/core counts and mapping arrays) to prevent garbage values from causing segmentation faults or undefined behavior.

gemini-code-assist · 2026-06-02T09:48:55Z

+        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
+        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
+        pipe_barrier(PIPE_ALL);


There is a critical Write-After-Read (WAR) hazard (anti-dependency) across ranks in the reduce-scatter loop.

In step $S$, Rank $R$ reads from Rank $R-1$'s exchange slot. However, if Rank $R-1$ is faster, it can proceed to step $S+1$ and overwrite its exchange slot before Rank $R$ has finished reading it. The single RoundBarrier at the start of the step only ensures that the write from the previous step is complete, but does not prevent a fast writer from overwriting the buffer before a slow reader finishes.

To resolve this, we must add a second barrier at the end of each step to ensure all reads are complete before any rank proceeds to overwrite its exchange slot in the next step.

set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); pipe_barrier(PIPE_ALL); RoundBarrier(commCtx, signal_base, my_rank, nranks, round++);

gemini-code-assist · 2026-06-02T09:48:55Z

+        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
+        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
+        pipe_barrier(PIPE_ALL);


Similarly to the reduce-scatter loop, there is a Write-After-Read (WAR) hazard in the allgather loop. A fast rank can proceed to the next step and overwrite its exchange slot before its right neighbor has finished reading it.

Adding a second barrier at the end of each step ensures that all ranks have finished reading the current step's exchange slots before any rank proceeds to overwrite them.

set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); pipe_barrier(PIPE_ALL); RoundBarrier(commCtx, signal_base, my_rank, nranks, round++);

gemini-code-assist · 2026-06-02T09:48:55Z

+K_MAX_SUPPORTED_RANKS = 16
+CHUNK_MAX = ALLREDUCE_COUNT // 2  # largest chunk (P=2)
+SCRATCH_FLOAT_ELEMS = (K_MAX_SUPPORTED_RANKS + 1) * CHUNK_MAX
+SIGNAL_SLOTS = 2 * (K_MAX_SUPPORTED_RANKS - 1) * K_MAX_SUPPORTED_RANKS


Since we need to introduce a second barrier per step in both the reduce-scatter and allgather loops to prevent WAR hazards, the total number of barrier rounds per execution doubles from $2(P-1)$ to $4(P-1)$.

We must double the allocated SIGNAL_SLOTS accordingly to prevent out-of-bounds signal indexing in the scratch buffer.

Suggested change

SIGNAL_SLOTS = 2 * (K_MAX_SUPPORTED_RANKS - 1) * K_MAX_SUPPORTED_RANKS

SIGNAL_SLOTS = 4 * (K_MAX_SUPPORTED_RANKS - 1) * K_MAX_SUPPORTED_RANKS

gemini-code-assist · 2026-06-02T09:48:55Z

+AICORE inline void RoundBarrier(__gm__ CommContext *ctx, __gm__ int32_t *signal_base, int my_rank, int nranks,
+                                int round) {
+    __gm__ int32_t *round_signals = signal_base + round * kMaxSupportedRanks;
+    for (int peer = 0; peer < nranks; ++peer) {
+        if (peer == my_rank) {
+            continue;
+        }
+        __gm__ int32_t *remote_signal = CommRemotePtr(ctx, round_signals + my_rank, peer);
+        pto::comm::Signal sig(remote_signal);
+        pto::comm::TNOTIFY(sig, (int32_t)1, pto::comm::NotifyOp::AtomicAdd);
+    }
+    for (int peer = 0; peer < nranks; ++peer) {
+        if (peer == my_rank) {
+            continue;
+        }
+        pto::comm::Signal sig(round_signals + peer);
+        pto::comm::TWAIT(sig, (int32_t)1, pto::comm::WaitCmp::GE);
+    }
+    pipe_barrier(PIPE_ALL);
+}


In a ring topology, each rank only interacts with its immediate left and right neighbors (reading from left, writing to right).

Using an all-to-all barrier (RoundBarrier with all peers) introduces unnecessary synchronization overhead of $O(P)$ notifications and waits per rank. We can optimize this to $O(1)$ by only notifying the right neighbor and waiting for the left neighbor.

AICORE inline void RoundBarrier(__gm__ CommContext *ctx, __gm__ int32_t *signal_base, int my_rank, int nranks, int round) { __gm__ int32_t *round_signals = signal_base + round * kMaxSupportedRanks; int right = (my_rank + 1) % nranks; int left = (my_rank - 1 + nranks) % nranks; // Notify right neighbor __gm__ int32_t *remote_signal = CommRemotePtr(ctx, round_signals + my_rank, right); pto::comm::Signal sig_remote(remote_signal); pto::comm::TNOTIFY(sig_remote, (int32_t)1, pto::comm::NotifyOp::AtomicAdd); // Wait for left neighbor pto::comm::Signal sig_local(round_signals + left); pto::comm::TWAIT(sig_local, (int32_t)1, pto::comm::WaitCmp::GE); pipe_barrier(PIPE_ALL); }

- Use one signal row with cumulative TWAIT generations instead of per-round rows or mid-barrier slot resets that race with TNOTIFY. - Add stage-in device barrier and zero-init signals once per run. - Match mesh local-write, barrier, then remote-read for exchange chunks. - Allocate a single kMaxSupportedRanks int32 tail in main.py.

gemini-code-assist Bot reviewed Jun 2, 2026

View reviewed changes

georgebisbas marked this pull request as draft June 2, 2026 09:54

georgebisbas force-pushed the feat/l3-ring-allreduce-skeleton branch from 50aff1b to 991c9db Compare June 2, 2026 13:11

georgebisbas force-pushed the feat/l3-ring-allreduce-skeleton branch from 991c9db to cb7ab81 Compare June 2, 2026 13:38

georgebisbas closed this Jun 2, 2026

georgebisbas mentioned this pull request Jun 2, 2026

feat(examples): L3 ring allreduce (chunked RS+AG, a2a3 verified) #975

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(examples): add L3 ring allreduce skeleton (chunked RS+AG)#972

feat(examples): add L3 ring allreduce skeleton (chunked RS+AG)#972
georgebisbas wants to merge 2 commits into
hw-native-sys:mainfrom
georgebisbas:feat/l3-ring-allreduce-skeleton

georgebisbas commented Jun 2, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		__gm__ int32_t *signal_base =
		reinterpret_cast<__gm__ int32_t >(scratch + static_cast<size_t>((nranks + 1) chunk_elems));

	SIGNAL_SLOTS = 2 * (K_MAX_SUPPORTED_RANKS - 1) * K_MAX_SUPPORTED_RANKS
	SIGNAL_SLOTS = 4 * (K_MAX_SUPPORTED_RANKS - 1) * K_MAX_SUPPORTED_RANKS

Conversation

georgebisbas commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Verification status

What changed

NPU re-verify (after ring kernel wired)

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

georgebisbas commented Jun 2, 2026 •

edited

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading