Doc: add AICore kernel programming guide + warn against CCE topology intrinsics #962
📝 Walkthrough
This PR adds parallel documentation warnings to the a2a3 and a5 runtime topology intrinsic headers, advising against mixing simpler's intrinsics with CCE built-in topology intrinsics and directing users to use the (args) accessor variants instead.
Changes: Topology Intrinsics Compatibility Documentation
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request adds detailed documentation comments to intrinsic.h in both the a2a3 and a5 runtimes. The comments warn against mixing CCE built-in topology intrinsics (get_subblockid(), get_block_idx(), get_block_num()) with the custom runtime intrinsics, explaining the hardware register behavior and potential failure modes, and advising the use of the (args) variants instead. There are no review comments to address, and I have no further feedback to provide.
…y intrinsics

simpler's tensormap_and_ringbuffer runtime maintains its own SPMD context (block_idx, block_num, sub_block_id) in LocalContext / GlobalContext structures referenced from the kernel args[] tail.

The CCE built-in intrinsics get_subblockid(), get_block_idx(), get_block_num() (declared in kernel_operator.h / tikcfw) read AICore hardware registers that the runtime does NOT program, so a kernel that mixes them with the args-based accessors gets stale values. Most importantly, get_subblockid() returns 0 for BOTH AIV0 and AIV1 of every MIX cluster, causing AIV1 to silently redo AIV0's work and leaving AIV1's share of the output unwritten.

This was the partial-zero failure mode in issue hw-native-sys#900 / PR hw-native-sys#899 spmd_paged_attention_highperf: a kernel ported from native CANN compiled clean, ran without error, produced half-zero output on a2a3 hardware. Resolved kernel-side in PR hw-native-sys#899 by routing all three IDs through the args-based accessors.

Add three layers of documentation so the next port catches this before the same debugging round-trip:

- `docs/aicore-kernel-programming.md` (new): the kernel-author contract for this runtime: SPMD execution context, accessor functions, logical-vs-physical block_dim, the CCE-intrinsics warning with porting checklist, and pointers to working examples. Structured so future kernel-authoring topics (tensor args, FFTS sync, tiling) can grow under it.
- `docs/developer-guide.md`: link from the existing Example / Test Layout section so someone reading the dev guide finds the kernel-author contract from "kernels/" without searching.
- `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/common/intrinsic.h`: IMPORTANT block at the top of the file with the gotcha inline (for the grep-and-read discovery path) and a back-link to the programming guide for the full context.

Doc-only: no code or API changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
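The half-zero output described in the commit message can be reproduced in a small host-side model. This is only a sketch under stated assumptions: it does not call the real runtime or any CCE intrinsic, and `stale_register_sub_block_id`, `run_aiv`, `run_cluster`, `count_written`, and the 8-row output are hypothetical names and sizes chosen for illustration. It assumes each AIV of a MIX cluster writes the half of the output selected by its sub-block id.

```cpp
#include <array>
#include <cassert>

constexpr int kRows = 8;       // hypothetical output size
constexpr int kSubBlocks = 2;  // AIV0 and AIV1 of one MIX cluster

// Stand-in for a hardware register the runtime never programs: always reads 0.
inline int stale_register_sub_block_id() { return 0; }

// Each AIV writes only the slice of the output selected by its sub-block id.
inline void run_aiv(std::array<int, kRows>& out, int sub_block_id) {
    assert(sub_block_id >= 0 && sub_block_id < kSubBlocks);
    const int rows_per_aiv = kRows / kSubBlocks;
    const int begin = sub_block_id * rows_per_aiv;
    for (int r = begin; r < begin + rows_per_aiv; ++r) {
        out[r] = 1;  // mark the row as written
    }
}

// Runs both AIVs of one cluster. With use_runtime_id == false every AIV reads
// the stale register value; otherwise each AIV receives its own id.
inline std::array<int, kRows> run_cluster(bool use_runtime_id) {
    std::array<int, kRows> out{};  // zero-initialised output buffer
    for (int aiv = 0; aiv < kSubBlocks; ++aiv) {
        const int id = use_runtime_id ? aiv : stale_register_sub_block_id();
        run_aiv(out, id);
    }
    return out;
}

// Counts how many rows were written at least once.
inline int count_written(const std::array<int, kRows>& out) {
    int written = 0;
    for (int v : out) written += v;
    return written;
}
```

In this model `count_written(run_cluster(false))` covers only half of the rows because both AIVs write the same slice, while `run_cluster(true)` covers all of them. Nothing fails or reports an error in either case, which mirrors why the ported kernel compiled and ran cleanly.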
dc3c53c to bc87757
Summary
- The `tensormap_and_ringbuffer` runtime maintains its own SPMD execution context (`block_idx`, `block_num`, `sub_block_id`) in `LocalContext` / `GlobalContext` structures appended to the kernel `args[]` tail. The matching accessors are `get_block_idx(args)`, `get_block_num(args)`, `get_sub_block_id(args)` in `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/common/intrinsic.h`.
- The CCE built-in intrinsics `get_subblockid()`, `get_block_idx()`, `get_block_num()` (from `kernel_operator.h` / tikcfw) read AICore hardware registers that the simpler runtime does NOT program. A kernel that uses them silently gets stale values. Most notably, `get_subblockid()` returns `0` for both AIV0 and AIV1 of every MIX cluster, so AIV1 redoes AIV0's work and AIV1's share of the output is never written.
- Issue #900 ("`spmd_paged_attention_highperf` hardware run times out or produces partial zero output while `a2a3sim` passes") / PR #899 ("High performance Paged Attention A2A3 ST Test") `spmd_paged_attention_highperf`: a kernel ported from native CANN compiled clean, ran without error, produced half-zero output on a2a3 hardware. Resolved kernel-side in PR #899.

Adds three layers of documentation so the next port catches it before the same debugging round-trip:

- `docs/aicore-kernel-programming.md`: the kernel-author contract for this runtime. Reference doc with §1 args layout, §2 SPMD execution context (logical vs physical `block_dim`), §3 the CCE-intrinsics warning + porting checklist + worked example (PR #899), §4 related links. Structured to grow into a fuller programming guide (tensor args, FFTS sync, tiling) as future work lands.
- `docs/developer-guide.md`: one-line link from the existing Example / Test Layout section so the kernel-author contract is discoverable from "kernels/".
- `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/common/intrinsic.h`: IMPORTANT block at the top with the gotcha inline (grep-discoverable) plus a back-link to the new guide.

Doc-only: no code or API changes.
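To make the "context in the `args[]` tail" idea concrete, here is a minimal host-side sketch of args-based accessors. Everything in it is an assumption for illustration: the `Sketch*` structs, which field lives in which context, the two-slot tail convention, and the `sketch_get_*` names are hypothetical and are not the definitions in `intrinsic.h`. The real accessors are `get_block_idx(args)`, `get_block_num(args)`, and `get_sub_block_id(args)`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical context layouts; the real LocalContext / GlobalContext may differ.
struct SketchLocalContext {
    int32_t block_idx;     // logical block index of this kernel instance
    int32_t sub_block_id;  // 0 for AIV0, 1 for AIV1 (placement is assumed)
};
struct SketchGlobalContext {
    int32_t block_num;     // number of logical blocks (placement is assumed)
};

// Assumed convention for this sketch: two user args, then two context pointers.
constexpr std::size_t kUserArgs = 2;
constexpr std::size_t kLocalSlot = kUserArgs;
constexpr std::size_t kGlobalSlot = kUserArgs + 1;
using SketchArgs = std::array<uint64_t, kUserArgs + 2>;

// Recover the context structures from the pointer values stored in the tail.
inline const SketchLocalContext& local_ctx(const SketchArgs& args) {
    assert(args[kLocalSlot] != 0);
    return *reinterpret_cast<const SketchLocalContext*>(
        static_cast<uintptr_t>(args[kLocalSlot]));
}
inline const SketchGlobalContext& global_ctx(const SketchArgs& args) {
    assert(args[kGlobalSlot] != 0);
    return *reinterpret_cast<const SketchGlobalContext*>(
        static_cast<uintptr_t>(args[kGlobalSlot]));
}

// Args-based accessors: every id comes from args, never from a hardware register.
inline int32_t sketch_get_block_idx(const SketchArgs& args) { return local_ctx(args).block_idx; }
inline int32_t sketch_get_block_num(const SketchArgs& args) { return global_ctx(args).block_num; }
inline int32_t sketch_get_sub_block_id(const SketchArgs& args) { return local_ctx(args).sub_block_id; }

// Host-side helper that plays the role of the runtime filling in the tail.
// The referenced contexts must outlive the returned args array.
inline SketchArgs make_sketch_args(const SketchLocalContext& local,
                                   const SketchGlobalContext& global) {
    SketchArgs args{};
    args[kLocalSlot] = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(&local));
    args[kGlobalSlot] = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(&global));
    return args;
}
```

When porting, the mapping described in this PR is that each no-argument CCE call (`get_block_idx()`, `get_block_num()`, `get_subblockid()`) is replaced by its `(args)` counterpart, so all three ids come from the runtime-maintained context.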
Test plan
- `pre-commit` (check-headers / clang-format / cpplint / clang-tidy / markdownlint-cli2): all green
- `intrinsic.h` files unchanged structurally; warning is a pure C block comment
- `tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/` and `spmd_multiblock_mix/`

🤖 Generated with Claude Code