High performance Paged Attention A2A3 ST Test#899
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces a complete high-performance paged attention kernel for GQA decoding with CPU-sim and hardware implementations, tiling logic, orchestration, and comprehensive testing and benchmarking support across Python and C++. ChangesPaged Attention Highperf
🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
a305ea5 to
4f2365b
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a high-performance SPMD paged attention implementation, including C++ kernels, tiling logic, benchmarks, and correctness tests, alongside compiler updates to dynamically include CANN directories. The review feedback recommends adding defensive input validation in the tiling and kernel code to prevent division-by-zero and null pointer dereferences, checking environment variables in the compilation script, and adding future annotations in the test script.
4f2365b to
621782f
Compare
Fixes hw-native-sys#900 The AICore kernel loader (`simpler_setup/elf_parser.py`) silently dropped `.text._Z*` group sections (out-of-line template instantiations) and `.rela.text*` relocations when extracting a `.text` payload from a `.o`. The unresolved `BL`/`B` targets in `.text` then branched to garbage on device, manifesting as CANN 507018 watchdog timeouts (issue hw-native-sys#831 / PR hw-native-sys#830) or silently-wrong partial output (issue hw-native-sys#900). Both symptoms are extremely hard to root-cause from the runtime error alone. This change is the minimum to keep the next person from repeating that diagnostic loop: a pre-flight scan that fails loud if the `.o` contains `.text._Z*` or any `.rela.text*` entries. The error names the offending sections and points at the `always_inline` kernel-side workaround. The literal-`.text` extraction path is otherwise unchanged — working kernels stay byte-identical (verified against the PA highperf `.o` from PR hw-native-sys#899 and the existing fully-inlined kernels). Loader-side relocation application is a separable follow-up; this PR just closes the silent-failure mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes #900 The AICore kernel loader (`simpler_setup/elf_parser.py`) silently dropped `.text._Z*` group sections (out-of-line template instantiations) and `.rela.text*` relocations when extracting a `.text` payload from a `.o`. The unresolved `BL`/`B` targets in `.text` then branched to garbage on device, manifesting as CANN 507018 watchdog timeouts (issue #831 / PR #830) or silently-wrong partial output (issue #900). Both symptoms are extremely hard to root-cause from the runtime error alone. This change is the minimum to keep the next person from repeating that diagnostic loop: a pre-flight scan that fails loud if the `.o` contains `.text._Z*` or any `.rela.text*` entries. The error names the offending sections and points at the `always_inline` kernel-side workaround. The literal-`.text` extraction path is otherwise unchanged — working kernels stay byte-identical (verified against the PA highperf `.o` from PR #899 and the existing fully-inlined kernels). Loader-side relocation application is a separable follow-up; this PR just closes the silent-failure mode. Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 12
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/aic/paged_attention_highperf.cpp`:
- Around line 146-149: The AICORE path in tensor_data currently returns only the
allocation base (tensor->buffer.addr) and ignores Tensor::start_offset so
sliced/view tensors point to the wrong GM region; update tensor_data (the
__aicore__ function) to add the tensor->start_offset to the returned pointer
(i.e., return the buffer base plus start_offset) matching the __CPU_SIM behavior
and ensuring the pointer arithmetic uses the correct byte offset/type
conversions.
- Around line 88-100: The code assumes a uniform seq_len via blocks_per_batch =
key_t->shapes[0] / batch and then clamps block_col against max_blocks_per_query;
instead derive the sequence length per query from the per-query
block-table/metadata and use that per-batch-item when iterating. Specifically,
replace use of the global blocks_per_batch/seq_len with a per-query blocks_count
(e.g., read the valid block count or end index for each b from the block-table
metadata or a companion array) and compute seq_len = blocks_count * block_size
for that b, and when computing block_col or indexing block_table[b *
max_blocks_per_query + block_col] clamp against that per-query blocks_count
rather than the global max_blocks_per_query so heterogeneous paged contexts use
their own valid lengths.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/bench_pa_performance.py`:
- Around line 212-216: The benchmark measures asymmetric scopes: CustomPARunner
pre-allocates tiling/workspace/output before timing but forward_incre_flash /
npu_incre_flash_attention is timed end-to-end, so the reported speedup is
misleading; fix by making scopes comparable—either move the CustomPARunner setup
(tiling/workspace/output allocation) into the timed loop so
benchmark_with_events measures the same per-iteration work as
forward_incre_flash, or pre-allocate and reuse the IFA path's workspace/outputs
so benchmark_with_events only times the core kernel call for
forward_incre_flash; alternatively, if you intend to show amortized steady-state
for CustomPARunner, explicitly relabel ms_custom as an amortized/steady-state
metric and document that the setup is excluded. Ensure you reference
CustomPARunner, benchmark_with_events, forward_incre_flash, and
npu_incre_flash_attention when making the change so both paths time equivalent
scopes.
- Around line 64-66: pack_kv_to_paged currently assumes L is an exact multiple
of block_size and will later fail with a generic view() error; add an explicit
guard at the start of pack_kv_to_paged to check if L % block_size != 0 and raise
a clear ValueError (or AssertionError) that includes L and block_size (and
optionally nkv) in the message so unsupported non-block-aligned kv lengths fail
fast and are easy to debug.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/compile.sh`:
- Around line 4-16: Add an explicit preflight check for ASCEND_TOOLKIT_HOME at
the top of compile.sh (immediately after SCRIPT_DIR is set) to fail fast with a
clear error: test that the environment variable ASCEND_TOOLKIT_HOME is set and
non-empty (e.g., using parameter expansion or [ -z "${ASCEND_TOOLKIT_HOME:-}"
]), print a descriptive message to stderr, and exit 1 if it is unset so the
subsequent bisheng invocation and all -I"${ASCEND_TOOLKIT_HOME}/..." includes
cannot accidentally trigger a cryptic failure via set -u.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_accuracy.py`:
- Around line 166-217: Hoist all non-launch work out of run_custom: compute
make_pa_nd_decode_tiling(...) once on setup, allocate workspace buffers returned
by workspace_sizes(...) (the ws_buf logic), pre-create o = torch.zeros(...),
null = empty_buf(device), and materialize bt_npu = bt.to(device) in a new setup
function (or return a prepared "launch_state"); then change run_custom to accept
that prepared state and only perform the synchronizations and call _launch(lib,
eff_bd, stream, q, k_page, v_page, bt_npu, null, o, s_gm, p_gm, o_tmp_gm, go_gm,
o_core, l_gm, gm_k16, gm_v16, tiling). Apply the same hoisting refactor to the
analogous block referenced at lines 222-288 so the timed kernel launch does not
rebuild tiling or reallocate buffers.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py`:
- Around line 260-261: indices is a sorted_rank -> original_index mapping, but
the kernel needs the inverse (original_index -> sorted_rank) when writing
per-sequence metadata (e.g., slot 13 and the block at the same area referenced
by indices between the earlier mentioned range); compute an inverse permutation
(call it inv_indices) of length batch such that inv_indices[original_index] =
sorted_rank, and use inv_indices wherever you currently index by indices to
write per-sequence entries (including the slot 13 write and the other writes in
the block covering the same region), ensuring all per-sequence metadata is
associated with the original batch index.
- Around line 429-452: workspace_sizes currently ignores the batch parameter
causing l and o_core_tmp to under-allocate relative to tiling offsets (see
make_pa_nd_decode_tiling which grows addr_l and addr_ofd per sequence). Fix
workspace_sizes by incorporating batch into the computation of o_core and l_size
(e.g. multiply existing o_core and l_size formulas by batch or otherwise scale
them by the number of sequences per batch), keep the existing int(...) and
SPLITKV_RATIO usage and preserve the max(16, ...) guards for "o_core_tmp" and
"l". This ensures "l" and "o_core_tmp" (referenced in make_pa_nd_decode_tiling
via addr_l/addr_ofd) are sized to cover per-batch growth and prevents OOB writes
in split-KV cases.
- Around line 310-317: The tiling_key computation can emit values
(128/129/144/145) that pa_entry.cce doesn't handle, causing the kernel to fall
through; modify the is_split_block logic in pa_tiling.py so it never sets the
high-bit used to produce 128+ values when the kernel path doesn't support them
(i.e., force is_split_block = 0 or add a guard that clears is_split_block for
the block_size/head_dim/head_dim_v case), then recompute tiling_key =
(is_split_block << 7) + (is_split_key << 4) + type_key so only keys 0,1,16,17
are produced and dispatched by the kernel.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/README.md`:
- Around line 3-7: The README’s run instructions omit required environment vars;
update README.md (the usage block near the bash commands) to document and show
exporting ASCEND_TOOLKIT_HOME before invoking compile.sh, and mention any other
env vars used by compile.sh or the Python scripts (e.g., ASCEND_TOOLKIT_HOME is
required for compile.sh and any runtime vars needed by pa_accuracy.py /
bench_pa_performance.py); include a short example export line and a note that
users must set the path to their Ascend toolkit installation.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py`:
- Around line 27-32: Add an explicit fast-fail check for non-divisible KV
lengths before the block packing logic: verify that seq_len % block_size == 0
(where seq_len is derived from k_dense and block_size is used to compute
num_blocks) and raise a clear error (e.g., ValueError or assert) with a message
explaining the block-size contract so that invalid inputs to the
_pack_kv_to_paged()/k_page packing sequence fail with an informative error
instead of a later reshape/view error.
- Around line 59-61: The test assumes equal mapping of Q heads to KV heads but
never checks divisibility; before computing heads_per_kv (after extracting
num_heads and num_kv_heads from q.shape and k_page), add an assertion that
num_heads % num_kv_heads == 0 (or raise a clear error) to ensure heads_per_kv =
num_heads // num_kv_heads is integral and avoid out‑of‑bounds or incorrect head
mapping in the subsequent logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 716ecad4-10f3-4752-b279-699b28233f0a
📒 Files selected for processing (13)
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/.gitignoretests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/README.mdtests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/aic/paged_attention_highperf.cpptests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/bench_pa_performance.pytests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/compile.shtests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_entry.ccetests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.ccetests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/orchestration/paged_attention_highperf_orch.cpptests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_accuracy.pytests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.pytests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/paged_attention_wrapper.cpptests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/tiling/pa_tiling_struct.htests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py
…y intrinsics simpler's tensormap_and_ringbuffer runtime maintains its own SPMD context (block_idx, block_num, sub_block_id) in LocalContext / GlobalContext structures referenced from the kernel args[] tail. The CCE built-in intrinsics get_subblockid(), get_block_idx(), get_block_num() (declared in kernel_operator.h / tikcfw) read AICore hardware registers that the runtime does NOT program, so a kernel that mixes them with the args-based accessors gets stale values — most importantly get_subblockid() returns 0 for BOTH AIV0 and AIV1 of every MIX cluster, causing AIV1 to silently redo AIV0's work and leaving AIV1's share of the output unwritten. This was the partial-zero failure mode in issue hw-native-sys#900 / PR hw-native-sys#899 spmd_paged_attention_highperf: a kernel ported from native CANN compiled clean, ran without error, produced half-zero output on a2a3 hardware. Resolved kernel-side in PR hw-native-sys#899 by routing all three IDs through the args-based accessors. Add three layers of documentation so the next port catches this before the same debugging round-trip: - `docs/aicore-kernel-programming.md` (new) — the kernel-author contract for this runtime: SPMD execution context, accessor functions, logical-vs-physical block_dim, the CCE-intrinsics warning with porting checklist, and pointers to working examples. Structured so future kernel-authoring topics (tensor args, FFTS sync, tiling) can grow under it. - `docs/developer-guide.md` — link from the existing Example / Test Layout section so someone reading the dev guide finds the kernel-author contract from "kernels/" without searching. - `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/common/intrinsic.h` — IMPORTANT block at the top of the file with the gotcha inline (for the grep-and-read discovery path) and a back-link to the programming guide for the full context. Doc-only — no code or API changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y intrinsics (#962) simpler's tensormap_and_ringbuffer runtime maintains its own SPMD context (block_idx, block_num, sub_block_id) in LocalContext / GlobalContext structures referenced from the kernel args[] tail. The CCE built-in intrinsics get_subblockid(), get_block_idx(), get_block_num() (declared in kernel_operator.h / tikcfw) read AICore hardware registers that the runtime does NOT program, so a kernel that mixes them with the args-based accessors gets stale values — most importantly get_subblockid() returns 0 for BOTH AIV0 and AIV1 of every MIX cluster, causing AIV1 to silently redo AIV0's work and leaving AIV1's share of the output unwritten. This was the partial-zero failure mode in issue #900 / PR #899 spmd_paged_attention_highperf: a kernel ported from native CANN compiled clean, ran without error, produced half-zero output on a2a3 hardware. Resolved kernel-side in PR #899 by routing all three IDs through the args-based accessors. Add three layers of documentation so the next port catches this before the same debugging round-trip: - `docs/aicore-kernel-programming.md` (new) — the kernel-author contract for this runtime: SPMD execution context, accessor functions, logical-vs-physical block_dim, the CCE-intrinsics warning with porting checklist, and pointers to working examples. Structured so future kernel-authoring topics (tensor args, FFTS sync, tiling) can grow under it. - `docs/developer-guide.md` — link from the existing Example / Test Layout section so someone reading the dev guide finds the kernel-author contract from "kernels/" without searching. - `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/common/intrinsic.h` — IMPORTANT block at the top of the file with the gotcha inline (for the grep-and-read discovery path) and a back-link to the programming guide for the full context. Doc-only — no code or API changes. Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
5f7dce0 to
9adb03d
Compare
Setup:
pip install --no-build-isolation -e '.[test]'Simulation mode:
On a2a3 device:
Accuracy and benchmark scripts (require torch npu 2.9.0):
Benchmark: