High performance Paged Attention A2A3 ST Test #899

Merged: ChaoWao merged 2 commits into hw-native-sys:main from MirkoDeVita98:pr-655-work on Jun 2, 2026

@MirkoDeVita98 (Contributor) commented May 29, 2026

Setup:

pip install --no-build-isolation -e '.[test]'

Simulation mode:

python test_spmd_paged_attention_highperf.py -p a2a3sim

=== Runtime: tensormap_and_ringbuffer  Level: 2 ===
  TestSpmdPagedAttentionHighPerf::b1_h32_kv8_s128_bs128_fp16 ... [2026-06-01 12:11:36.847047][T0xe8316687a100][INFO_V9] run: [aicpu_executor.cpp:665] Thread 3: orch_start=89015794842350988 orch_end=89015794842352265 orch_cost=25.540us
[2026-06-01 12:11:36.847195][T0xe8316687a100][INFO_V9] run: [aicpu_executor.cpp:671] PTO2 total submitted tasks = 1, already executed 0 tasks
[2026-06-01 12:11:37.598466][T0xe8316708a100][INFO_V9] log_l2_perf_summary: [scheduler_cold_path.cpp:383] Thread 1: sched_start=89015794842305279 sched_end=89015794879923141 sched_cost=752357.240us
[2026-06-01 12:11:37.598658][T0xe8316708a100][INFO_V9] log_l2_perf_summary: [scheduler_cold_path.cpp:518] Thread 1: Scheduler summary: total_time=702012.480us, loops=441135, tasks_scheduled=24
[2026-06-01 12:11:37.598473][T0xe8316d0b8100][INFO_V9] log_l2_perf_summary: [scheduler_cold_path.cpp:383] Thread 0: sched_start=89015794842305277 sched_end=89015794879923135 sched_cost=752357.160us
[2026-06-01 12:11:37.598828][T0xe8316d0b8100][INFO_V9] log_l2_perf_summary: [scheduler_cold_path.cpp:518] Thread 0: Scheduler summary: total_time=733236.560us, loops=203564, tasks_scheduled=24
[2026-06-01 12:11:37.598472][T0xe8316c8a8100][INFO_V9] log_l2_perf_summary: [scheduler_cold_path.cpp:383] Thread 2: sched_start=89015794842305274 sched_end=89015794879923144 sched_cost=752357.400us
[2026-06-01 12:11:37.598901][T0xe8316c8a8100][INFO_V9] log_l2_perf_summary: [scheduler_cold_path.cpp:518] Thread 2: Scheduler summary: total_time=737581.000us, loops=160404, tasks_scheduled=0
PASSED

On a2a3 device:

python test_spmd_paged_attention_highperf.py -p a2a3


=== Runtime: tensormap_and_ringbuffer  Level: 2 ===
  TestSpmdPagedAttentionHighPerf::b1_h32_kv8_s128_bs128_fp16 ... PASSED

Accuracy and benchmark scripts (both require torch_npu 2.9.0):

bash compile.sh
python pa_accuracy.py 
Device: npu:0  cube_cores=24

Running fp16 tests (each case in an isolated subprocess):
PASS  b1_h32_kv8_s128_bs128  mean_err=0.00000  max_err=0.00012
PASS  b4_h32_kv8_s512_bs128  mean_err=0.00001  max_err=0.00012
PASS  b2_h8_kv8_s256_bs128  mean_err=0.00001  max_err=0.00024
PASS  b8_h32_kv8_s1024_bs128  mean_err=0.00001  max_err=0.00012
PASS  b1_h32_kv8_s2048_bs128  mean_err=0.00001  max_err=0.00012
PASS  b4_h64_kv8_s1024_bs128  mean_err=0.00001  max_err=0.00012

All fp16 cases PASSED.

Benchmark:

bash compile.sh
python bench_pa_performance.py 
Device: npu:0  cube_cores=24
dtype=torch.float16  warmup=5  iters=20
Standalone lib: /mounted_home/simpler/tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_lib.so
case | API | ms | TFLOP/s | GiB/s | AI (F/B)
(FLOPs = QK+PV matmuls; Bytes = logical Q+K+V+O)
---
Qwen3-0.6B b1 h16/kv8 kv2048 | standalone (pa_lib.so)       |  0.0707 ms |  0.2373 TFLOP/s | 110.5944 GiB/s | AI=1.9980 F/B
Qwen3-0.6B b1 h16/kv8 kv2048 | npu_incre_flash_attention (paged) |  0.0803 ms |  0.2089 TFLOP/s | 97.3634 GiB/s | AI=1.9980 F/B
Qwen3-0.6B b1 h16/kv8 kv2048 | speedup standalone/IFA: 1.136x

Qwen3-1.7B b1 h16/kv8 kv4096 | standalone (pa_lib.so)       |  0.0668 ms |  0.5023 TFLOP/s | 234.0284 GiB/s | AI=1.9990 F/B
Qwen3-1.7B b1 h16/kv8 kv4096 | npu_incre_flash_attention (paged) |  0.0766 ms |  0.4382 TFLOP/s | 204.1666 GiB/s | AI=1.9990 F/B
Qwen3-1.7B b1 h16/kv8 kv4096 | speedup standalone/IFA: 1.146x

Qwen3-4B   b1 h32/kv8 kv2048 | standalone (pa_lib.so)       |  0.0656 ms |  0.5114 TFLOP/s | 119.3092 GiB/s | AI=3.9922 F/B
Qwen3-4B   b1 h32/kv8 kv2048 | npu_incre_flash_attention (paged) |  0.0803 ms |  0.4179 TFLOP/s | 97.4875 GiB/s | AI=3.9922 F/B
Qwen3-4B   b1 h32/kv8 kv2048 | speedup standalone/IFA: 1.224x

Qwen3-8B   b1 h32/kv8 kv4096 | standalone (pa_lib.so)       |  0.0688 ms |  0.9753 TFLOP/s | 227.3128 GiB/s | AI=3.9961 F/B
Qwen3-8B   b1 h32/kv8 kv4096 | npu_incre_flash_attention (paged) |  0.0844 ms |  0.7947 TFLOP/s | 185.2080 GiB/s | AI=3.9961 F/B
Qwen3-8B   b1 h32/kv8 kv4096 | speedup standalone/IFA: 1.227x

Qwen3-8B   b1 h32/kv8 kv8192 | standalone (pa_lib.so)       |  0.0852 ms |  1.5748 TFLOP/s | 366.8297 GiB/s | AI=3.9980 F/B
Qwen3-8B   b1 h32/kv8 kv8192 | npu_incre_flash_attention (paged) |  0.0819 ms |  1.6386 TFLOP/s | 381.7026 GiB/s | AI=3.9980 F/B
Qwen3-8B   b1 h32/kv8 kv8192 | speedup standalone/IFA: 0.961x

Qwen3-14B  b1 h40/kv8 kv2048 | standalone (pa_lib.so)       |  0.0647 ms |  0.6478 TFLOP/s | 120.9565 GiB/s | AI=4.9878 F/B
Qwen3-14B  b1 h40/kv8 kv2048 | npu_incre_flash_attention (paged) |  0.0835 ms |  0.5025 TFLOP/s | 93.8329 GiB/s | AI=4.9878 F/B
Qwen3-14B  b1 h40/kv8 kv2048 | speedup standalone/IFA: 1.289x

Qwen3-32B  b1 h64/kv8 kv2048 | standalone (pa_lib.so)       |  0.0650 ms |  1.0329 TFLOP/s | 120.7157 GiB/s | AI=7.9689 F/B
Qwen3-32B  b1 h64/kv8 kv2048 | npu_incre_flash_attention (paged) |  0.0847 ms |  0.7925 TFLOP/s | 92.6239 GiB/s | AI=7.9689 F/B
Qwen3-32B  b1 h64/kv8 kv2048 | speedup standalone/IFA: 1.303x

MHA        b1 h32/kv32 kv2048 | standalone (pa_lib.so)       |  0.0649 ms |  0.5173 TFLOP/s | 481.9976 GiB/s | AI=0.9995 F/B
MHA        b1 h32/kv32 kv2048 | npu_incre_flash_attention (paged) |  0.0835 ms |  0.4021 TFLOP/s | 374.6541 GiB/s | AI=0.9995 F/B
MHA        b1 h32/kv32 kv2048 | speedup standalone/IFA: 1.287x

Qwen3-8B   b4 h32/kv8 kv2048 | standalone (pa_lib.so)       |  0.0730 ms |  1.8388 TFLOP/s | 428.9653 GiB/s | AI=3.9922 F/B
Qwen3-8B   b4 h32/kv8 kv2048 | npu_incre_flash_attention (paged) |  0.0857 ms |  1.5654 TFLOP/s | 365.1816 GiB/s | AI=3.9922 F/B
Qwen3-8B   b4 h32/kv8 kv2048 | speedup standalone/IFA: 1.175x

Qwen3-8B   b8 h32/kv8 kv2048 | standalone (pa_lib.so)       |  0.0994 ms |  2.7003 TFLOP/s | 629.9310 GiB/s | AI=3.9922 F/B
Qwen3-8B   b8 h32/kv8 kv2048 | npu_incre_flash_attention (paged) |  0.0859 ms |  3.1261 TFLOP/s | 729.2660 GiB/s | AI=3.9922 F/B
Qwen3-8B   b8 h32/kv8 kv2048 | speedup standalone/IFA: 0.864x

Qwen3-8B  b16 h32/kv8 kv2048 | standalone (pa_lib.so)       |  0.1191 ms |  4.5083 TFLOP/s | 1051.7117 GiB/s | AI=3.9922 F/B
Qwen3-8B  b16 h32/kv8 kv2048 | npu_incre_flash_attention (paged) |  0.1044 ms |  5.1425 TFLOP/s | 1199.6680 GiB/s | AI=3.9922 F/B
Qwen3-8B  b16 h32/kv8 kv2048 | speedup standalone/IFA: 0.877x

Qwen3-8B  b32 h32/kv8 kv2048 | standalone (pa_lib.so)       |  0.2431 ms |  4.4177 TFLOP/s | 1030.5784 GiB/s | AI=3.9922 F/B
Qwen3-8B  b32 h32/kv8 kv2048 | npu_incre_flash_attention (paged) |  0.2576 ms |  4.1686 TFLOP/s | 972.4755 GiB/s | AI=3.9922 F/B
Qwen3-8B  b32 h32/kv8 kv2048 | speedup standalone/IFA: 1.060x

Qwen3-8B  b64 h32/kv8 kv2048 | standalone (pa_lib.so)       |  0.4810 ms |  4.4650 TFLOP/s | 1041.6288 GiB/s | AI=3.9922 F/B
Qwen3-8B  b64 h32/kv8 kv2048 | npu_incre_flash_attention (paged) |  0.5307 ms |  4.0468 TFLOP/s | 944.0560 GiB/s | AI=3.9922 F/B
Qwen3-8B  b64 h32/kv8 kv2048 | speedup standalone/IFA: 1.103x
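
The table header defines FLOPs as the QK and PV matmuls and bytes as the logical Q+K+V+O traffic. A minimal sketch of how those columns can be reproduced is given here. It assumes `head_dim=128` and 2-byte fp16 elements, which are not stated in the output but do reproduce the reported AI values; `pa_decode_metrics` is a hypothetical helper, not a function from `bench_pa_performance.py`.

```python
def pa_decode_metrics(batch, q_heads, kv_heads, kv_len, ms,
                      head_dim=128, elem_bytes=2):
    """Return (TFLOP/s, GiB/s, arithmetic intensity) for one decode step.

    head_dim and elem_bytes are assumptions, chosen because they match
    the AI column in the benchmark output.
    """
    # QK^T and PV each cost 2 * head_dim FLOPs per (query head, KV position).
    flops = 2 * 2 * batch * q_heads * kv_len * head_dim
    # Q and O hold one token per query head; K and V span the whole KV cache.
    q_o_bytes = 2 * batch * q_heads * head_dim * elem_bytes
    k_v_bytes = 2 * batch * kv_heads * kv_len * head_dim * elem_bytes
    total_bytes = q_o_bytes + k_v_bytes
    seconds = ms * 1e-3
    return flops / seconds / 1e12, total_bytes / seconds / 2**30, flops / total_bytes


# First row of the table: Qwen3-0.6B b1 h16/kv8 kv2048, standalone at 0.0707 ms.
tflops, gib_s, ai = pa_decode_metrics(batch=1, q_heads=16, kv_heads=8,
                                      kv_len=2048, ms=0.0707)
# The speedup column is the IFA time divided by the standalone time.
speedup = 0.0803 / 0.0707
print(f"{tflops:.4f} TFLOP/s  {gib_s:.2f} GiB/s  AI={ai:.4f}  speedup={speedup:.3f}x")
```

Because the input time is the rounded 0.0707 ms, the GiB/s figure lands close to, but not exactly on, the logged value. The AI column depends only on the shape, which is why it is identical for both APIs in every case.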

coderabbitai Bot commented May 29, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0b12b88b-cae8-4cb3-b8c0-fd6842aa0302

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR introduces a complete high-performance paged attention kernel for GQA decoding with CPU-sim and hardware implementations, tiling logic, orchestration, and comprehensive testing and benchmarking support across Python and C++.

Changes

Paged Attention Highperf

| Layer / File(s) | Summary |
|---|---|
| **Tiling data structures and constants**<br>`kernels/tiling/pa_tiling_struct.h` | C++ header defining constexpr limits, enums (TilingKeyType, CalcType, DataShapeType, CompressType, PagedAttnVariant), and struct type aliases (PagedAttentionInfo, AddrOffsets) for tiling configuration. |
| **Paged attention kernel entry and dispatch**<br>`kernels/aic/paged_attention_highperf.cpp`, `kernels/kernel/pa_entry.cce`, `kernels/paged_attention_wrapper.cpp` | CPU-sim fallback using half-precision helpers and scalar GQA loops; hardware path mapping GM pointers and delegating to paged_attention_mask_body with dtype/split dispatch; CCE entry that loads tiling key and selects CUBE/VEC implementations; C++ wrapper exposing get_ffts_info and call_kernel for FFTS control and kernel launch. |
| **Task orchestration and dispatch**<br>`kernels/orchestration/paged_attention_highperf_orch.cpp` | Exports aicpu_orchestration_config (16-arg count) and aicpu_orchestration_entry that parses block_dim, registers tensors with input/inout roles, and submits mixed-kernel tasks via rt_submit_task. |
| **Tiling parameter computation**<br>`kernels/pa_tiling.py` | Implements make_pa_nd_decode_tiling to compute effective block dimensions, select (batch, head) or (batch, head, KV-seq) split strategies, encode offset fields, and return int32 tiling tensor; includes workspace_sizes to derive per-scratch byte allocations. |
| **Numerical correctness testing**<br>`kernels/pa_accuracy.py` | Loads pa_lib.so via ctypes, packs dense KV into paged blocks, runs custom kernel with workspace caching and NPU synchronization, compares against torch_npu.npu_incre_flash_attention reference, and manages per-case subprocess isolation with environment-driven NPU device selection. |
| **Performance benchmarking**<br>`kernels/bench_pa_performance.py` | Compares custom kernel against NPU API using FLOP and GiB/s metrics, implements CustomPARunner to precompute and launch tiling/workspaces, measures runtime via NPU events, and reports throughput and speedup across predefined Qwen/MHA-like shapes with CLI controls for device/dtype/iterations. |
| **Scene test case and golden validation**<br>`test_spmd_paged_attention_highperf.py` | SceneTestCase wiring orchestration + AIC kernel metadata, generates randomized Q/K/V with paged KV caches and tiling/workspace tensors, computes scalar GQA golden output by iterating batches/heads/blocks, and validates device results. |
| **Build system and documentation**<br>`kernels/compile.sh`, `kernels/.gitignore`, `kernels/README.md` | Bash script using bisheng with dav-2201 arch and ASCEND toolkit includes; README documenting compilation, accuracy, and performance benchmark commands with reference result tables and source command; .gitignore excluding pa_lib.so. |
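
For readers unfamiliar with the "scalar GQA golden output by iterating batches/heads/blocks" mentioned in the table, the following is a minimal single-sequence sketch in plain Python lists. It is not the code from `test_spmd_paged_attention_highperf.py`: the function name, the page layout `[block][slot][kv_head][dim]`, and the rule that consecutive query heads share one KV head are all assumptions made for illustration.

```python
import math


def paged_gqa_decode(q, k_pages, v_pages, block_table, context_len, block_size):
    """Scalar reference for one decode step of one sequence.

    q:           [num_heads][head_dim]
    k_pages:     [num_blocks][block_size][num_kv_heads][head_dim]
    v_pages:     same layout as k_pages
    block_table: logical block index -> physical block id
    context_len: number of valid KV positions (must be > 0)
    """
    num_heads, head_dim = len(q), len(q[0])
    num_kv_heads = len(k_pages[0][0])
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group = num_heads // num_kv_heads
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for h in range(num_heads):
        kv_h = h // group  # assumed grouping: consecutive Q heads share a KV head
        scores, values = [], []
        for pos in range(context_len):
            block = block_table[pos // block_size]  # logical -> physical page
            slot = pos % block_size
            k = k_pages[block][slot][kv_h]
            scores.append(scale * sum(a * b for a, b in zip(q[h], k)))
            values.append(v_pages[block][slot][kv_h])
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / denom
                    for d in range(head_dim)])
    return out


# Two query heads, one KV head, two pages stored in swapped physical order.
q = [[1.0, 0.0], [0.0, 1.0]]
k_pages = [[[[0.0, 0.0]], [[0.0, 0.0]]], [[[0.0, 0.0]], [[0.0, 0.0]]]]
v_pages = [[[[1.0, 0.0]], [[3.0, 0.0]]], [[[5.0, 0.0]], [[7.0, 0.0]]]]
out = paged_gqa_decode(q, k_pages, v_pages, block_table=[1, 0],
                       context_len=4, block_size=2)
print(out)
```

With all-zero keys every position gets the same weight, so each head returns the mean of the four value vectors regardless of the page order in `block_table`.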

🎯 4 (Complex) | ⏱️ ~60 minutes

A rabbit hops through layers of attention,
Paging through KV with high-perf intention,
AICORE spins fast, while CPU-sim lends grace,
Tiling and testing keep everything in place! 🐰✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.19% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | No pull request description was provided by the author, making it impossible to assess whether the description relates to the changeset. | Add a description explaining the purpose, scope, and key components of the paged attention implementation being added. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title 'High performance Paged Attention A2A3 ST Test' accurately describes the main changeset, which adds comprehensive test infrastructure and kernel implementations for a high-performance paged attention mechanism for the A2A3 platform's system test suite. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a high-performance SPMD paged attention implementation, including C++ kernels, tiling logic, benchmarks, and correctness tests, alongside compiler updates to dynamically include CANN directories. The review feedback recommends adding defensive input validation in the tiling and kernel code to prevent division-by-zero and null pointer dereferences, checking environment variables in the compilation script, and adding future annotations in the test script.

hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request Jun 1, 2026
Fixes hw-native-sys#900

The AICore kernel loader (`simpler_setup/elf_parser.py`) silently
dropped `.text._Z*` group sections (out-of-line template
instantiations) and `.rela.text*` relocations when extracting a `.text`
payload from a `.o`. The unresolved `BL`/`B` targets in `.text` then
branched to garbage on device, manifesting as CANN 507018 watchdog
timeouts (issue hw-native-sys#831 / PR hw-native-sys#830) or silently-wrong partial output
(issue hw-native-sys#900). Both symptoms are extremely hard to root-cause from the
runtime error alone.

This change is the minimum to keep the next person from repeating that
diagnostic loop: a pre-flight scan that fails loud if the `.o`
contains `.text._Z*` or any `.rela.text*` entries. The error names the
offending sections and points at the `always_inline` kernel-side
workaround. The literal-`.text` extraction path is otherwise unchanged
— working kernels stay byte-identical (verified against the PA
highperf `.o` from PR hw-native-sys#899 and the existing fully-inlined kernels).

Loader-side relocation application is a separable follow-up; this PR
just closes the silent-failure mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChaoWao added a commit that referenced this pull request Jun 1, 2026
Fixes #900

The AICore kernel loader (`simpler_setup/elf_parser.py`) silently
dropped `.text._Z*` group sections (out-of-line template
instantiations) and `.rela.text*` relocations when extracting a `.text`
payload from a `.o`. The unresolved `BL`/`B` targets in `.text` then
branched to garbage on device, manifesting as CANN 507018 watchdog
timeouts (issue #831 / PR #830) or silently-wrong partial output
(issue #900). Both symptoms are extremely hard to root-cause from the
runtime error alone.

This change is the minimum to keep the next person from repeating that
diagnostic loop: a pre-flight scan that fails loud if the `.o`
contains `.text._Z*` or any `.rela.text*` entries. The error names the
offending sections and points at the `always_inline` kernel-side
workaround. The literal-`.text` extraction path is otherwise unchanged
— working kernels stay byte-identical (verified against the PA
highperf `.o` from PR #899 and the existing fully-inlined kernels).

Loader-side relocation application is a separable follow-up; this PR
just closes the silent-failure mode.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
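
The pre-flight scan described in this commit message can be pictured with a short sketch. It works on a plain list of section names rather than a parsed ELF file, and `find_unsupported_sections` and `preflight_check` are hypothetical names, not the functions in `simpler_setup/elf_parser.py`.

```python
from fnmatch import fnmatchcase


def find_unsupported_sections(section_names):
    """Return the sections that a literal-.text extraction would drop."""
    offending = []
    for name in section_names:
        # Out-of-line template instantiations land in .text._Z<mangled name>;
        # .rela.text* holds relocations that the extraction does not apply.
        if fnmatchcase(name, ".text._Z*") or fnmatchcase(name, ".rela.text*"):
            offending.append(name)
    return offending


def preflight_check(section_names):
    """Fail loudly instead of loading a kernel with unresolved branches."""
    offending = find_unsupported_sections(section_names)
    if offending:
        raise ValueError(
            "kernel object contains sections the loader cannot handle: "
            + ", ".join(offending)
            + "; consider marking the affected helpers always_inline"
        )


# A fully inlined kernel passes; one with a template instantiation does not.
preflight_check([".text", ".data", ".bss"])
bad = find_unsupported_sections([".text", ".text._Z3fooIiEvT_", ".rela.text"])
print(bad)
```

Checking names before extraction keeps the working path unchanged, which matches the commit's claim that existing kernels stay byte-identical.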
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/aic/paged_attention_highperf.cpp`:
- Around line 146-149: The AICORE path in tensor_data currently returns only the
allocation base (tensor->buffer.addr) and ignores Tensor::start_offset so
sliced/view tensors point to the wrong GM region; update tensor_data (the
__aicore__ function) to add the tensor->start_offset to the returned pointer
(i.e., return the buffer base plus start_offset) matching the __CPU_SIM behavior
and ensuring the pointer arithmetic uses the correct byte offset/type
conversions.
- Around line 88-100: The code assumes a uniform seq_len via blocks_per_batch =
key_t->shapes[0] / batch and then clamps block_col against max_blocks_per_query;
instead derive the sequence length per query from the per-query
block-table/metadata and use that per-batch-item when iterating. Specifically,
replace use of the global blocks_per_batch/seq_len with a per-query blocks_count
(e.g., read the valid block count or end index for each b from the block-table
metadata or a companion array) and compute seq_len = blocks_count * block_size
for that b, and when computing block_col or indexing block_table[b *
max_blocks_per_query + block_col] clamp against that per-query blocks_count
rather than the global max_blocks_per_query so heterogeneous paged contexts use
their own valid lengths.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/bench_pa_performance.py`:
- Around line 212-216: The benchmark measures asymmetric scopes: CustomPARunner
pre-allocates tiling/workspace/output before timing but forward_incre_flash /
npu_incre_flash_attention is timed end-to-end, so the reported speedup is
misleading; fix by making scopes comparable—either move the CustomPARunner setup
(tiling/workspace/output allocation) into the timed loop so
benchmark_with_events measures the same per-iteration work as
forward_incre_flash, or pre-allocate and reuse the IFA path's workspace/outputs
so benchmark_with_events only times the core kernel call for
forward_incre_flash; alternatively, if you intend to show amortized steady-state
for CustomPARunner, explicitly relabel ms_custom as an amortized/steady-state
metric and document that the setup is excluded. Ensure you reference
CustomPARunner, benchmark_with_events, forward_incre_flash, and
npu_incre_flash_attention when making the change so both paths time equivalent
scopes.
- Around line 64-66: pack_kv_to_paged currently assumes L is an exact multiple
of block_size and will later fail with a generic view() error; add an explicit
guard at the start of pack_kv_to_paged to check if L % block_size != 0 and raise
a clear ValueError (or AssertionError) that includes L and block_size (and
optionally nkv) in the message so unsupported non-block-aligned kv lengths fail
fast and are easy to debug.
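
A minimal sketch of the fail-fast guard suggested for `pack_kv_to_paged` follows. The real function operates on tensors and is not shown in this thread, so this list-based stand-in only illustrates where the check goes and what the message might contain.

```python
def pack_kv_to_paged(dense, block_size):
    """Split a dense [kv_len][...] list into [num_blocks][block_size][...] pages.

    List-based stand-in for the tensor version; only the guard is the point.
    """
    kv_len = len(dense)
    if block_size <= 0:
        raise ValueError(f"block_size must be positive, got {block_size}")
    if kv_len % block_size != 0:
        # Report both values so the unsupported shape is obvious at the call site.
        raise ValueError(
            f"kv_len={kv_len} is not a multiple of block_size={block_size}; "
            "non-block-aligned KV lengths are not supported"
        )
    return [dense[i:i + block_size] for i in range(0, kv_len, block_size)]


pages = pack_kv_to_paged(list(range(4)), block_size=2)
print(pages)
```
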

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/compile.sh`:
- Around line 4-16: Add an explicit preflight check for ASCEND_TOOLKIT_HOME at
the top of compile.sh (immediately after SCRIPT_DIR is set) to fail fast with a
clear error: test that the environment variable ASCEND_TOOLKIT_HOME is set and
non-empty (e.g., using parameter expansion or [ -z "${ASCEND_TOOLKIT_HOME:-}"
]), print a descriptive message to stderr, and exit 1 if it is unset so the
subsequent bisheng invocation and all -I"${ASCEND_TOOLKIT_HOME}/..." includes
cannot accidentally trigger a cryptic failure via set -u.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_accuracy.py`:
- Around line 166-217: Hoist all non-launch work out of run_custom: compute
make_pa_nd_decode_tiling(...) once on setup, allocate workspace buffers returned
by workspace_sizes(...) (the ws_buf logic), pre-create o = torch.zeros(...),
null = empty_buf(device), and materialize bt_npu = bt.to(device) in a new setup
function (or return a prepared "launch_state"); then change run_custom to accept
that prepared state and only perform the synchronizations and call _launch(lib,
eff_bd, stream, q, k_page, v_page, bt_npu, null, o, s_gm, p_gm, o_tmp_gm, go_gm,
o_core, l_gm, gm_k16, gm_v16, tiling). Apply the same hoisting refactor to the
analogous block referenced at lines 222-288 so the timed kernel launch does not
rebuild tiling or reallocate buffers.
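
The hoisting refactor requested above could take roughly the following shape. This is a sketch with hypothetical names (`LaunchState`, `prepare_launch`) and stub callables in place of the real tiling, workspace, and launch functions; it shows only the split between one-time setup and the per-iteration launch.

```python
from dataclasses import dataclass


@dataclass
class LaunchState:
    """Everything the kernel launch needs, built once during setup."""
    tiling: list
    workspaces: dict
    output: list


def prepare_launch(batch, num_heads, head_dim, make_tiling, workspace_sizes):
    """One-time setup: compute tiling and allocate all buffers."""
    tiling = make_tiling(batch, num_heads, head_dim)
    sizes = workspace_sizes(batch, num_heads, head_dim)
    workspaces = {name: bytearray(size) for name, size in sizes.items()}
    output = [0.0] * (batch * num_heads * head_dim)
    return LaunchState(tiling, workspaces, output)


def run_custom(state, launch):
    """Per-iteration work: only the launch, reusing the prepared state."""
    launch(state.tiling, state.workspaces, state.output)
    return state.output


# Stubs that count calls, standing in for the real tiling and kernel launch.
calls = {"tiling": 0, "launch": 0}


def fake_tiling(batch, num_heads, head_dim):
    calls["tiling"] += 1
    return [batch, num_heads, head_dim]


def fake_workspace_sizes(batch, num_heads, head_dim):
    return {"s": 16, "p": 16}


def fake_launch(tiling, workspaces, output):
    calls["launch"] += 1


state = prepare_launch(1, 32, 128, fake_tiling, fake_workspace_sizes)
for _ in range(3):
    run_custom(state, fake_launch)
print(calls)
```

The tiling stub runs once while the launch stub runs three times, which is the property a timed loop would rely on.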

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py`:
- Around line 260-261: indices is a sorted_rank -> original_index mapping, but
the kernel needs the inverse (original_index -> sorted_rank) when writing
per-sequence metadata (e.g., slot 13 and the block at the same area referenced
by indices between the earlier mentioned range); compute an inverse permutation
(call it inv_indices) of length batch such that inv_indices[original_index] =
sorted_rank, and use inv_indices wherever you currently index by indices to
write per-sequence entries (including the slot 13 write and the other writes in
the block covering the same region), ensuring all per-sequence metadata is
associated with the original batch index.
- Around line 429-452: workspace_sizes currently ignores the batch parameter
causing l and o_core_tmp to under-allocate relative to tiling offsets (see
make_pa_nd_decode_tiling which grows addr_l and addr_ofd per sequence). Fix
workspace_sizes by incorporating batch into the computation of o_core and l_size
(e.g. multiply existing o_core and l_size formulas by batch or otherwise scale
them by the number of sequences per batch), keep the existing int(...) and
SPLITKV_RATIO usage and preserve the max(16, ...) guards for "o_core_tmp" and
"l". This ensures "l" and "o_core_tmp" (referenced in make_pa_nd_decode_tiling
via addr_l/addr_ofd) are sized to cover per-batch growth and prevents OOB writes
in split-KV cases.
- Around line 310-317: The tiling_key computation can emit values
(128/129/144/145) that pa_entry.cce doesn't handle, causing the kernel to fall
through; modify the is_split_block logic in pa_tiling.py so it never sets the
high-bit used to produce 128+ values when the kernel path doesn't support them
(i.e., force is_split_block = 0 or add a guard that clears is_split_block for
the block_size/head_dim/head_dim_v case), then recompute tiling_key =
(is_split_block << 7) + (is_split_key << 4) + type_key so only keys 0,1,16,17
are produced and dispatched by the kernel.
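
Two of the `pa_tiling.py` findings above are easy to check on paper. The sketch below uses hypothetical helper names and assumes each flag and `type_key` is either 0 or 1, which is consistent with the key values quoted in the finding. It inverts a `sorted_rank -> original_index` permutation and enumerates the tiling keys the quoted formula can emit.

```python
def inverse_permutation(indices):
    """Turn indices[rank] = original_index into inv[original_index] = rank."""
    inv = [0] * len(indices)
    for rank, original in enumerate(indices):
        inv[original] = rank
    return inv


def tiling_key(is_split_block, is_split_key, type_key):
    """Formula quoted in the review comment."""
    return (is_split_block << 7) + (is_split_key << 4) + type_key


# Example: sequences sorted by KV length, longest first (an assumed ordering).
kv_lens = [128, 2048, 512]
indices = sorted(range(len(kv_lens)), key=lambda i: -kv_lens[i])
inv_indices = inverse_permutation(indices)

# Keys the finding says pa_entry.cce dispatches.
SUPPORTED_KEYS = {0, 1, 16, 17}
all_keys = {tiling_key(b, k, t) for b in (0, 1) for k in (0, 1) for t in (0, 1)}
unsupported = sorted(all_keys - SUPPORTED_KEYS)
print(indices, inv_indices, unsupported)
```

Here `indices` and `inv_indices` differ, so indexing per-sequence metadata with the wrong one would attach it to the wrong sequence. The unsupported keys are exactly the four values that require `is_split_block` to be set.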

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/README.md`:
- Around line 3-7: The README’s run instructions omit required environment vars;
update README.md (the usage block near the bash commands) to document and show
exporting ASCEND_TOOLKIT_HOME before invoking compile.sh, and mention any other
env vars used by compile.sh or the Python scripts (e.g., ASCEND_TOOLKIT_HOME is
required for compile.sh and any runtime vars needed by pa_accuracy.py /
bench_pa_performance.py); include a short example export line and a note that
users must set the path to their Ascend toolkit installation.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py`:
- Around line 27-32: Add an explicit fast-fail check for non-divisible KV
lengths before the block packing logic: verify that seq_len % block_size == 0
(where seq_len is derived from k_dense and block_size is used to compute
num_blocks) and raise a clear error (e.g., ValueError or assert) with a message
explaining the block-size contract so that invalid inputs to the
_pack_kv_to_paged()/k_page packing sequence fail with an informative error
instead of a later reshape/view error.
- Around line 59-61: The test assumes equal mapping of Q heads to KV heads but
never checks divisibility; before computing heads_per_kv (after extracting
num_heads and num_kv_heads from q.shape and k_page), add an assertion that
num_heads % num_kv_heads == 0 (or raise a clear error) to ensure heads_per_kv =
num_heads // num_kv_heads is integral and avoid out‑of‑bounds or incorrect head
mapping in the subsequent logic.
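
The head-divisibility check from the last finding can be sketched as a small mapping helper. `kv_head_for` is a hypothetical name, and the rule that consecutive query heads share one KV head is an assumption about the test's layout.

```python
def kv_head_for(q_head, num_heads, num_kv_heads):
    """Map a query head to its shared KV head under grouped-query attention."""
    if num_kv_heads <= 0 or num_heads % num_kv_heads != 0:
        raise ValueError(
            f"num_heads={num_heads} must be a positive multiple of "
            f"num_kv_heads={num_kv_heads}"
        )
    heads_per_kv = num_heads // num_kv_heads
    return q_head // heads_per_kv


# h32/kv8, as in the b1_h32_kv8_s128_bs128 case: four query heads per KV head.
mapping = [kv_head_for(h, 32, 8) for h in range(32)]
print(mapping[:5], mapping[-1])
```

Without the check, a shape such as 10 query heads over 4 KV heads would floor to 2 heads per KV head and send query heads 8 and 9 to a KV head index of 4, which is out of range.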
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 716ecad4-10f3-4752-b279-699b28233f0a

📥 Commits

Reviewing files that changed from the base of the PR and between 22538de and 0964b46.

📒 Files selected for processing (13)
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/.gitignore
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/README.md
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/aic/paged_attention_highperf.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/bench_pa_performance.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/compile.sh
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_entry.cce
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.cce
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/orchestration/paged_attention_highperf_orch.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_accuracy.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/paged_attention_wrapper.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/tiling/pa_tiling_struct.h
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py

hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request Jun 1, 2026
…y intrinsics

simpler's tensormap_and_ringbuffer runtime maintains its own SPMD
context (block_idx, block_num, sub_block_id) in LocalContext /
GlobalContext structures referenced from the kernel args[] tail.
The CCE built-in intrinsics get_subblockid(), get_block_idx(),
get_block_num() (declared in kernel_operator.h / tikcfw) read
AICore hardware registers that the runtime does NOT program, so a
kernel that mixes them with the args-based accessors gets stale
values — most importantly get_subblockid() returns 0 for BOTH
AIV0 and AIV1 of every MIX cluster, causing AIV1 to silently redo
AIV0's work and leaving AIV1's share of the output unwritten.

This was the partial-zero failure mode in issue hw-native-sys#900 / PR hw-native-sys#899
spmd_paged_attention_highperf: a kernel ported from native CANN
compiled clean, ran without error, produced half-zero output on
a2a3 hardware. Resolved kernel-side in PR hw-native-sys#899 by routing all three
IDs through the args-based accessors.

Add three layers of documentation so the next port catches this
before the same debugging round-trip:

- `docs/aicore-kernel-programming.md` (new) — the kernel-author
  contract for this runtime: SPMD execution context, accessor
  functions, logical-vs-physical block_dim, the CCE-intrinsics
  warning with porting checklist, and pointers to working
  examples. Structured so future kernel-authoring topics (tensor
  args, FFTS sync, tiling) can grow under it.
- `docs/developer-guide.md` — link from the existing Example /
  Test Layout section so someone reading the dev guide finds the
  kernel-author contract from "kernels/" without searching.
- `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/common/intrinsic.h`
  — IMPORTANT block at the top of the file with the gotcha
  inline (for the grep-and-read discovery path) and a back-link
  to the programming guide for the full context.

Doc-only — no code or API changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChaoWao added a commit that referenced this pull request Jun 1, 2026
…y intrinsics (#962)

simpler's tensormap_and_ringbuffer runtime maintains its own SPMD
context (block_idx, block_num, sub_block_id) in LocalContext /
GlobalContext structures referenced from the kernel args[] tail.
The CCE built-in intrinsics get_subblockid(), get_block_idx(),
get_block_num() (declared in kernel_operator.h / tikcfw) read
AICore hardware registers that the runtime does NOT program, so a
kernel that mixes them with the args-based accessors gets stale
values — most importantly get_subblockid() returns 0 for BOTH
AIV0 and AIV1 of every MIX cluster, causing AIV1 to silently redo
AIV0's work and leaving AIV1's share of the output unwritten.

This was the partial-zero failure mode in issue #900 / PR #899
spmd_paged_attention_highperf: a kernel ported from native CANN
compiled clean, ran without error, produced half-zero output on
a2a3 hardware. Resolved kernel-side in PR #899 by routing all three
IDs through the args-based accessors.

Add three layers of documentation so the next port catches this
before the same debugging round-trip:

- `docs/aicore-kernel-programming.md` (new) — the kernel-author
  contract for this runtime: SPMD execution context, accessor
  functions, logical-vs-physical block_dim, the CCE-intrinsics
  warning with porting checklist, and pointers to working
  examples. Structured so future kernel-authoring topics (tensor
  args, FFTS sync, tiling) can grow under it.
- `docs/developer-guide.md` — link from the existing Example /
  Test Layout section so someone reading the dev guide finds the
  kernel-author contract from "kernels/" without searching.
- `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/common/intrinsic.h`
  — IMPORTANT block at the top of the file with the gotcha
  inline (for the grep-and-read discovery path) and a back-link
  to the programming guide for the full context.

Doc-only — no code or API changes.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
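
The half-zero symptom described in this commit message can be reproduced with a toy model. The sketch assumes each AIV in a MIX cluster writes the half of the output selected by the sub-block ID it observes. The actual work split inside the kernel is not shown in this thread, and `run_mix_cluster` is a hypothetical name.

```python
def run_mix_cluster(output_len, sub_block_ids):
    """Toy model: each AIV fills the half of the output chosen by its observed ID."""
    out = [0] * output_len
    half = output_len // 2
    for observed_id in sub_block_ids:
        start = observed_id * half
        for i in range(start, start + half):
            out[i] = 1  # stand-in for a computed attention value
    return out


# IDs taken from the runtime's args-based accessors: AIV0 sees 0, AIV1 sees 1.
correct = run_mix_cluster(8, sub_block_ids=[0, 1])
# IDs taken from the unprogrammed hardware register: both AIVs see 0.
stale = run_mix_cluster(8, sub_block_ids=[0, 0])
print(correct, stale)
```

In the stale case the second AIV repeats the first half and the second half stays zero, with no error raised, which is why the failure only shows up as wrong output.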
@MirkoDeVita98 changed the title from "High performance Paged Attention example" to "High performance Paged Attention A2A3 ST Test" on Jun 1, 2026
@ChaoWao ChaoWao merged commit d61dee4 into hw-native-sys:main Jun 2, 2026
15 checks passed