Refactor: a2a3 swimlane AICore-self-identifying records (level=1 bypasses complete_task)#974
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR refactors L2 swimlane profiling to encode full task tokens instead of low-word task IDs in AICore records, decouple buffer rotation from completion paths via explicit signaling and scheduler-loop polling, extend dependency records with per-subslot kernel IDs, and add host-side core-type mapping with a fallback JSON export path for minimal profiling levels. ChangesSwimlane Task Identity, Signaling, and Rotation Refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request optimizes and decouples the L2 swimlane profiling mechanism, particularly for AICORE_TIMING (level 1). It introduces a direct buffer-rotation signaling mechanism (L2SwimlaneAicoreSignal) from AICore to AICPU, allowing level 1 profiling to bypass the expensive AICPU complete_task hot path. It also updates schemas to carry the full 64-bit task_token_raw so that host-side post-processing can resolve task-to-kernel mappings. The reviewer identified two critical issues: a static array (s_observed_aicore_buf_full_seq) is not reset during initialization, which can cause incorrect buffer rotations on subsequent runs, and base_time_cycles remains uninitialized at level 1, leading to timestamp underflow and broken timeline alignments in the exported JSON.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp (1)
745-781:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftAdd AICore rotation polling to host_build_graph at AICORE_TIMING (level=1)
At
L2SwimlaneLevel::AICPU_TIMINGgating blocks insrc/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp(~745-781, ~853-872, ~899-915),l2_swimlane_aicpu_complete_taskis skipped whenl2_swimlane_level < AICPU_TIMING(so level=1 bypasses it). However, AICore-side buffer rotation is decoupled froml2_swimlane_aicpu_complete_taskand is driven byl2_swimlane_aicpu_poll_aicore_rotation(readsaicore_sig.buf_full_seqand callsaicore_rotate;aicore_rotateis only called from that poll). That poll is called from the tensormap_and_ringbuffer scheduler (scheduler_dispatch.cpp), but there is no call to it anywhere insrc/a2a3/runtime/host_build_graph/, so rotations won’t happen during the run—only the end-of-runl2_swimlane_aicpu_flushenqueues the current AICore buffer (capacityPLATFORM_AICORE_BUFFER_SIZE = 1024).
Addl2_swimlane_aicpu_poll_aicore_rotation(...)to the host_build_graph executor’s main loop so level=1 still rotates buffers as AICore fills them.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp` around lines 745 - 781, The host_build_graph AICPU executor loop is not calling l2_swimlane_aicpu_poll_aicore_rotation, so AICore buffer rotations never occur at L2SwimlaneLevel::AICPU_TIMING (level=1); add a call to l2_swimlane_aicpu_poll_aicore_rotation(...) inside the executor main loop (near the existing l2_swimlane_aicpu_complete_task calls in the AICPU handling blocks) guarded by l2_swimlane_level >= L2SwimlaneLevel::AICPU_TIMING so rotations are polled each iteration (use the same core_id/thread_idx/dispatch timestamp context as the surrounding code and ensure dispatch_timestamps_ semantics are preserved).
🧹 Nitpick comments (1)
src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h (1)
138-144: 💤 Low valueConsider replacing magic number with
sizeof(L2SwimlaneActiveHead).The
+ 64offset is correct per the static_assert ABI lock, but usingsizeof(L2SwimlaneActiveHead)would make the relationship explicit and self-documenting if the struct ever changes (the static_assert would catch any mismatch).♻️ Optional refactor
__gm__ L2SwimlaneAicoreSignal *sig = - reinterpret_cast<__gm__ L2SwimlaneAicoreSignal *>(reinterpret_cast<__gm__ uint8_t *>(head) + 64); + reinterpret_cast<__gm__ L2SwimlaneAicoreSignal *>(reinterpret_cast<__gm__ uint8_t *>(head) + sizeof(L2SwimlaneActiveHead));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h` around lines 138 - 144, The code uses a magic offset "+ 64" to compute the L2SwimlaneAicoreSignal pointer from head; replace that hard-coded 64 with sizeof(L2SwimlaneActiveHead) to document the ABI relationship and keep the static_assert meaningful—update the expression that builds sig (the reinterpret_cast sequence referencing head) to add sizeof(L2SwimlaneActiveHead) instead of 64, ensure L2SwimlaneActiveHead is visible/forward-declared in this header so sizeof is available, and leave the existing static_assert and assignment to sig->buf_full_seq (and subsequent dcci/dsb calls) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/platform/shared/host/l2_swimlane_collector.cpp`:
- Around line 900-935: The base_time_cycles fallback causes underflow because
collected AICore records aren't considered when l2_swimlane_level_ ==
L2SwimlaneLevel::AICORE_TIMING; update the base_time_cycles computation to also
scan collected_aicore_records_ and take the minimum start_time (same min logic
used for tagged_records/phase records) when in that mode. Specifically, after
the phase-record base_time calculation (the block that currently computes
base_time_cycles from tagged_records and phase records), add a loop over
collected_aicore_records_ (iterating core_id 0..num_aicore_-1 and each r in
collected_aicore_records_[core_id]) that skips r.start_time==0 and sets
base_time_cycles = std::min(base_time_cycles, r.start_time); this ensures
cycles_to_us(r.start_time - base_time_cycles) cannot underflow in the
AICORE_TIMING fallback.
---
Outside diff comments:
In `@src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp`:
- Around line 745-781: The host_build_graph AICPU executor loop is not calling
l2_swimlane_aicpu_poll_aicore_rotation, so AICore buffer rotations never occur
at L2SwimlaneLevel::AICPU_TIMING (level=1); add a call to
l2_swimlane_aicpu_poll_aicore_rotation(...) inside the executor main loop (near
the existing l2_swimlane_aicpu_complete_task calls in the AICPU handling blocks)
guarded by l2_swimlane_level >= L2SwimlaneLevel::AICPU_TIMING so rotations are
polled each iteration (use the same core_id/thread_idx/dispatch timestamp
context as the surrounding code and ensure dispatch_timestamps_ semantics are
preserved).
---
Nitpick comments:
In `@src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h`:
- Around line 138-144: The code uses a magic offset "+ 64" to compute the
L2SwimlaneAicoreSignal pointer from head; replace that hard-coded 64 with
sizeof(L2SwimlaneActiveHead) to document the ABI relationship and keep the
static_assert meaningful—update the expression that builds sig (the
reinterpret_cast sequence referencing head) to add sizeof(L2SwimlaneActiveHead)
instead of 64, ensure L2SwimlaneActiveHead is visible/forward-declared in this
header so sizeof is available, and leave the existing static_assert and
assignment to sig->buf_full_seq (and subsequent dcci/dsb calls) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5f4fb728-77e3-424a-b130-7e41019a25f3
📒 Files selected for processing (19)
docs/dfx/dep_gen.mdsrc/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.hsrc/a2a3/platform/include/aicpu/dep_gen_collector_aicpu.hsrc/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.hsrc/a2a3/platform/include/common/dep_gen.hsrc/a2a3/platform/include/common/l2_swimlane_profiling.hsrc/a2a3/platform/include/host/l2_swimlane_collector.hsrc/a2a3/platform/shared/aicpu/dep_gen_collector_aicpu.cppsrc/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cppsrc/a2a3/platform/shared/host/l2_swimlane_collector.cppsrc/a2a3/platform/sim/host/device_runner.cppsrc/a2a3/runtime/host_build_graph/aicore/aicore_executor.cppsrc/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/docs/profiling_levels.mdsrc/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
Threads the orchestrator-side kernel_id triple into each DepGenRecord so the host post-processor can resolve (task_id -> kernel) offline. This is the identity-side groundwork for the upcoming swimlane refactor that moves AICore self-identification out of the AICPU complete_task hot path -- the AICore record will carry only timing + task_token_raw, and the viewer will join func_id from deps.json the same way fanout already does. Schema change is in-place: 12B kernel_id[3] steals from _pad0[32], so sizeof(DepGenRecord) stays 2624B and tensors[] stays cache-line aligned at offset 576 (static_asserts unchanged). Replay emits kernel_ids in the JSON tasks[] entries; INVALID_KERNEL_ID (-1) marks inactive subslots. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8a19d8a to
b5c0eac
Compare
Decouple AICore buffer rotation from the AICPU complete_task hot path. Previously every complete_task bumped ac_state->head.total_record_count and, every PLATFORM_AICORE_BUFFER_SIZE-th completion, called aicore_rotate. This trigger required complete_task to run on every FIN, blocking the upcoming AICORE_TIMING (level=1) bypass. New mechanism: AICore writes a fresh seq into a dedicated 64B signal cache line (L2SwimlaneAicoreSignal, sitting between head and free_queue in L2SwimlaneAicoreTaskPool) the moment it emits the last slot of its current buffer. The scheduler main loop calls l2_swimlane_aicpu_poll_aicore_rotation once per iteration -- one rmb + per-owned-core compare -- and rotates whenever the observed seq advances. Separate cache line keeps AICore's dcci CACHELINE_OUT from clobbering AICPU's per-task stats writes to head under software-managed coherence. Side effects: - Pool size 192B -> 256B (head 64 + aicore_sig 64 + free_queue 128); static_asserts updated. Host allocator uses sizeof() so it picks the new size automatically. - AICPU stats counter (total_record_count) still bumps in complete_task so host reconciliation (total == collected + dropped) keeps reporting per-AICore-pool counts at level >= 2. At level=1 (future Phase 2) it will stop bumping; host will surface the gap as silent_loss = 0 if reconcile uses the AICore-pool counter instead, or as the existing AICPU-pool counter otherwise. - Closes the dispatch-boundary race the counter-based trigger had under sub-µs kernels (AICore now signals from the slot write itself, not from a monotonic FIN counter that could fall behind AICore's fill). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es complete_task
Two coupled changes:
1) AICore record schema (L2SwimlaneAicoreTaskRecord) replaces the 32-bit
`task_id` (low bits of the per-core reg dispatch token) with a 64-bit
`task_token_raw` carrying the full PTO2 encoding (ring_id << 32 |
local_id) for tensormap_and_ringbuffer, or the zero-extended task index
for host_build_graph. Record stays 32B half-cache-line. The AICore
helper reads `task_token` straight from
`exec_payload->local_context.async_ctx.task_token.raw` -- already in
AICore cache from the just-completed task, so the new identity comes
at no extra GM load. Layout invariants (sizeof == 32) hold.
2) scheduler_completion + host_build_graph aicpu_executor: gate the entry
into `l2_swimlane_aicpu_complete_task` on
`l2_swimlane_level >= AICPU_TIMING`. At AICORE_TIMING (level=1) the
AICore record alone carries identity; AICPU adds nothing useful and
the per-completion hot path (counter inc, ring lookup, identity store,
wmb) is elided entirely. Phase 3's signal-based AICore rotation makes
this safe -- rotations no longer piggy-back on complete_task.
Host collector: join key moves from per-core `reg_task_id` (low 32) to
the full PTO2 `task_id`. AICore record's `task_token_raw` matches the
AICPU record's `task_id`; both are 64-bit, both are the canonical
identity. The kDirectIndexCap vector fast path goes away -- PTO2 task ids
are sparse high values, so unordered_map is the right structure.
Known follow-up (next commit in this PR): at level=1 there are no AICPU
records, so the host emit loop currently produces zero tasks[] entries.
The Phase 4 commit adds an AICore-records-only path that synthesizes
JSON entries from {task_token_raw, start, end} + dep_gen kernel_ids[]
+ host-side core_id -> core_type derivation, making level=1 useful
end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b5c0eac to
74587da
Compare
Make AICORE_TIMING (level=1) useful end-to-end by synthesizing JSON
tasks[] entries from the AICore record stream alone -- no AICPU records
exist at this level because complete_task is bypassed by Phase 1+2.
Three pieces:
1. set_core_types(types, n) on L2SwimlaneCollector + sim device_runner
call. Sim populates from runtime.workers[i].core_type. Onboard not
wired in this commit (its workers[].core_type is established via
AICore handshake which races with init_l2_swimlane); the existing
level>=2 path on onboard still works because AICPU records carry
core_type directly. Onboard level=1 wiring is a follow-up.
2. AICPU flush path falls back to enqueueing the full buffer
(count=PLATFORM_AICORE_BUFFER_SIZE) at level=1 instead of the
total_record_count-based live count, since complete_task -- where
total_record_count is bumped -- is bypassed at level=1. The host
copy_aicore_buffer already skips trailing slots with start_time==0,
so over-stating count costs only a scan pass, not spurious records.
3. export_swimlane_json: when collected_perf_records_ is empty AND level
== AICORE_TIMING, synthesize one task[] entry per AICore record from
{task_token_raw, start, end} + the published core_types_ table for
core_type. func_id is emitted as -1; swimlane_converter.py joins it
from deps.json by task_id (same pattern fanout already uses).
Verified: at level=1 the JSON has 5 tasks for the 5-task vector_example,
each with the full PTO2 task_id, derived ring_id, correct core_type
("aiv"), and valid timing. level>=2 paths unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ctor - profiling_levels.md: level 1 row now says "AICore timing only, complete_task bypassed"; new paragraph on how host derives func_id (deps.json kernel_ids[]) and core_type (L2SwimlaneCollector::set_core_types) at level 1; describes the buf_full_seq + scheduler poll rotation mechanism. - dep_gen.md: capture call site row notes the new a2a3-only kernel_id[3] field and its role as the swimlane identity bridge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
74587da to
20ba6a2
Compare
Summary
Refactor a2a3 swimlane so AICORE_TIMING (level=1) skips the AICPU
complete_taskhot path entirely. The AICore record self-identifies via the full PTO2task_token_raw(read from the dispatch payload'sLocalContext.async_ctx.task_token— already in AICore cache), AICore buffer rotation is decoupled fromcomplete_taskvia a dedicated signal channel + scheduler-loop poll, and dep_gen + host derivation cover the remaining identity gaps (func_id,core_type).a2a3-only. a5's staging-ring model is different; out of scope here.
Commits
Add: dep_gen captures per-subtask kernel_ids — extends
DepGenRecordwithint32_t kernel_id[3](steals from_pad0[32];sizeofandtensors[]offset unchanged). Orchestrator passes{aic, aiv0, aiv1}_kernel_idtodep_gen_aicpu_record_submit; replay emitskernel_idsindeps.json.Refactor: AICore buffer rotation via self-signal — Adds
L2SwimlaneAicoreSignal64B cacheline toL2SwimlaneAicoreTaskPool(pool 192B → 256B). AICore writesbuf_full_seqwhen emitting last slot; scheduler calls newl2_swimlane_aicpu_poll_aicore_rotationonce per main loop. Removes the piggy-back trigger incomplete_task. Separate cacheline avoids false sharing with AICPU's per-task stats writes.Refactor: AICore record self-identifies + level=1 bypass —
L2SwimlaneAicoreTaskRecordreplaces 32-bittask_idwith 64-bittask_token_raw(still 32B half-cacheline).scheduler_completion+host_build_graph aicpu_executorgate entry intocomplete_taskonlevel >= AICPU_TIMING. Host join key moves from per-corereg_task_idto full PTO2 task id (cleaner, runtime-agnostic).Add: level=1 host emit path (AICore-only synthesis) —
set_core_types()on collector + sim device_runner wiring. AICPU flush falls back to full-buffer count at level=1.export_swimlane_jsonsynthesizestasks[]from AICore records whencollected_perf_records_is empty andlevel == AICORE_TIMING;func_idemitted as-1and joined fromdeps.jsonbyswimlane_converter.pyat post-process time (same model fanout already uses).Doc updates —
profiling_levels.md+docs/dfx/dep_gen.md.Test plan
pytest tests/st/a2a3/tensormap_and_ringbuffer/dfx --platform a2a3sim7/7 passtask_id, derivedring_id, correctcore_type(aiv), valid timing,func_id=-1(to be joined fromdeps.json)task_idis full PTO2, ring_id decoded, all fields populatedtask-submit --device auto --device-num 2; both attempts hithalMemCtl rc=13 (EACCES)during register init before swimlane code is reached (device contention,npu-smi infoshows all devices near 100% from other users). Reviewers should re-run when devices free up.Related
{task_token_raw, dispatch, finish}) for level >= 2 — deferred; this PR keeps the existingL2SwimlaneAicpuTaskRecord. Onboardset_core_typeswiring (handshake-discovered core_type table). a5 swimlane (different staging-ring model; not in scope).🤖 Generated with Claude Code