Fix: detect unresolved AICore .text relocations and fail loud (#900) by hw-native-sys-bot · Pull Request #958 · hw-native-sys/simpler

hw-native-sys-bot · 2026-06-01T07:55:52Z

Summary

The AICore kernel loader (simpler_setup/elf_parser.py::extract_text_section) silently dropped .text._Z* group sections (out-of-line template instantiations) and .rela.text* relocations. The unresolved BL/B targets in .text then branched to garbage on device, manifesting as CANN 507018 watchdog timeouts (issue [Bug] A2A3 triangular inverse AIC kernel times out in tensormap_and_ringbuffer runtime #831 / PR Triangular Inverse Kernel (continuation of Zouzias' impl) #830) or silently-wrong partial output (issue [Bug] A2A3 spmd_paged_attention_highperf hardware run times out or produces partial zero output while a2a3sim passes #900). Both symptoms are hard to root-cause from the runtime error alone.
This PR is the minimum to keep the next person from repeating that diagnostic loop: a pre-flight scan that fails loud if the .o contains .text._Z* or any .rela.text* entries. The error names the offending sections and points at the __attribute__((always_inline)) kernel-side workaround.
Working kernels stay byte-identical — the literal-.text extraction path is unchanged. Verified against the PA highperf .o from PR High performance Paged Attention A2A3 ST Test #899 (97156 / 64804 bytes for AIC / AIV, identical to before) and the existing fully-inlined kernels.

Scope (and what's out)

✅ Detect-and-reject with actionable diagnostic — done here.
❌ Loader-side relocation application (merge .text._Z*, apply R_AARCH64_CALL26/JUMP26) — separable follow-up. Drafted but not landing in this PR to keep the change minimal and reviewable.
⚠️ Issue [Bug] A2A3 spmd_paged_attention_highperf hardware run times out or produces partial zero output while a2a3sim passes #900 also describes a partial-zero output symptom in PR High performance Paged Attention A2A3 ST Test #899's PA test on a2a3 hardware. Investigation in this branch shows the current PA .o does NOT have .text._Z* or .rela.text (every templated method is already marked always_inline), so that partial-zero symptom is not the loader gap. It's a different bug — likely SPMD work-partitioning or FFTS synchronization on the AIV side — and will need its own investigation.

Test plan

tests/ut/py/test_elf_parser.py — 7 cases: byte-identical regression guard for the common path, three rejection cases for .text._Zfoo (raises with issue link, names the offending section, mentions the always_inline workaround), two for .rela.text (raises with issue link, reports entry count), and the missing-.text case.
Local pre-commit: ruff check / ruff format / pyright / check-headers all green.
Real-world .o round-trip: compiled tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/aic/paged_attention_highperf.cpp (both AIC and AIV variants) via kernel_compiler, confirmed extract_text_section returns byte-identical payload (97156 / 64804 bytes).
Future: when the loader gains relocation-application support, drop one of the PA kernel's always_inline annotations and verify the new error fires with actionable message.

🤖 Generated with Claude Code

coderabbitai · 2026-06-01T07:56:03Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 283748ed-48ae-4ce3-8609-f5b5efad9ea5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

ELF loader validation is strengthened to reject .text sections with out-of-line code (.text.*) or unresolved relocations (.rela.text*), instead of silently returning incomplete bytes. New error reporting includes diagnostic details and workaround guidance. Comprehensive test coverage validates success and all rejection scenarios using synthetic ELF64 buffers.

Changes

ELF .text Extraction Validation

Layer / File(s)	Summary
Documentation, constants, and function contracts `simpler_setup/elf_parser.py`, `tests/ut/py/test_elf_parser.py`	Module docstring expanded to document `.text` extraction semantics, rejection of out-of-line code and relocations, and failure modes. ELF64 constants (section-header size, relocation-entry size, section types) introduced. Function docstring refined to explicitly mention out-of-line code rejection.
Core extraction and validation logic `simpler_setup/elf_parser.py`	`_extract_text_elf64` reworked to scan all section headers, capture `.text` bytes, and detect `.text.` out-of-line sections and `.rela.text` relocation sections; raises `ValueError` if any found. New `_raise_unresolved_text_error` helper formats actionable diagnostic aggregating offending section names, sizes, relocation entry counts, issue link, and always-inline workaround with verification commands.
Test infrastructure and ELF buffer builders `tests/ut/py/test_elf_parser.py`	Test module docstring documents expected rejection behavior. Packing helpers (`_pack_ehdr`, `_pack_shdr`, `_strtab`) introduced to assemble ELF headers and string tables. Specialized builders (`_build_single_text_elf`, `_build_elf_with_extra_text_section`, `_build_elf_with_rela_text`) construct synthetic ELF64 byte buffers for each test scenario.
Validation test cases `tests/ut/py/test_elf_parser.py`	`TestSingleTextNoRelocations` asserts successful extraction when `.text` is alone. `TestRejectsOutOfLineCodeSection` asserts `ValueError` when `.text._Z*` section exists, with assertions on issue link, section name, and workaround mention. `TestRejectsTextRelocations` asserts `ValueError` when `.rela.text` entries present, verifying issue link and entry count in error. `TestMissingText` asserts error when `.text` section absent.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 The ELF parser hops with care,
Detecting text in disarray—
No stray sections hiding there,
No relocations to betray!
With builders true and tests so fair,
The loader's crisp and clean today! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: detecting unresolved AICore .text relocations and raising errors instead of silently dropping them, with reference to the issue number.
Description check	✅ Passed	The description thoroughly documents the problem, solution, scope, test plan, and verification approach, all directly related to the changeset of adding detection and rejection of unsupported ELF sections.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request enhances the ELF parser in simpler_setup/elf_parser.py to detect and reject ELF files with out-of-line code sections or unresolved relocations, preventing silent failures on AICore. It also introduces comprehensive unit tests in tests/ut/py/test_elf_parser.py to validate these changes. The review feedback suggests critical security improvements to prevent out-of-bounds reads by adding bounds checks on section header indices, string table offsets, and section data boundaries.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

simpler_setup/elf_parser.py (1)

103-111: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit ELF bounds validation before header/section reads.

Line 108 and Lines 121-126 trust e_shoff/e_shnum/e_shstrndx and per-section offsets without range checks. A truncated or malformed object can fail with low-level unpack/slice errors instead of a clear ValueError tied to source_name.

Proposed fix

 def _extract_text_elf64(elf_data: bytes, source_name: str) -> bytes:
@@
     e_shoff = struct.unpack("<Q", elf_data[40:48])[0]
     e_shnum = struct.unpack("<H", elf_data[60:62])[0]
     e_shstrndx = struct.unpack("<H", elf_data[62:64])[0]
+
+    section_table_end = e_shoff + e_shnum * _SHDR_SIZE
+    if e_shnum == 0 or section_table_end > len(elf_data):
+        raise ValueError(f"Invalid ELF section table in: {source_name}")
+    if e_shstrndx >= e_shnum:
+        raise ValueError(f"Invalid ELF shstrndx in: {source_name}")
@@
     shstr_offset = e_shoff + e_shstrndx * _SHDR_SIZE
+    if shstr_offset + _SHDR_SIZE > len(elf_data):
+        raise ValueError(f"Invalid ELF shstr section header in: {source_name}")
     shstr_sh_offset = struct.unpack("<Q", elf_data[shstr_offset + 24 : shstr_offset + 32])[0]
     shstr_sh_size = struct.unpack("<Q", elf_data[shstr_offset + 32 : shstr_offset + 40])[0]
+    if shstr_sh_offset + shstr_sh_size > len(elf_data):
+        raise ValueError(f"Invalid ELF shstr table range in: {source_name}")
@@
     for i in range(e_shnum):
         section_offset = e_shoff + i * _SHDR_SIZE
+        if section_offset + _SHDR_SIZE > len(elf_data):
+            raise ValueError(f"Invalid ELF section header range in: {source_name}")
@@
         if sh_type == _SHT_PROGBITS and section_name == ".text":
+            if sh_offset + sh_size > len(elf_data):
+                raise ValueError(f"Invalid .text section range in: {source_name}")
             text_data = elf_data[sh_offset : sh_offset + sh_size]

Also applies to: 121-131

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@simpler_setup/elf_parser.py` around lines 103 - 111, Validate ELF buffer
bounds before performing struct.unpack/slices: ensure e_shoff, e_shnum and
e_shstrndx are sane (e.g., e_shoff + e_shnum * _SHDR_SIZE and e_shoff +
e_shstrndx * _SHDR_SIZE are <= len(elf_data)) and that each computed
shstr_offset, shstr_sh_offset and the range
shstr_sh_offset:shstr_sh_offset+shstr_sh_size are within the buffer; similarly,
in the section iteration that uses those values (the code around e_shoff/e_shnum
handling at lines ~121-131), check each section header offset and size before
reading (sh_offset/sh_size) and raise a clear ValueError including source_name
if any computed offset/length would read outside len(elf_data).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/ut/py/test_elf_parser.py`:
- Line 101: The inline comment on the assignment to text_bytes uses the Unicode
multiplication glyph '×' which triggers lint RUF003; update the comment in the
test_elf_parser.py line where text_bytes = b"\xd6\x5f\x03\xc0" * 4 to replace
'×' with plain ASCII 'x' (e.g., "RET x 4; arbitrary payload, never executed") so
the linter no longer complains while keeping the comment meaning intact.

---

Outside diff comments:
In `@simpler_setup/elf_parser.py`:
- Around line 103-111: Validate ELF buffer bounds before performing
struct.unpack/slices: ensure e_shoff, e_shnum and e_shstrndx are sane (e.g.,
e_shoff + e_shnum * _SHDR_SIZE and e_shoff + e_shstrndx * _SHDR_SIZE are <=
len(elf_data)) and that each computed shstr_offset, shstr_sh_offset and the
range shstr_sh_offset:shstr_sh_offset+shstr_sh_size are within the buffer;
similarly, in the section iteration that uses those values (the code around
e_shoff/e_shnum handling at lines ~121-131), check each section header offset
and size before reading (sh_offset/sh_size) and raise a clear ValueError
including source_name if any computed offset/length would read outside
len(elf_data).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee80a5de-15ec-496c-8b22-eac0741e1f7a

📥 Commits

Reviewing files that changed from the base of the PR and between d50fa68 and 78a5062.

📒 Files selected for processing (2)

simpler_setup/elf_parser.py
tests/ut/py/test_elf_parser.py

Fixes hw-native-sys#900 The AICore kernel loader (`simpler_setup/elf_parser.py`) silently dropped `.text._Z*` group sections (out-of-line template instantiations) and `.rela.text*` relocations when extracting a `.text` payload from a `.o`. The unresolved `BL`/`B` targets in `.text` then branched to garbage on device, manifesting as CANN 507018 watchdog timeouts (issue hw-native-sys#831 / PR hw-native-sys#830) or silently-wrong partial output (issue hw-native-sys#900). Both symptoms are extremely hard to root-cause from the runtime error alone. This change is the minimum to keep the next person from repeating that diagnostic loop: a pre-flight scan that fails loud if the `.o` contains `.text._Z*` or any `.rela.text*` entries. The error names the offending sections and points at the `always_inline` kernel-side workaround. The literal-`.text` extraction path is otherwise unchanged — working kernels stay byte-identical (verified against the PA highperf `.o` from PR hw-native-sys#899 and the existing fully-inlined kernels). Loader-side relocation application is a separable follow-up; this PR just closes the silent-failure mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread simpler_setup/elf_parser.py

Comment thread simpler_setup/elf_parser.py

Comment thread simpler_setup/elf_parser.py

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread tests/ut/py/test_elf_parser.py Outdated

hw-native-sys-bot force-pushed the fix/issue-900-aicore-elf-relocations branch from 78a5062 to c4ac0bf Compare June 1, 2026 08:06

ChaoWao approved these changes Jun 1, 2026

View reviewed changes

ChaoWao merged commit 411157e into hw-native-sys:main Jun 1, 2026
30 of 31 checks passed

ChaoWao deleted the fix/issue-900-aicore-elf-relocations branch June 1, 2026 08:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: detect unresolved AICore .text relocations and fail loud (#900)#958

Fix: detect unresolved AICore .text relocations and fail loud (#900)#958
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/issue-900-aicore-elf-relocations

hw-native-sys-bot commented Jun 1, 2026

Uh oh!

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hw-native-sys-bot commented Jun 1, 2026

Summary

Scope (and what's out)

Test plan

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading