Fix: detect unresolved AICore .text relocations and fail loud (#900)#958
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughELF loader validation is strengthened to reject ChangesELF .text Extraction Validation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request enhances the ELF parser in simpler_setup/elf_parser.py to detect and reject ELF files with out-of-line code sections or unresolved relocations, preventing silent failures on AICore. It also introduces comprehensive unit tests in tests/ut/py/test_elf_parser.py to validate these changes. The review feedback suggests critical security improvements to prevent out-of-bounds reads by adding bounds checks on section header indices, string table offsets, and section data boundaries.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
simpler_setup/elf_parser.py (1)
103-111:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd explicit ELF bounds validation before header/section reads.
Line 108 and Lines 121-126 trust
e_shoff/e_shnum/e_shstrndxand per-section offsets without range checks. A truncated or malformed object can fail with low-level unpack/slice errors instead of a clearValueErrortied tosource_name.Proposed fix
def _extract_text_elf64(elf_data: bytes, source_name: str) -> bytes: @@ e_shoff = struct.unpack("<Q", elf_data[40:48])[0] e_shnum = struct.unpack("<H", elf_data[60:62])[0] e_shstrndx = struct.unpack("<H", elf_data[62:64])[0] + + section_table_end = e_shoff + e_shnum * _SHDR_SIZE + if e_shnum == 0 or section_table_end > len(elf_data): + raise ValueError(f"Invalid ELF section table in: {source_name}") + if e_shstrndx >= e_shnum: + raise ValueError(f"Invalid ELF shstrndx in: {source_name}") @@ shstr_offset = e_shoff + e_shstrndx * _SHDR_SIZE + if shstr_offset + _SHDR_SIZE > len(elf_data): + raise ValueError(f"Invalid ELF shstr section header in: {source_name}") shstr_sh_offset = struct.unpack("<Q", elf_data[shstr_offset + 24 : shstr_offset + 32])[0] shstr_sh_size = struct.unpack("<Q", elf_data[shstr_offset + 32 : shstr_offset + 40])[0] + if shstr_sh_offset + shstr_sh_size > len(elf_data): + raise ValueError(f"Invalid ELF shstr table range in: {source_name}") @@ for i in range(e_shnum): section_offset = e_shoff + i * _SHDR_SIZE + if section_offset + _SHDR_SIZE > len(elf_data): + raise ValueError(f"Invalid ELF section header range in: {source_name}") @@ if sh_type == _SHT_PROGBITS and section_name == ".text": + if sh_offset + sh_size > len(elf_data): + raise ValueError(f"Invalid .text section range in: {source_name}") text_data = elf_data[sh_offset : sh_offset + sh_size]Also applies to: 121-131
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@simpler_setup/elf_parser.py` around lines 103 - 111, Validate ELF buffer bounds before performing struct.unpack/slices: ensure e_shoff, e_shnum and e_shstrndx are sane (e.g., e_shoff + e_shnum * _SHDR_SIZE and e_shoff + e_shstrndx * _SHDR_SIZE are <= len(elf_data)) and that each computed shstr_offset, shstr_sh_offset and the range shstr_sh_offset:shstr_sh_offset+shstr_sh_size are within the buffer; similarly, in the section iteration that uses those values (the code around e_shoff/e_shnum handling at lines ~121-131), check each section header offset and size before reading (sh_offset/sh_size) and raise a clear ValueError including source_name if any computed offset/length would read outside len(elf_data).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/ut/py/test_elf_parser.py`:
- Line 101: The inline comment on the assignment to text_bytes uses the Unicode
multiplication glyph '×' which triggers lint RUF003; update the comment in the
test_elf_parser.py line where text_bytes = b"\xd6\x5f\x03\xc0" * 4 to replace
'×' with plain ASCII 'x' (e.g., "RET x 4; arbitrary payload, never executed") so
the linter no longer complains while keeping the comment meaning intact.
---
Outside diff comments:
In `@simpler_setup/elf_parser.py`:
- Around line 103-111: Validate ELF buffer bounds before performing
struct.unpack/slices: ensure e_shoff, e_shnum and e_shstrndx are sane (e.g.,
e_shoff + e_shnum * _SHDR_SIZE and e_shoff + e_shstrndx * _SHDR_SIZE are <=
len(elf_data)) and that each computed shstr_offset, shstr_sh_offset and the
range shstr_sh_offset:shstr_sh_offset+shstr_sh_size are within the buffer;
similarly, in the section iteration that uses those values (the code around
e_shoff/e_shnum handling at lines ~121-131), check each section header offset
and size before reading (sh_offset/sh_size) and raise a clear ValueError
including source_name if any computed offset/length would read outside
len(elf_data).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ee80a5de-15ec-496c-8b22-eac0741e1f7a
📒 Files selected for processing (2)
simpler_setup/elf_parser.pytests/ut/py/test_elf_parser.py
Fixes hw-native-sys#900 The AICore kernel loader (`simpler_setup/elf_parser.py`) silently dropped `.text._Z*` group sections (out-of-line template instantiations) and `.rela.text*` relocations when extracting a `.text` payload from a `.o`. The unresolved `BL`/`B` targets in `.text` then branched to garbage on device, manifesting as CANN 507018 watchdog timeouts (issue hw-native-sys#831 / PR hw-native-sys#830) or silently-wrong partial output (issue hw-native-sys#900). Both symptoms are extremely hard to root-cause from the runtime error alone. This change is the minimum to keep the next person from repeating that diagnostic loop: a pre-flight scan that fails loud if the `.o` contains `.text._Z*` or any `.rela.text*` entries. The error names the offending sections and points at the `always_inline` kernel-side workaround. The literal-`.text` extraction path is otherwise unchanged — working kernels stay byte-identical (verified against the PA highperf `.o` from PR hw-native-sys#899 and the existing fully-inlined kernels). Loader-side relocation application is a separable follow-up; this PR just closes the silent-failure mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
78a5062 to
c4ac0bf
Compare
Summary
simpler_setup/elf_parser.py::extract_text_section) silently dropped.text._Z*group sections (out-of-line template instantiations) and.rela.text*relocations. The unresolvedBL/Btargets in.textthen branched to garbage on device, manifesting as CANN 507018 watchdog timeouts (issue [Bug] A2A3 triangular inverse AIC kernel times out intensormap_and_ringbufferruntime #831 / PR Triangular Inverse Kernel (continuation of Zouzias' impl) #830) or silently-wrong partial output (issue [Bug] A2A3spmd_paged_attention_highperfhardware run times out or produces partial zero output whilea2a3simpasses #900). Both symptoms are hard to root-cause from the runtime error alone..ocontains.text._Z*or any.rela.text*entries. The error names the offending sections and points at the__attribute__((always_inline))kernel-side workaround..textextraction path is unchanged. Verified against the PA highperf.ofrom PR High performance Paged Attention A2A3 ST Test #899 (97156 / 64804 bytes for AIC / AIV, identical to before) and the existing fully-inlined kernels.Scope (and what's out)
.text._Z*, applyR_AARCH64_CALL26/JUMP26) — separable follow-up. Drafted but not landing in this PR to keep the change minimal and reviewable.spmd_paged_attention_highperfhardware run times out or produces partial zero output whilea2a3simpasses #900 also describes a partial-zero output symptom in PR High performance Paged Attention A2A3 ST Test #899's PA test on a2a3 hardware. Investigation in this branch shows the current PA.odoes NOT have.text._Z*or.rela.text(every templated method is already markedalways_inline), so that partial-zero symptom is not the loader gap. It's a different bug — likely SPMD work-partitioning or FFTS synchronization on the AIV side — and will need its own investigation.Test plan
tests/ut/py/test_elf_parser.py— 7 cases: byte-identical regression guard for the common path, three rejection cases for.text._Zfoo(raises with issue link, names the offending section, mentions thealways_inlineworkaround), two for.rela.text(raises with issue link, reports entry count), and the missing-.textcase..oround-trip: compiledtests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/aic/paged_attention_highperf.cpp(both AIC and AIV variants) viakernel_compiler, confirmedextract_text_sectionreturns byte-identical payload (97156 / 64804 bytes).always_inlineannotations and verify the new error fires with actionable message.🤖 Generated with Claude Code