fix(c-parser): deterministic include resolution + call-graph precision/recall #84
Open
gadievron wants to merge 1 commit into
c/call_graph_builder built include_map by basename/suffix match into an unordered
set and then resolved calls by taking the first element of that set. Separately, the
regex fallback and call-name extraction emitted several classes of false or missing
edges. This change to the C call-graph builder fixes both the non-deterministic
include resolution and the resolver's precision/recall.
Deterministic, path-anchored include resolution:
Two compounding faults:
(1) a bare endswith(inc) over-matched any tail (include "x.h" -> "src/prefix-x.h")
and every same-basename header repo-wide;
(2) first-match over the unordered set made the resolved callee depend on set
iteration order, so it flipped across PYTHONHASHSEED values.
Fix: require a path-component boundary (other_file == inc or
other_file.endswith('/'+inc)) and iterate sorted(included_files) for a stable
lexicographic tiebreak.
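A minimal sketch of the boundary check and sorted tiebreak described above. The function name resolve_include and the sample paths are hypothetical and not taken from the patch; it also assumes paths are already normalized to forward slashes.

```python
from typing import Iterable, Optional


def resolve_include(inc: str, candidate_files: Iterable[str]) -> Optional[str]:
    """Return the first candidate (in sorted order) that `inc` names, else None.

    Hypothetical helper: a candidate matches only if it equals `inc` or ends
    with "/" + inc, so the match always starts at a path-component boundary.
    """
    for other_file in sorted(candidate_files):  # stable lexicographic tiebreak
        if other_file == inc or other_file.endswith("/" + inc):
            return other_file
    return None


# Illustrative paths only: "src/prefix-x.h" no longer matches "x.h", and the
# two real candidates are tried in sorted order regardless of set iteration.
files = {"src/prefix-x.h", "lib/x.h", "app/x.h"}
resolved = resolve_include("x.h", files)
```

With a bare endswith(inc), "src/prefix-x.h" would also have qualified, and the winner among the three would have depended on set iteration order.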
Resolver precision/recall:
- _extract_calls_regex scanned raw code, so call-shaped tokens inside // or /* */
comments or "..." / '...' literals became phantom edges. Blank out comments and
string/char literals (length/newline preserving) before the regex scan.
- obj->cb() (field_expression callee) was reduced to the bare field name and
resolved against the global free-function index, wiring the member/function-pointer
call to an unrelated free function. _extract_call_name now declines (returns None)
for field_expression -> no false edge. The resolver has no receiver-type model, so
name-only binding of a member call is wrong.
- A function passed by name as a callback argument (qsort(..., my_cmp),
pthread_create(..., worker, ...), signal(2, &handler)) produced no edge because
only the call's 'function' child was inspected. Scan the 'arguments' child for
bare-identifier / &name args that resolve to a known function and emit the
caller->callback edge. Non-function identifiers do not resolve, so data args
create no edge.
- The unique-name (and included-header and prototype) fallbacks could return a static
(file-local) function from another translation unit, violating C internal linkage.
The new _is_visible_from guard requires a cross-file candidate to be non-static;
same-file definitions stay visible. is_static and file_path are already in the
extractor output.
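The comment/literal blanking from the first bullet can be sketched as below. This is an illustration under stated assumptions, not the code in the patch: the names blank_comments_and_literals, _TOKEN and CALL_RE are hypothetical, and the sketch ignores corner cases such as backslash-continued // comments and unterminated literals.

```python
import re

# One alternation, scanned left to right, so a quote inside a comment (or a
# "//" inside a string) is consumed by whichever construct started first.
_TOKEN = re.compile(
    r"""
      //[^\n]*                # line comment
    | /\*.*?\*/               # block comment
    | "(?:\\.|[^"\\\n])*"     # string literal
    | '(?:\\.|[^'\\\n])*'     # char literal
    """,
    re.DOTALL | re.VERBOSE,
)


def _blank(match):
    # Replace every character except newlines with a space, so offsets and
    # line numbers in the blanked text still line up with the original.
    return re.sub(r"[^\n]", " ", match.group(0))


def blank_comments_and_literals(code):
    """Return `code` with comments and string/char literals blanked out."""
    return _TOKEN.sub(_blank, code)


# Hypothetical call-shaped pattern standing in for the regex fallback.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")

src = (
    "int f(void) {\n"
    "  // old: legacy_init();\n"
    '  puts("call helper() now");\n'
    "  return real_work();\n"
    "}\n"
)
cleaned = blank_comments_and_literals(src)
names_raw = CALL_RE.findall(src)
names_clean = CALL_RE.findall(cleaned)
```

On this sample, legacy_init and helper appear in names_raw but not in names_clean, while puts and real_work survive; the blanked text keeps the same length and line count as the input.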
Out of scope (no code change): recording unresolved/ambiguous direct-name calls,
field-pointer derefs, and tbl[i]() subscript-expression callees into an
indirect_calls store. That sink does not exist in any of the parsers (0 producers /
0 consumers), and building it end-to-end is not attempted here. Dropping these calls
emits no false edge (the safe outcome).
Tests (hermetic): tests/parsers/c/test_c_include_map_determinism.py (2; over-match
+ {F_bar,F_foo} flip across PYTHONHASHSEED 0..9) and
tests/parsers/c/test_c_call_resolution_precision.py (9). Full suite: 187 passed,
63 skipped (no regression). ruff check -> All checks passed.
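For readers unfamiliar with seed-sweep checks, the sketch below shows one way to probe order dependence across PYTHONHASHSEED 0..9. It is not the test shipped in this PR: the helper pick_under_seed and the two expressions are hypothetical, and each seed runs in a fresh interpreter because the hash seed is fixed at process start.

```python
import os
import subprocess
import sys


def pick_under_seed(seed, expr):
    """Evaluate `expr` in a fresh interpreter started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", f"print({expr})"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return out.stdout.strip()


# Old behaviour: first element of an unordered set (may vary with the seed).
FIRST_OF_SET = "next(iter({'F_bar', 'F_foo'}))"
# New behaviour: first element after sorting (independent of the seed).
FIRST_OF_SORTED = "sorted({'F_bar', 'F_foo'})[0]"

unordered = {pick_under_seed(s, FIRST_OF_SET) for s in range(10)}
ordered = {pick_under_seed(s, FIRST_OF_SORTED) for s in range(10)}
```

The sorted variant yields a single value for every seed, whereas the first-of-set variant is free to return either name depending on the seed.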
Scope: C-only (grep -c 'include_map\[.*\] = set()': c=1, python/php/ruby/zig=0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>