fix(c-parser): deterministic include resolution + call-graph precision/recall #84

Open
gadievron wants to merge 1 commit into master from
fix/c-parser-deterministic-include-resolution-call-graph-precision

@gadievron
Collaborator

c/call_graph_builder built include_map by basename/suffix match into an unordered
set, then resolved calls by first-of-set, and the regex fallback / call-name
extraction emitted several classes of false or missing edges. This repairs both
the include-resolution determinism and the resolver precision/recall in one change
to the C call-graph builder.

Deterministic, path-anchored include resolution:
Two compounding faults:
(1) bare endswith(inc) over-matched any tail (include "x.h" -> "src/prefix-x.h")
and every same-basename header repo-wide;
(2) first-match over the unordered set made the resolved callee depend on set
iteration order -> flipped across PYTHONHASHSEED.
Fix: require a path-component boundary (other_file == inc or endswith('/'+inc))
and iterate sorted(included_files) for a stable lexicographic tiebreak.
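
The boundary rule and the sorted tiebreak described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `match_includes` and `all_files` are hypothetical names, and the real builder works on its own `include_map` structure.

```python
from typing import Iterable, List


def match_includes(inc: str, all_files: Iterable[str]) -> List[str]:
    """Return candidate files for `#include "inc"` in a stable order.

    Hypothetical helper: a file matches only if it equals `inc` or ends
    with "/" + inc, so the match must start at a path-component boundary.
    Iterating over sorted() makes the order independent of set hashing.
    """
    return [f for f in sorted(all_files) if f == inc or f.endswith("/" + inc)]


files = {"src/prefix-x.h", "src/x.h", "lib/x.h", "x.h"}

# "src/prefix-x.h" ends with "x.h" but not with "/x.h", so it is rejected.
print(match_includes("x.h", files))
# A longer include path narrows the candidates to one header.
print(match_includes("src/x.h", files))
```

With a bare `endswith(inc)` the first call would also return `src/prefix-x.h`, which is the over-match described in fault (1).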

Resolver precision/recall:

  • _extract_calls_regex scanned raw code, so call-shaped tokens inside // or /* */
    comments or "..." / '...' literals became phantom edges. Blank out comments and
    string/char literals (length/newline preserving) before the regex scan.

  • obj->cb() (field_expression callee) was reduced to the bare field name and
    resolved against the global free-function index, wiring the member/function-pointer
    call to an unrelated free function. _extract_call_name now declines (returns None)
    for field_expression -> no false edge. The resolver has no receiver-type model, so
    name-only binding of a member call is wrong.

  • A function passed by name as a callback argument (qsort(..., my_cmp),
    pthread_create(..., worker, ...), signal(2, &handler)) produced no edge because
    only the call's 'function' child was inspected. Scan the 'arguments' child for
    bare-identifier / &name args that resolve to a known function and emit the
    caller->callback edge. Non-function identifiers do not resolve, so data args
    create no edge.

  • The unique-name (and included-header and prototype) fallbacks returned a static
    (file-local) function in another translation unit, violating C internal linkage.
    New _is_visible_from guard requires a cross-file candidate to be non-static;
    same-file definitions stay visible. is_static and file_path are already in the
    extractor output.
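
As an illustration of the first bullet above, here is one way to blank comments and string/char literals while preserving length and newlines before a regex scan. It is a sketch under assumptions: `blank_comments_and_literals` and `CALL_RE` are hypothetical names, and the pattern ignores corner cases such as backslash-continued `//` comments; the actual `_extract_calls_regex` may differ.

```python
import re

# Comments and literals, tried at the leftmost position so that a quote
# inside a comment (or "//" inside a string) belongs to the outer token.
_SKIP = re.compile(
    r'//[^\n]*'                # line comment
    r'|/\*.*?\*/'              # block comment, may span lines
    r'|"(?:\\.|[^"\\\n])*"'    # string literal with escapes
    r"|'(?:\\.|[^'\\\n])*'",   # char literal with escapes
    re.DOTALL,
)

# Hypothetical call-shaped token pattern: identifier followed by "(".
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def blank_comments_and_literals(code: str) -> str:
    """Replace every non-newline character of each match with a space."""
    return _SKIP.sub(lambda m: re.sub(r"[^\n]", " ", m.group(0)), code)


src = '// old_init();\nputs("call fake()");\n/* gone(); */ real();\n'
blanked = blank_comments_and_literals(src)

print(CALL_RE.findall(src))      # raw scan includes phantom names
print(CALL_RE.findall(blanked))  # only calls in real code remain
```

Because the blanked text has the same length and line breaks as the input, any offsets or line numbers derived from the regex matches still point at the original source.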

Out of scope (no code change): recording unresolved/ambiguous direct-name calls,
field-pointer derefs, and tbl[i]() subscript-expression callees into an
indirect_calls store. That sink does not exist in any of the parsers (0 producers /
0 consumers), and building it end-to-end is out of scope. Dropping these calls emits no
false edge (the safe outcome).

Tests (hermetic): tests/parsers/c/test_c_include_map_determinism.py (2; over-match
+ {F_bar,F_foo} flip across PYTHONHASHSEED 0..9) and
tests/parsers/c/test_c_call_resolution_precision.py (9). Full suite: 187 passed,
63 skipped (no regression). ruff check -> All checks passed.
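
For readers unfamiliar with the PYTHONHASHSEED dependence, the following sketch shows the general shape of such a check. It is not the test file from this PR: `CHILD` and `first_elements` are hypothetical, and it only demonstrates that a sorted pick stays stable while the raw set pick is free to vary with the seed.

```python
import os
import subprocess
import sys

# Child program: print the first element by raw set order and by sorted order.
CHILD = "s = {'F_bar', 'F_foo'}; print(next(iter(s)), sorted(s)[0])"


def first_elements(seed: int) -> tuple:
    """Run CHILD in a fresh interpreter with a fixed string-hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    ).stdout.split()
    return out[0], out[1]


results = [first_elements(seed) for seed in range(10)]
raw_firsts = {raw for raw, _ in results}
sorted_firsts = {stable for _, stable in results}

print("raw set order picks:", sorted(raw_firsts))
print("sorted order picks:", sorted(sorted_firsts))
```

A fresh subprocess per seed is needed because the hash seed is fixed at interpreter start and cannot be changed from inside a running process.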

Scope: C-only (grep -c 'include_map\[.*\] = set()': c=1, python/php/ruby/zig=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
