fix(parsers/python): segment-match path exclusion/classification + resolve relative-import anchors #90

Open
gadievron wants to merge 2 commits into master from fix/parsers-python-segment-match-path-exclusion-classification-resolve

@gadievron
Collaborator

Three independent defects in parsers/python/function_extractor.py:

  1. extract_all(): the no-args scan excluded files with
    `any(excl in str(file_path) for excl in [...])` -- an unanchored substring test on the full path, so
    a file whose path merely contains a token ('myvenv/keep.py' contains 'venv') was silently dropped,
    and an ancestor directory containing a token could exclude the whole scan. Now matches whole path
    SEGMENTS: `{tokens} & set(file_path.relative_to(repo_path).parts)`. Python's own token set
    (`__pycache__`/`.git`/`venv`/`.venv`/`node_modules`) is preserved.

  2. classify_function(): classification used `'<token>' in path_lower` substring tests, so
    'interviews/api.py' was classified 'view_function'. 'view_function' is in
    entry_point_detector.ENTRY_POINT_TYPES (:26-32), so that misclassification became a false entry-point
    seed that cascades into false reachability (consumed at entry_point_detector.py:177). The 'views'
    token now matches a whole path segment via a new `_path_has_segment` helper. The 'middleware' token is
    given the same segment fix because it shares the substring defect, but note 'middleware' (the Python
    label) is NOT in ENTRY_POINT_TYPES -- so that half is classification accuracy, not a reachability
    change. The 'test' classifier is left as a substring on purpose (test-file conventions use 'tests/'
    and `test_*`/`*_test` forms that a segment match would miss; 'test' is not an entry-point type, so it
    seeds no false reachability).

  3. extract_imports(): the ast.ImportFrom branch read node.module but never node.level, so relative
    imports lost their package anchor ('from . import X' stored bare 'X'; 'from ..pkg import Y' stored
    anchor-less 'pkg.Y'). call_graph_builder._resolve_import then rebuilt a wrong/no file path and the
    edges were dropped (verified: pre-fix the candidate resolves to None, post-fix it resolves to the real
    pkg/sub/helpers.py). Now reconstructs the absolute anchor from the importing file's package location
    (level=1 -> own package, level=2 -> parent, ...); over-deep levels degrade to no leading dot. Absolute
    imports (level=0) are unchanged.
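
The anchor reconstruction in item 3 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `absolute_import_names` and its `rel_file` parameter are hypothetical names, and the importing file's path is assumed to be given relative to the repo root in POSIX form.

```python
import ast
from pathlib import PurePosixPath


def absolute_import_names(source: str, rel_file: str) -> list:
    """Return dotted names for each `from ... import ...` statement in source.

    Hypothetical helper: rel_file is the importing file's repo-relative POSIX path.
    """
    # Package of the importing file, e.g. 'pkg/sub/mod.py' -> ['pkg', 'sub'].
    package_parts = list(PurePosixPath(rel_file).parent.parts)
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ImportFrom):
            continue
        anchor = []
        if node.level > 0:
            # level=1 -> own package, level=2 -> parent package, and so on.
            drop = node.level - 1
            if drop <= len(package_parts):
                anchor = package_parts[: len(package_parts) - drop]
            # Over-deep levels fall through with an empty anchor.
        module_parts = node.module.split(".") if node.module else []
        for alias in node.names:
            names.append(".".join(anchor + module_parts + [alias.name]))
    return names


SOURCE = "from . import helpers\nfrom ..other import Y\nfrom os import path\n"
resolved = absolute_import_names(SOURCE, "pkg/sub/mod.py")
```

With the importing file at `pkg/sub/mod.py`, `from . import helpers` becomes `pkg.sub.helpers`, which is the shape a resolver needs in order to find `pkg/sub/helpers.py`; the absolute `from os import path` is left as `os.path`.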

Scope: the php/ruby function_extractor.py extract_all + classify siblings carry related defects and are
not widened here.
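
For reference, the substring-vs-segment difference behind items 1 and 2 can be reproduced in isolation. The sketch below uses the token set quoted above; the function names and the example paths are made up for illustration, and `path_has_segment` is only a stand-in for the PR's `_path_has_segment`, whose real body is not shown here.

```python
from pathlib import PurePosixPath

# Token set as listed in the description above.
EXCLUDED_SEGMENTS = {"__pycache__", ".git", "venv", ".venv", "node_modules"}


def excluded_by_substring(file_path: PurePosixPath) -> bool:
    """Pre-fix shape: unanchored substring test on the full path."""
    return any(token in str(file_path) for token in EXCLUDED_SEGMENTS)


def excluded_by_segment(file_path: PurePosixPath, repo_path: PurePosixPath) -> bool:
    """Post-fix shape: a token must equal a whole segment below the repo root."""
    return bool(EXCLUDED_SEGMENTS & set(file_path.relative_to(repo_path).parts))


def path_has_segment(path_lower: str, segment: str) -> bool:
    """Illustrative stand-in: True only if segment is a whole path component."""
    return segment in PurePosixPath(path_lower).parts


# Hypothetical layout: the repo lives under an ancestor whose name contains 'venv'.
repo = PurePosixPath("/home/dev/venv-experiments/repo")
keep = repo / "myvenv" / "keep.py"
vendored = repo / "venv" / "lib" / "site.py"
plain = repo / "src" / "main.py"
```

Under the substring test every file in this layout is excluded, including `src/main.py`, because the ancestor directory name contains 'venv'. The segment test only looks at the parts below `repo_path`, so it drops `venv/lib/site.py` and keeps the other two.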

Tests: tests/test_python_function_extractor.py -- loads the module under a unique importlib name (the
bare 'function_extractor' name is shared by five other parsers, so a plain import would pollute
sys.modules for the rest of the suite). Three checks: segment-vs-substring exclusion, entry-point
classification by segment, and relative-import anchor reconstruction. RED 3 failed (pre-fix) -> GREEN 3
passed; full suite 179 passed / 63 skipped.
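
The unique-name loading mentioned above can be done with the standard `importlib.util` machinery. A minimal sketch, assuming a helper called `load_under_unique_name` (a hypothetical name, not necessarily what the test file uses):

```python
import importlib.util
import sys
from pathlib import Path


def load_under_unique_name(path: Path, unique_name: str):
    """Load a source file as a module registered only under unique_name."""
    spec = importlib.util.spec_from_file_location(unique_name, path)
    module = importlib.util.module_from_spec(spec)
    # Register under the unique name so the bare file name
    # ('function_extractor') never becomes a key in sys.modules.
    sys.modules[unique_name] = module
    spec.loader.exec_module(module)
    return module
```

A test would call it along the lines of `load_under_unique_name(Path("parsers/python/function_extractor.py"), "python_function_extractor_under_test")`, so other parsers' modules with the same file name are unaffected.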

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…solve relative-import anchors

…Windows CI

The segment-vs-substring exclusion regression test recorded the processed path
via str(Path.relative_to(...)), which yields backslash separators on Windows
and fails the forward-slash 'in seen' assertions. Use .as_posix() so the
comparison is OS-independent. The substring-over-exclusion assertions are
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
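
The separator difference described in the second commit can be reproduced on any OS with the pure path classes. This is a small sketch with made-up paths, not the test from this PR:

```python
from pathlib import PureWindowsPath

# PureWindowsPath applies Windows path semantics regardless of the host OS.
win_rel = PureWindowsPath(r"C:\repo\myvenv\keep.py").relative_to(r"C:\repo")

as_str = str(win_rel)          # uses backslash separators
as_posix = win_rel.as_posix()  # always uses forward slashes

seen = {as_posix}
```

Recording `as_posix()` is what lets a forward-slash membership check such as `'myvenv/keep.py' in seen` hold on both Windows and POSIX runners.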