Rewrite JS parser and Segmenter:#255
Conversation
- Replaced esprima-based JavaScript segmenter with tree-sitter for reliable parsing of modern JS syntax (optional chaining, nullish coalescing, top-level await) - Fixed JS function name extraction: keyword filtering, position-aware matching, redundant pattern removal, generator/TypeScript/anonymous-export support - Added build-artifact filtering (should_skip) that excludes app-level dist/, build/static/, .min.js while preserving node_modules/*/dist/ as legitimate third-party source - Added empty-name guards in CCA BFS to prevent documents with unextractable function names from entering call-chain analysis - Fixed _get_function_calls regex to detect calls through optional chaining (obj?.method()) Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
|
/test-heavy |
1 similar comment
|
/test-heavy |
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
…S parser
- Replace 3 catastrophic regex patterns (P4/P9/P11) in get_function_name with
safe arrow_header check + simple name extraction, eliminating 12-38s hangs
- Add arrow_header extension for destructured params like ({a,b}) => {body}
where first '{' is in params, with guard to skip non-arrow functions
- Use negative lookahead =(?!>) in P9 to distinguish assignment from arrow
- Fix should_skip to only match top-level exclude dirs (dist/, build/) not
nested ones (node_modules/*/dist/)
- Add thread-safe double-checked locking for tree-sitter language init
- Refactor class/method extraction to handle anonymous classes, generators,
getters/setters, and export patterns
- Fix is_function to use class\b word boundary instead of class\s+[\w$]+
- Strip optional chaining '?' from _get_function_calls results
- Guard empty function names in CCA and create_map_of_local_vars with
ValueError handling
- Add direct export detection (export function/const) in is_function_exported
- Add 387 JS parser tests and 139 JS segmenter tests covering backtracking
regression, arrow functions, anonymous classes, and should_skip filtering
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
zvigrinberg
left a comment
There was a problem hiding this comment.
@tmihalac Looks very good in overall, performance wise, and supporting better a modern JS ESM Syntax.
Please see some minor comments.
| _JS_KEYWORDS = frozenset({ | ||
| 'if', 'for', 'while', 'switch', 'catch', 'with', 'do', 'else', | ||
| 'return', 'throw', 'new', 'delete', 'typeof', 'void', | ||
| 'try', 'finally', 'function', 'class', 'extends', | ||
| 'break', 'continue', 'var', 'let', 'const', 'yield', 'await', | ||
| 'debugger', 'instanceof', | ||
| }) | ||
|
|
There was a problem hiding this comment.
@tmihalac Not sure if it's critical, but anyway, You're missing here the following reserved keywords in js:
case
default
export
false
true
import
in
null
super
this
There was a problem hiding this comment.
These keywords (default, case, export,
import, in, super, this) are deliberately excluded from _JS_KEYWORDS because
they're valid method names in JS (e.g., commander.default(), queryBuilder.in(),
serializer.export()).
There was a problem hiding this comment.
@tmihalac these are not valid function names in javascript, only as methods under class.
But because it's cascading to it after checking functions with regex so i would say it is OK....
But it's also applies for some of the keywords in _JS_KEYWORDS set
Welcome to Node.js v22.22.2.
Type ".help" for more information.
> class Test {
... constructor(arg1, arg2) {
... this.arg1 = arg1;
... this.arg2 = arg2;
... }
...
... export() {
... return `This is a ${this.arg1} ${this.arg2}.`;
... }
...
... delete() {
... return "delete can be defined as method name" ;
... }
...
... with() {
... return "with can be defined as method name";
... }
...
... }
undefined
>
> let test = new Test(1,2)
undefined
> test.with()
'with can be defined as method name'
> test.delete()
'delete can be defined as method name'
> There was a problem hiding this comment.
So what do you suggest ?
There was a problem hiding this comment.
I Would suggest to remove from that set keywords that possible as method names.
… guards - Fix _build_class_hierarchy regex to include $ in identifier char class ([\w$]+) - Pass code_documents to recursive _get_function_calls alias resolution calls - Add empty-name guard in CCA BFS to skip documents with unextractable function names - Add try/except + empty guard in function_name_extractor for robust name extraction - Use [-1] instead of [1] for get_package_names to handle variable-length returns - Simplify should_skip with direct index checks instead of loop - Add regex comments describing direct-call and callback-reference patterns - Add 8 tests: $-sign class hierarchy (6) and chained alias resolution (2) - Fix test assertions: generator star spacing, empty-string guard, jQuery dollar detection Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
|
||
| def filter_docs_by_func_pkg_name(self, function_name: str, package_name: str, documents: list[Document]) -> list[Document]: | ||
| if not function_name: | ||
| return [] |
There was a problem hiding this comment.
what case it solve? anonymous functions?
There was a problem hiding this comment.
it returns empty
string for anonymous functions or raises ValueError. The guard skips
documents with empty names to avoid pointless iteration and to
prevent a false match if function_name also happens to be empty (''
== '').
There was a problem hiding this comment.
why do we parse those empty name functions?
cant we remove them earlier ?
There was a problem hiding this comment.
This guard is on the search parameter function_name (what we're
looking for), not on the documents themselves. If function_name is
empty, page_content.__contains__('') is True for every document, so
we'd return all documents as matches — a false positive flood. The
guard short-circuits that.
Anonymous functions with empty names in the document list are a
separate concern — they're filtered by the empty-name guards in
__find_initial_function and __find_caller_function_dfs.
There was a problem hiding this comment.
I got it, and here I think it is a good solution
I am asking about the segmenter part
if we see a function that doesn't have name, why wont we just skip her?
do you think it is too much to change right now?
There was a problem hiding this comment.
The segmenter could skip them by checking if the tree-sitter node has
no name child, but it's a raw code extractor — it doesn't have
function-naming logic. The cleaner place would be the parser's
is_function returning False for anonymous exports, but that also
affects simplify_code. Doable but has ripple effects
result in __find_initial_function Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
| @@ -2194,18 +2326,18 @@ def test_is_exportable_function(parser, function_name, file_content, expected, d | |||
| True, | |||
| ), | |||
| ( | |||
There was a problem hiding this comment.
that means that we will miss some imported packages
do we discover them in other way?
There was a problem hiding this comment.
Can you be more specific, what is this comment related to?
There was a problem hiding this comment.
to the following two changes (test result for dynamic import changed from True to False)
There was a problem hiding this comment.
The expected result changed from True to False because of the
empty-identifier guard added at the top of is_package_imported.
Previously, an empty identifier with a matching package name reached
the dynamic import handler (line 867-878), where the else: return
True branch fired — meaning "package is dynamically imported, and
since no specific symbol was requested, that's enough." That was a
false positive: is_package_imported answers "is this symbol imported
from this package?", not "is this package referenced anywhere." With
an empty identifier, we can't confirm any specific symbol is
imported, so False is correct.
For import('lodash').then(mod => mod.template()), the mod binding is
local to the .then callback and not traceable by the parser —
returning True was giving CCA false confidence about an import chain
it couldn't verify.
| # ============================================================================ | ||
|
|
||
| def test_get_function_calls_empty_name_returns_empty(parser): | ||
| """_get_function_calls must return [] when callee_function_name is empty (Fix 3a).""" |
There was a problem hiding this comment.
My bad, will remove that
| def test_is_package_imported_empty_identifier_returns_false(parser): | ||
| """is_package_imported must return False when identifier is empty (Fix 3e).""" | ||
| code = "import { template } from 'lodash';\nconst x = require('express');" | ||
| assert parser.is_package_imported(code, '', 'lodash') is False |
There was a problem hiding this comment.
again that means that we will miss some imported packages
do we discover them in other way?
There was a problem hiding this comment.
identifier here is the function/symbol name (e.g. template), not the
package name — is_package_imported(code, 'template', 'lodash') checks
whether template is imported from lodash. When identifier is empty,
there's no symbol to look for in import statements, so False is
correct.
In practice, is_package_imported is called from
search_for_called_function where identifier is extracted from call
expressions found in the caller's code (e.g. parts[0] from template()
or parts[-2] from lodash.template()). The CCA also guards against
empty function names earlier — __find_caller_functions_bfs (line
229-231) returns None if get_function_name is empty, so
search_for_called_function is never reached with an empty callee
name. The guard on is_package_imported is a defensive inner check.
There was a problem hiding this comment.
I am not sure I understood your point
it is make sense to return True if the package is imported like the cases here
what if we want to search for lodash. template, it will return that lodash is not imported
what is the problem you wanted to solve here? do you have an example?
actually I should post the question in the logic changes and not in the test, because the question is on the logic itself
| result = parser.is_package_imported(code_content, identifier, callee_package) | ||
| assert result == expected, f"{description}: expected {expected}, got {result}" | ||
| def test_get_function_name_bare_assignment_function(parser, code, expected_name): | ||
| """Fix B: Bare assignment to function expression (no const/let/var).""" |
There was a problem hiding this comment.
Removed the confusing Fix comment
- Add comment explaining _JS_KEYWORDS filters control-flow false positives, not invalid identifiers - Remove all "Fix X", "Fix A-I", and "Pattern N" references from test comments and docstrings - Replace internal references with behavioral descriptions of what each test verifies - Add warning log when skipping documents with empty function names in CCA __find_initial_function Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
/test-heavy |
| @pytest.mark.parametrize("function_content,expected_name", [ | ||
| # Object methods now use //(class: name) pattern | ||
| # ============================================================================ | ||
| # Tests for Fix C: Bare assignment arrow |
There was a problem hiding this comment.
you have some more Fix X comments, can you please clean them?
|
/test-heavy |
Fixes https://redhat.atlassian.net/browse/APPENG-5453