Rewrite JS parser and Segmenter: by tmihalac · Pull Request #255 · RHEcosystemAppEng/vulnerability-analysis

tmihalac · 2026-06-17T15:01:39Z

Replaced esprima-based JavaScript segmenter with tree-sitter for reliable parsing of modern JS syntax (optional chaining, nullish coalescing, top-level await)
- Fixed JS function name extraction: keyword filtering, position-aware matching, redundant pattern removal, generator/TypeScript/anonymous-export support
- Added build-artifact filtering (should_skip) that excludes app-level dist/, build/static/, .min.js while preserving node_modules/*/dist/ as legitimate third-party source
- Added empty-name guards in CCA BFS to prevent documents with unextractable function names from entering call-chain analysis
- Fixed _get_function_calls regex to detect calls through optional chaining (obj?.method())

Fixes https://redhat.atlassian.net/browse/APPENG-5453

- Replaced esprima-based JavaScript segmenter with tree-sitter for reliable parsing of modern JS syntax (optional chaining, nullish coalescing, top-level await) - Fixed JS function name extraction: keyword filtering, position-aware matching, redundant pattern removal, generator/TypeScript/anonymous-export support - Added build-artifact filtering (should_skip) that excludes app-level dist/, build/static/, .min.js while preserving node_modules/*/dist/ as legitimate third-party source - Added empty-name guards in CCA BFS to prevent documents with unextractable function names from entering call-chain analysis - Fixed _get_function_calls regex to detect calls through optional chaining (obj?.method()) Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

vbelouso · 2026-06-17T15:02:03Z

✅ Snyk checks have passed. No issues have been found so far.

Status	Scan Engine	Critical	High	Medium	Low	Total (0)
✅	Code Security	0	0	0	0	0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

tmihalac · 2026-06-18T12:33:26Z

/test-heavy

tmihalac · 2026-06-18T18:06:27Z

/test-heavy

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

…S parser - Replace 3 catastrophic regex patterns (P4/P9/P11) in get_function_name with safe arrow_header check + simple name extraction, eliminating 12-38s hangs - Add arrow_header extension for destructured params like ({a,b}) => {body} where first '{' is in params, with guard to skip non-arrow functions - Use negative lookahead =(?!>) in P9 to distinguish assignment from arrow - Fix should_skip to only match top-level exclude dirs (dist/, build/) not nested ones (node_modules/*/dist/) - Add thread-safe double-checked locking for tree-sitter language init - Refactor class/method extraction to handle anonymous classes, generators, getters/setters, and export patterns - Fix is_function to use class\b word boundary instead of class\s+[\w$]+ - Strip optional chaining '?' from _get_function_calls results - Guard empty function names in CCA and create_map_of_local_vars with ValueError handling - Add direct export detection (export function/const) in is_function_exported - Add 387 JS parser tests and 139 JS segmenter tests covering backtracking regression, arrow functions, anonymous classes, and should_skip filtering Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

zvigrinberg

@tmihalac Looks very good in overall, performance wise, and supporting better a modern JS ESM Syntax.

Please see some minor comments.

zvigrinberg · 2026-06-22T05:01:12Z

+_JS_KEYWORDS = frozenset({
+    'if', 'for', 'while', 'switch', 'catch', 'with', 'do', 'else',
+    'return', 'throw', 'new', 'delete', 'typeof', 'void',
+    'try', 'finally', 'function', 'class', 'extends',
+    'break', 'continue', 'var', 'let', 'const', 'yield', 'await',
+    'debugger', 'instanceof',
+})
+


@tmihalac Not sure if it's critical, but anyway, You're missing here the following reserved keywords in js:

case
default
export
false
true
import
in
null
super
this

These keywords (default, case, export,
import, in, super, this) are deliberately excluded from _JS_KEYWORDS because
they're valid method names in JS (e.g., commander.default(), queryBuilder.in(),
serializer.export()).

@tmihalac these are not valid function names in javascript, only as methods under class.
But because it's cascading to it after checking functions with regex so i would say it is OK....
But it's also applies for some of the keywords in _JS_KEYWORDS set

Welcome to Node.js v22.22.2. Type ".help" for more information. > class Test { ... constructor(arg1, arg2) { ... this.arg1 = arg1; ... this.arg2 = arg2; ... } ... ... export() { ... return `This is a ${this.arg1} ${this.arg2}.`; ... } ... ... delete() { ... return "delete can be defined as method name" ; ... } ... ... with() { ... return "with can be defined as method name"; ... } ... ... } undefined > > let test = new Test(1,2) undefined > test.with() 'with can be defined as method name' > test.delete() 'delete can be defined as method name' >

So what do you suggest ?

I Would suggest to remove from that set keywords that possible as method names.

… guards - Fix _build_class_hierarchy regex to include $ in identifier char class ([\w$]+) - Pass code_documents to recursive _get_function_calls alias resolution calls - Add empty-name guard in CCA BFS to skip documents with unextractable function names - Add try/except + empty guard in function_name_extractor for robust name extraction - Use [-1] instead of [1] for get_package_names to handle variable-length returns - Simplify should_skip with direct index checks instead of loop - Add regex comments describing direct-call and callback-reference patterns - Add 8 tests: $-sign class hierarchy (6) and chained alias resolution (2) - Fix test assertions: generator star spacing, empty-string guard, jQuery dollar detection Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

TamarW0 · 2026-06-22T10:04:25Z


    def filter_docs_by_func_pkg_name(self, function_name: str, package_name: str, documents: list[Document]) -> list[Document]:
+        if not function_name:
+            return []


what case it solve? anonymous functions?

it returns empty string for anonymous functions or raises ValueError. The guard skips documents with empty names to avoid pointless iteration and to prevent a false match if function_name also happens to be empty ('' == '').

why do we parse those empty name functions?
cant we remove them earlier ?

This guard is on the search parameter function_name (what we're looking for), not on the documents themselves. If function_name is empty, page_content.__contains__('') is True for every document, so we'd return all documents as matches — a false positive flood. The guard short-circuits that. Anonymous functions with empty names in the document list are a separate concern — they're filtered by the empty-name guards in __find_initial_function and __find_caller_function_dfs.

I got it, and here I think it is a good solution
I am asking about the segmenter part
if we see a function that doesn't have name, why wont we just skip her?
do you think it is too much to change right now?

The segmenter could skip them by checking if the tree-sitter node has
no name child, but it's a raw code extractor — it doesn't have
function-naming logic. The cleaner place would be the parser's
is_function returning False for anonymous exports, but that also
affects simplify_code. Doable but has ripple effects

result in __find_initial_function Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

TamarW0 · 2026-06-22T11:20:39Z

@@ -2194,18 +2326,18 @@ def test_is_exportable_function(parser, function_name, file_content, expected, d
        True,
    ),
    (


that means that we will miss some imported packages
do we discover them in other way?

Can you be more specific, what is this comment related to?

to the following two changes (test result for dynamic import changed from True to False)

The expected result changed from True to False because of the
empty-identifier guard added at the top of is_package_imported.
Previously, an empty identifier with a matching package name reached
the dynamic import handler (line 867-878), where the else: return
True branch fired — meaning "package is dynamically imported, and
since no specific symbol was requested, that's enough." That was a
false positive: is_package_imported answers "is this symbol imported
from this package?", not "is this package referenced anywhere." With
an empty identifier, we can't confirm any specific symbol is
imported, so False is correct.

For import('lodash').then(mod => mod.template()), the mod binding is
local to the .then callback and not traceable by the parser —
returning True was giving CCA false confidence about an import chain
it couldn't verify.

TamarW0 · 2026-06-22T11:23:57Z

+# ============================================================================
+
+def test_get_function_calls_empty_name_returns_empty(parser):
+    """_get_function_calls must return [] when callee_function_name is empty (Fix 3a)."""


what is 3a?

My bad, will remove that

TamarW0 · 2026-06-22T11:25:46Z

+def test_is_package_imported_empty_identifier_returns_false(parser):
+    """is_package_imported must return False when identifier is empty (Fix 3e)."""
+    code = "import { template } from 'lodash';\nconst x = require('express');"
+    assert parser.is_package_imported(code, '', 'lodash') is False


again that means that we will miss some imported packages
do we discover them in other way?

identifier here is the function/symbol name (e.g. template), not the package name — is_package_imported(code, 'template', 'lodash') checks whether template is imported from lodash. When identifier is empty, there's no symbol to look for in import statements, so False is correct. In practice, is_package_imported is called from search_for_called_function where identifier is extracted from call expressions found in the caller's code (e.g. parts[0] from template() or parts[-2] from lodash.template()). The CCA also guards against empty function names earlier — __find_caller_functions_bfs (line 229-231) returns None if get_function_name is empty, so search_for_called_function is never reached with an empty callee name. The guard on is_package_imported is a defensive inner check.

I am not sure I understood your point
it is make sense to return True if the package is imported like the cases here
what if we want to search for lodash. template, it will return that lodash is not imported
what is the problem you wanted to solve here? do you have an example?
actually I should post the question in the logic changes and not in the test, because the question is on the logic itself

See reply #255 (comment)

TamarW0 · 2026-06-22T11:50:05Z

-    result = parser.is_package_imported(code_content, identifier, callee_package)
-    assert result == expected, f"{description}: expected {expected}, got {result}"
+def test_get_function_name_bare_assignment_function(parser, code, expected_name):
+    """Fix B: Bare assignment to function expression (no const/let/var)."""


what is fix refer to?

Removed the confusing Fix comment

- Add comment explaining _JS_KEYWORDS filters control-flow false positives, not invalid identifiers - Remove all "Fix X", "Fix A-I", and "Pattern N" references from test comments and docstrings - Replace internal references with behavioral descriptions of what each test verifies - Add warning log when skipping documents with empty function names in CCA __find_initial_function Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

tmihalac · 2026-06-22T13:47:52Z

/test-heavy

TamarW0 · 2026-06-22T22:35:01Z

-@pytest.mark.parametrize("function_content,expected_name", [
-    # Object methods now use //(class: name) pattern
+# ============================================================================
+# Tests for Fix C: Bare assignment arrow


you have some more Fix X comments, can you please clean them?

tmihalac · 2026-06-23T06:32:42Z

/test-heavy

tmihalac requested a review from zvigrinberg June 17, 2026 15:01

tmihalac requested a review from TamarW0 June 18, 2026 08:46

tmihalac added 2 commits June 21, 2026 10:36

Remove parser_threshold parameter from ExtendedLanguageParser creation

b247db4

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

zvigrinberg requested changes Jun 22, 2026

View reviewed changes

TamarW0 reviewed Jun 22, 2026

View reviewed changes

Comment thread src/exploit_iq_commons/utils/chain_of_calls_retriever.py

- Log source path of documents skipped due to empty get_function_name

1ea7d6e

result in __find_initial_function Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

TamarW0 reviewed Jun 22, 2026

View reviewed changes

Comment thread tests/test_javascript_functions_parser.py

TamarW0 reviewed Jun 22, 2026

View reviewed changes

Conversation

tmihalac commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vbelouso commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Snyk checks have passed. No issues have been found so far.

Uh oh!

tmihalac commented Jun 18, 2026

Uh oh!

tmihalac commented Jun 18, 2026

Uh oh!

zvigrinberg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

TamarW0 Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tmihalac Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tmihalac Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tmihalac Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tmihalac commented Jun 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tmihalac Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

tmihalac commented Jun 17, 2026 •

edited

Loading

vbelouso commented Jun 17, 2026 •

edited

Loading

TamarW0 Jun 22, 2026 •

edited

Loading

tmihalac Jun 22, 2026 •

edited

Loading

tmihalac Jun 22, 2026 •

edited

Loading

tmihalac Jun 22, 2026 •

edited

Loading

tmihalac Jun 23, 2026 •

edited

Loading