Skip to content

Rewrite JS parser and Segmenter:#255

Open
tmihalac wants to merge 6 commits into
RHEcosystemAppEng:mainfrom
tmihalac:rewrite-js-segmenter-and-parsers
Open

Rewrite JS parser and Segmenter:#255
tmihalac wants to merge 6 commits into
RHEcosystemAppEng:mainfrom
tmihalac:rewrite-js-segmenter-and-parsers

Conversation

@tmihalac

@tmihalac tmihalac commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator
  • Replaced esprima-based JavaScript segmenter with tree-sitter for reliable parsing of modern JS syntax (optional chaining, nullish coalescing, top-level await)
    • Fixed JS function name extraction: keyword filtering, position-aware matching, redundant pattern removal, generator/TypeScript/anonymous-export support
    • Added build-artifact filtering (should_skip) that excludes app-level dist/, build/static/, .min.js while preserving node_modules/*/dist/ as legitimate third-party source
    • Added empty-name guards in CCA BFS to prevent documents with unextractable function names from entering call-chain analysis
    • Fixed _get_function_calls regex to detect calls through optional chaining (obj?.method())

Fixes https://redhat.atlassian.net/browse/APPENG-5453

- Replaced esprima-based JavaScript segmenter with tree-sitter for reliable
  parsing of modern JS syntax (optional chaining, nullish coalescing, top-level
  await)
  - Fixed JS function name extraction: keyword filtering, position-aware matching,
  redundant pattern removal, generator/TypeScript/anonymous-export support
  - Added build-artifact filtering (should_skip) that excludes app-level dist/,
  build/static/, .min.js while preserving node_modules/*/dist/ as legitimate
  third-party source
  - Added empty-name guards in CCA BFS to prevent documents with unextractable
  function names from entering call-chain analysis
  - Fixed _get_function_calls regex to detect calls through optional chaining
  (obj?.method())

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac tmihalac requested a review from zvigrinberg June 17, 2026 15:01
@vbelouso

vbelouso commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Code Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@tmihalac tmihalac requested a review from TamarW0 June 18, 2026 08:46
@tmihalac

Copy link
Copy Markdown
Collaborator Author

/test-heavy

1 similar comment
@tmihalac

Copy link
Copy Markdown
Collaborator Author

/test-heavy

tmihalac added 2 commits June 21, 2026 10:36
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
…S parser

  - Replace 3 catastrophic regex patterns (P4/P9/P11) in get_function_name with
    safe arrow_header check + simple name extraction, eliminating 12-38s hangs
  - Add arrow_header extension for destructured params like ({a,b}) => {body}
    where first '{' is in params, with guard to skip non-arrow functions
  - Use negative lookahead =(?!>) in P9 to distinguish assignment from arrow
  - Fix should_skip to only match top-level exclude dirs (dist/, build/) not
    nested ones (node_modules/*/dist/)
  - Add thread-safe double-checked locking for tree-sitter language init
  - Refactor class/method extraction to handle anonymous classes, generators,
    getters/setters, and export patterns
  - Fix is_function to use class\b word boundary instead of class\s+[\w$]+
  - Strip optional chaining '?' from _get_function_calls results
  - Guard empty function names in CCA and create_map_of_local_vars with
    ValueError handling
  - Add direct export detection (export function/const) in is_function_exported
  - Add 387 JS parser tests and 139 JS segmenter tests covering backtracking
    regression, arrow functions, anonymous classes, and should_skip filtering

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

@zvigrinberg zvigrinberg left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Looks very good in overall, performance wise, and supporting better a modern JS ESM Syntax.

Please see some minor comments.

Comment on lines +26 to +33
_JS_KEYWORDS = frozenset({
'if', 'for', 'while', 'switch', 'catch', 'with', 'do', 'else',
'return', 'throw', 'new', 'delete', 'typeof', 'void',
'try', 'finally', 'function', 'class', 'extends',
'break', 'continue', 'var', 'let', 'const', 'yield', 'await',
'debugger', 'instanceof',
})

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Not sure if it's critical, but anyway, You're missing here the following reserved keywords in js:

case
default
export
false
true
import
in
null
super
this

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These keywords (default, case, export,
import, in, super, this) are deliberately excluded from _JS_KEYWORDS because
they're valid method names in JS (e.g., commander.default(), queryBuilder.in(),
serializer.export()).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac these are not valid function names in javascript, only as methods under class.
But because it's cascading to it after checking functions with regex so i would say it is OK....
But it's also applies for some of the keywords in _JS_KEYWORDS set

Welcome to Node.js v22.22.2.
Type ".help" for more information.
> class Test {
...   constructor(arg1, arg2) {
...     this.arg1 = arg1;
...     this.arg2 = arg2;
...   }
... 
...   export() {
...     return `This is a ${this.arg1} ${this.arg2}.`;
...   }
... 
...   delete() {
...     return "delete can be defined as method name" ;
...   }
...   
...   with() {
...     return "with can be defined as method name";
...   }
...   
... }
undefined
> 
> let test = new Test(1,2)
undefined
> test.with()
'with can be defined as method name'
> test.delete()
'delete can be defined as method name'
> 

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what do you suggest ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Would suggest to remove from that set keywords that possible as method names.

Comment thread src/exploit_iq_commons/utils/javascript_extended_segmenter.py Outdated
Comment thread src/exploit_iq_commons/utils/document_embedding.py
… guards

  - Fix _build_class_hierarchy regex to include $ in identifier char class ([\w$]+)
  - Pass code_documents to recursive _get_function_calls alias resolution calls
  - Add empty-name guard in CCA BFS to skip documents with unextractable function
  names
  - Add try/except + empty guard in function_name_extractor for robust name
  extraction
  - Use [-1] instead of [1] for get_package_names to handle variable-length returns
  - Simplify should_skip with direct index checks instead of loop
  - Add regex comments describing direct-call and callback-reference patterns
  - Add 8 tests: $-sign class hierarchy (6) and chained alias resolution (2)
  - Fix test assertions: generator star spacing, empty-string guard, jQuery dollar
  detection

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

def filter_docs_by_func_pkg_name(self, function_name: str, package_name: str, documents: list[Document]) -> list[Document]:
if not function_name:
return []

@TamarW0 TamarW0 Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what case it solve? anonymous functions?

@tmihalac tmihalac Jun 22, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 it returns empty
  string for anonymous functions or raises ValueError. The guard skips
  documents with empty names to avoid pointless iteration and to 
  prevent a false match if function_name also happens to be empty ('' 
  == '').

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we parse those empty name functions?
cant we remove them earlier ?

@tmihalac tmihalac Jun 22, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  This guard is on the search parameter function_name (what we're
  looking for), not on the documents themselves. If function_name is
  empty, page_content.__contains__('') is True for every document, so
  we'd return all documents as matches — a false positive flood. The
  guard short-circuits that.

  Anonymous functions with empty names in the document list are a
  separate concern — they're filtered by the empty-name guards in
  __find_initial_function and __find_caller_function_dfs.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it, and here I think it is a good solution
I am asking about the segmenter part
if we see a function that doesn't have name, why wont we just skip her?
do you think it is too much to change right now?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The segmenter could skip them by checking if the tree-sitter node has
no name child, but it's a raw code extractor — it doesn't have
function-naming logic. The cleaner place would be the parser's
is_function returning False for anonymous exports, but that also
affects simplify_code. Doable but has ripple effects

Comment thread src/exploit_iq_commons/utils/chain_of_calls_retriever.py
  result in __find_initial_function

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@@ -2194,18 +2326,18 @@ def test_is_exportable_function(parser, function_name, file_content, expected, d
True,
),
(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that means that we will miss some imported packages
do we discover them in other way?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific, what is this comment related to?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to the following two changes (test result for dynamic import changed from True to False)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected result changed from True to False because of the
empty-identifier guard added at the top of is_package_imported.
Previously, an empty identifier with a matching package name reached
the dynamic import handler (line 867-878), where the else: return
True branch fired — meaning "package is dynamically imported, and
since no specific symbol was requested, that's enough." That was a
false positive: is_package_imported answers "is this symbol imported
from this package?", not "is this package referenced anywhere." With
an empty identifier, we can't confirm any specific symbol is
imported, so False is correct.

For import('lodash').then(mod => mod.template()), the mod binding is
local to the .then callback and not traceable by the parser —
returning True was giving CCA false confidence about an import chain
it couldn't verify.

# ============================================================================

def test_get_function_calls_empty_name_returns_empty(parser):
"""_get_function_calls must return [] when callee_function_name is empty (Fix 3a)."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is 3a?

@tmihalac tmihalac Jun 22, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, will remove that

def test_is_package_imported_empty_identifier_returns_false(parser):
"""is_package_imported must return False when identifier is empty (Fix 3e)."""
code = "import { template } from 'lodash';\nconst x = require('express');"
assert parser.is_package_imported(code, '', 'lodash') is False

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again that means that we will miss some imported packages
do we discover them in other way?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identifier here is the function/symbol name (e.g. template), not the
  package name — is_package_imported(code, 'template', 'lodash') checks
  whether template is imported from lodash. When identifier is empty,
  there's no symbol to look for in import statements, so False is
  correct.

  In practice, is_package_imported is called from
  search_for_called_function where identifier is extracted from call
  expressions found in the caller's code (e.g. parts[0] from template()
  or parts[-2] from lodash.template()). The CCA also guards against
  empty function names earlier — __find_caller_functions_bfs (line
  229-231) returns None if get_function_name is empty, so
  search_for_called_function is never reached with an empty callee
  name. The guard on is_package_imported is a defensive inner check.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understood your point
it is make sense to return True if the package is imported like the cases here
what if we want to search for lodash. template, it will return that lodash is not imported
what is the problem you wanted to solve here? do you have an example?
actually I should post the question in the logic changes and not in the test, because the question is on the logic itself

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See reply #255 (comment)

Comment thread tests/test_javascript_functions_parser.py
result = parser.is_package_imported(code_content, identifier, callee_package)
assert result == expected, f"{description}: expected {expected}, got {result}"
def test_get_function_name_bare_assignment_function(parser, code, expected_name):
"""Fix B: Bare assignment to function expression (no const/let/var)."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is fix refer to?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the confusing Fix comment

  - Add comment explaining _JS_KEYWORDS filters control-flow false
  positives, not invalid identifiers
  - Remove all "Fix X", "Fix A-I", and "Pattern N" references from test
  comments and docstrings
  - Replace internal references with behavioral descriptions of what
  each test verifies
  - Add warning log when skipping documents with empty function names
  in CCA __find_initial_function

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac

Copy link
Copy Markdown
Collaborator Author

/test-heavy

@pytest.mark.parametrize("function_content,expected_name", [
# Object methods now use //(class: name) pattern
# ============================================================================
# Tests for Fix C: Bare assignment arrow

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have some more Fix X comments, can you please clean them?

@tmihalac tmihalac Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@tmihalac

Copy link
Copy Markdown
Collaborator Author

/test-heavy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants