perf: avoid O(N^2) exiting-branch checks in CodeFolding #8599
Changqing-JING wants to merge 3 commits into WebAssembly:main from
Conversation
```cpp
// efficient bottom-up traversal.
bool hasExitingBranches(Expression* expr) {
  if (!exitingBranchCachePopulated_) {
    populateExitingBranchCache(getFunction()->body);
```
Looks like this still scans the entire function. I suggest that we only scan expr itself. That will still avoid re-computing things, but avoid scanning things that we never need to look at.
This does require that the cache store a bool, so we know if we scanned or not, and if we did, if we found branches out or not. But I think that is worth it - usually we will scan very few things.
The per-expression cache would still be O(N^2) in the nested block case. AssemblyScript GC emits `__visit_members` with deeply nested blocks + `br_table`, where the nesting level equals the number of classes (4000+ in real apps). Each nested block gets queried by `optimizeTerminatingTails`, and each query walks its overlapping subtree independently, giving O(N + (N-1) + ... + 1) = O(N^2) total work even with the cache.
We also cannot reuse a child's cached bool to compute a parent's result, because knowing "child has exiting branches" does not tell us which names exit -- the parent may define/resolve some of them. To compose results bottom-up, we would need to store the full set of unresolved names per expression. I benchmarked that approach (storing `unordered_map<Expression*, unordered_set>` and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made `-Oz` significantly slower than the baseline (~13min vs ~5min).
The whole-function scan avoids both issues by computing all results in a single O(N) pass using only integer counters, with no per-node name storage.
Follow-up PR of #8586 to optimize CodeFolding.
`optimizeTerminatingTails` calls `EffectAnalyzer` per tail item, each walking the full subtree. On deeply nested blocks this is O(N^2). Replace the per-item walks with a single O(N) bottom-up `PostWalker` (`populateExitingBranchCache`) that pre-computes exiting-branch results for every node, making subsequent lookups O(1).

Example: AssemblyScript GC compiles `__visit_members` as a `br_table` dispatch over all types, producing ~N nested blocks with ~N tails. The old code walks each tail's subtree separately -- O(N^2) total node visits. With this change, one bottom-up walk covers all nodes, then each tail lookup is O(1).

benchmark data
The test module is from issue #7319
#7319 (comment)
In main head:

```
time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling -o /dev/null ./test3.wasm
real 9m16.111s
user 35m33.985s
sys  0m51.000s
```

In the PR: