BFS Atomic Counter Task by QuantamHD · Pull Request #342 · The-OpenROAD-Project/OpenSTA

QuantamHD · 2026-04-07T23:48:40Z

No description provided.

Signed-off-by: Ethan Mahintorabi <ethanmoon@google.com>

CLAassistant · 2026-04-07T23:48:46Z

All committers have signed the CLA.

dsengupta0628 · 2026-04-09T19:58:32Z

Some initial feedback (more to come):

No tests or benchmarks. A change to the core BFS traversal should include regression tests (timing correctness) and performance benchmarks (runtime comparison on representative designs).
No PR description. The motivation, design rationale, and expected improvements are not documented.
Both the OpenSTA and OpenROAD regressions have failures. Please merge with the latest OpenROAD/OpenSTA master and rerun the regressions.

dsengupta0628 · 2026-04-09T20:28:45Z

The most critical issue here is an in-degree count/decrement mismatch, which causes a topological ordering violation.

computeInDegrees() deduplicates per unique successor vertex:

  std::set<Vertex*> counted_successors;
  // ...
  if (counted_successors.insert(to_vertex).second) {
      in_degrees_[to_vertex->objectIdx()].fetch_add(1, ...);
  }

But enqueueAdjacentVertices() decrements per edge (deduped via processed_edges_):

  inserted = processed_edges_.insert(edge).second;  // per edge, not per vertex
  if (inserted) {
      int old_deg = in_degrees_[to_vertex->objectIdx()].fetch_sub(1, ...);
      if (old_deg == 1) { /* process vertex */ }
  }

When multiple edges connect the same two vertices (e.g., different timing arc sets with different when conditions), the in-degree is incremented once but decremented multiple times. Concrete example:

Vertex A has edges e1, e2 to C. Vertex B has edge e3 to C.
computeInDegrees: C gets in-degree = 2 (A counted once, B once)
enqueueAdjacentVertices(A):
- e1: decrement C: 2→1, old_deg=2, not ready
- e2: decrement C: 1→0, old_deg=1, process C
But B hasn't been visited yet — C is processed before all predecessors are done.

This violates topological ordering. C's delay calculation uses stale/uninitialized data from B's contribution. Fix: either remove the counted_successors dedup in computeInDegrees (count per edge), or dedup per target vertex in enqueueAdjacentVertices.

One possible fix: count per edge in computeInDegrees by removing the counted_successors set entirely, so that the in-degree counts every edge passing the predicate and matches the per-edge decrement in enqueueAdjacentVertices.
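The mismatch can be reproduced on a toy graph. The following is a minimal sketch, not the PR's code: vertices are plain ints, and countInDegrees and releaseSuccessors are hypothetical stand-ins for computeInDegrees and enqueueAdjacentVertices.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Toy graph: edges are (from, to) pairs of int vertex ids; parallel edges are
// allowed. These are illustrative stand-ins, not OpenSTA's Vertex/Edge classes.
using EdgeList = std::vector<std::pair<int, int>>;

// Count in-degrees either once per unique (from, to) pair (the reviewed
// behavior) or once per edge (the proposed fix).
std::vector<int> countInDegrees(const EdgeList &edges, int vertex_count,
                                bool dedup_per_successor) {
  std::vector<int> in_degree(vertex_count, 0);
  std::set<std::pair<int, int>> counted;
  for (const auto &edge : edges) {
    assert(edge.second >= 0 && edge.second < vertex_count);
    if (!dedup_per_successor || counted.insert(edge).second)
      in_degree[edge.second]++;
  }
  return in_degree;
}

// Decrement once per outgoing edge of `from` and return the vertices whose
// counter just dropped from 1 to 0 (the "ready" vertices).
std::vector<int> releaseSuccessors(const EdgeList &edges, int from,
                                   std::vector<int> &in_degree) {
  std::vector<int> ready;
  for (const auto &edge : edges) {
    if (edge.first == from && in_degree[edge.second]-- == 1)
      ready.push_back(edge.second);
  }
  return ready;
}
```

With A=0, B=1, C=2 and edges {0,2}, {0,2}, {1,2}, the deduplicated count gives C an in-degree of 2, and releasing A alone already reports C as ready. Counting per edge gives 3, and C becomes ready only after B is released as well.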

dsengupta0628 · 2026-04-09T20:44:20Z

One comment regarding the change: it bypasses the incremental delay tolerance.

Old behavior
With the old BfsFwdIterator, if a vertex's output slews didn't change, enqueueAdjacentVertices was not called: successors were never enqueued, and the entire downstream cone was skipped. This is the key incremental optimization: a small change that doesn't affect slews stops propagating immediately.

New behavior
In the new enqueueAdjacentVertices, when a successor's in-degree reaches 0, it dispatches this lambda:


  dispatch_queue_->dispatch([this, to_vertex](size_t tid) {
      current_thread_id = tid;
      visitors_[tid]->visit(to_vertex);     // → findVertexDelay (may or may not propagate)
      visit_count_->fetch_add(1, ...);
      enqueueAdjacentVertices(to_vertex);   // always propagates
  });

Two calls to enqueueAdjacentVertices(to_vertex) happen:

1. Conditionally from findVertexDelay inside the visitor (only if slews changed)
2. Unconditionally from the lambda itself

If slews didn't change, call 1 is skipped. But call 2 runs anyway, decrements all successors' in-degrees, and dispatches any that become ready. The processed_edges_ set is empty for those edges (since call 1 was skipped), so they all get processed.

This cascades: every successor is visited, and every successor unconditionally propagates to its successors, and so on through the entire downstream cone.

But with the new in-degree counting, you probably need this. If a vertex C has two incoming edges, one from A and one from B, and say A's slews changed but B's didn't:

With the old BFS: A propagates, enqueues C. B doesn't propagate. C is still visited (reached via A). Works fine — any single predecessor can trigger a visit.
With in-degree counting: A decrements C (2→1). If B doesn't decrement (because its slews didn't change), C stays at in-degree 1 forever — it's never processed, even though A's change means C needs recomputation.

So the unconditional propagation is necessary to make the in-degree mechanism work. But it means every vertex in the downstream cone is visited during incremental updates, even when slew changes die out early. For a small ECO on a large design, the old code might recompute tens of vertices; the new code recomputes the entire fanout cone.

If the parallel speedup outweighs the incremental efficiency loss, it's a valid tradeoff. Do you see this making a dent in runtime?
Alternatively, you could use the in-degree BFS only for the non-incremental first pass and fall back to the old BfsFwdIterator for incremental updates, where selective propagation matters most.
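The tradeoff described above can be seen by counting visits on a small graph. This is a simplified single-threaded sketch under stated assumptions: the graph is an adjacency list over int ids, `changed` is a hypothetical predicate standing in for "output slews changed", and the start vertex is the only source. Neither function is the actual BfsFwdIterator or the PR's iterator.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Adjacency list over int vertex ids; a toy stand-in for the timing graph.
using Graph = std::vector<std::vector<int>>;

// Old-style traversal: successors are enqueued only when the visited vertex
// reports a change, so propagation stops where changes die out.
int visitWithCutoff(const Graph &graph, int start,
                    const std::function<bool(int)> &changed) {
  std::vector<bool> queued(graph.size(), false);
  std::queue<int> pending;
  pending.push(start);
  queued[start] = true;
  int visits = 0;
  while (!pending.empty()) {
    int vertex = pending.front();
    pending.pop();
    visits++;
    if (!changed(vertex))
      continue;  // unchanged: the downstream cone is skipped
    for (int to : graph[vertex]) {
      if (!queued[to]) {
        queued[to] = true;
        pending.push(to);
      }
    }
  }
  return visits;
}

// In-degree traversal: every visited vertex must release its successors,
// otherwise their counters never reach zero.
int visitWithInDegrees(const Graph &graph, int start) {
  std::vector<int> in_degree(graph.size(), 0);
  for (const auto &successors : graph)
    for (int to : successors)
      in_degree[to]++;
  assert(in_degree[start] == 0);  // toy assumption: start is the only source
  std::queue<int> ready;
  ready.push(start);
  int visits = 0;
  while (!ready.empty()) {
    int vertex = ready.front();
    ready.pop();
    visits++;
    for (int to : graph[vertex])
      if (--in_degree[to] == 0)
        ready.push(to);
  }
  return visits;
}
```

On the graph 0→{1,2}, 1→3, 2→3, 3→4, where only vertex 0 reports a change, the cutoff traversal visits three vertices while the in-degree traversal visits all five.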

dsengupta0628 · 2026-04-09T21:11:23Z

Also note that the unconditional call was necessary because the in-degree mechanism requires every vertex to decrement its successors (otherwise successors are stuck). But it means the incremental optimization (stop propagating when slews don't change) is defeated. Now, processed_edges_ was added to prevent the two calls from double-decrementing the same edge.
If you remove the unconditional call then processed_edges_ + its mutex become unnecessary. No mutex, no std::set, no O(log n) tree operations. Just an atomic decrement per edge — the same cost as incrementing in computeInDegrees.
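A minimal sketch of the "one atomic decrement per edge" idea, assuming each edge is released exactly once so that no processed-edge set or mutex is needed. InDegreeTable and its method names are hypothetical and do not appear in the PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the PR's code: one atomic counter per vertex,
// incremented once per incoming edge and decremented once per incoming edge.
class InDegreeTable {
public:
  explicit InDegreeTable(std::size_t vertex_count) : counts_(vertex_count) {
    for (auto &count : counts_)
      count.store(0, std::memory_order_relaxed);
  }

  // Called once per edge while building the table.
  void addEdgeTo(std::size_t to) {
    counts_[to].fetch_add(1, std::memory_order_relaxed);
  }

  // Called once per edge after the edge's source vertex has been visited.
  // Returns true for exactly one caller: the one that removes the last
  // outstanding edge and is therefore responsible for dispatching `to`.
  bool releaseEdgeTo(std::size_t to) {
    int old_count = counts_[to].fetch_sub(1, std::memory_order_acq_rel);
    assert(old_count > 0);  // more releases than edges would be a bug
    return old_count == 1;
  }

private:
  std::vector<std::atomic<int>> counts_;
};
```

Because fetch_sub returns the previous value atomically, only one thread can observe the 1→0 transition, so the successor is dispatched once even when several predecessors finish concurrently.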

QuantamHD · 2026-04-17T18:42:03Z

Sorry @dsengupta0628 for pushing this without context. I pushed this after speaking with Tom and Matt in our weekly meeting. They asked me to push it here. I was messing around with Antigravity to see how hard it would be to make a task based iterator.

This code is untested, and was pushed mainly just for reference on how a task based system could be introduced. It did manage to pass a lot of our test cases, but it's definitely not 100% correct.

I think the main interesting thing I found from this process is that it's relatively easy to create a new BFSFwdIterator and replace it in the delay calculator.

This is definitely junk code, and again was just pushed for reference.

dsengupta0628 · 2026-04-18T11:26:10Z

Hi Ethan. Yes, after I reviewed this, Matt mentioned it was an idea we could explore. I picked up your code and tried to clean it up (some fixes resolved STA regressions), but I was still getting more regression failures. So I am using your ideas and reimplementing Kahn's BFS (which is actually what you implemented here) in both delay calculation and arrival propagation. I have a working model now, but I am still clearing out some details with respect to incremental updates. Thanks

jjcherry56 · 2026-06-09T03:22:08Z

I disagree that it needs specific regressions. EVERY existing regression already exercises the delay calculator BFS. It gets plenty of testing with existing regressions.

The biggest problem it has is the assumption that you can use object IDs as an index into a table (in_degrees_). This may be true for dbSta, but it is not true for OpenSTA.
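One way to avoid depending on object IDs being usable as table indices is to key the counters by vertex pointer instead. The sketch below is only an illustration of that idea under assumptions: Vertex is an empty stand-in type, InDegreeMap is hypothetical, and the map is assumed to be fully built before any concurrent traversal starts.

```cpp
#include <atomic>
#include <cassert>
#include <unordered_map>

// Stand-in for the timing-graph vertex type; only its address is used here.
struct Vertex {};

// Hypothetical alternative to a vector indexed by object id: counters are
// keyed by vertex pointer, so ids need not be dense or zero-based.
class InDegreeMap {
public:
  // Single-threaded setup phase. operator[] value-initializes the counter
  // to zero on first access, then the counter is incremented.
  void addEdgeTo(const Vertex *to) {
    counts_[to].fetch_add(1, std::memory_order_relaxed);
  }

  // Traversal phase: the map itself is no longer modified, so concurrent
  // lookups are safe and only the atomic counter is written.
  bool releaseEdgeTo(const Vertex *to) {
    auto it = counts_.find(to);
    assert(it != counts_.end());
    return it->second.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }

private:
  std::unordered_map<const Vertex *, std::atomic<int>> counts_;
};
```

A hash lookup per edge costs more than a vector index, so this trades some speed for not relying on the id layout.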

processed_edges_ is not necessary when the queuing is done correctly. It also adds a mutex that will no doubt slow it down when multi-threaded.

counted_successors is also not necessary.

As is, it fails about 1/3 of the fast private regressions and 100% of the slow regressions.

I spent a few hours working on it and got it to pass all of the non-incremental regressions to see if it had any performance advantage. It is uniformly slightly slower than the existing BFS on the slow regressions.

BFS Atomic Counter Task

c4e516e

Signed-off-by: Ethan Mahintorabi <ethanmoon@google.com>

eder-matheus pushed a commit to eder-matheus/OpenSTA that referenced this pull request Apr 11, 2026

Report power as JSON (The-OpenROAD-Project#342)

56e4bd8

* fix power_json.tcl * get rid of the if/else statements throughout

eder-matheus pushed a commit to eder-matheus/OpenSTA that referenced this pull request Apr 11, 2026

rm PR The-OpenROAD-Project#342 turd

4afa443

Signed-off-by: James Cherry <cherry@parallaxsw.com>

QuantamHD wants to merge 1 commit into The-OpenROAD-Project:master from QuantamHD:atomic_task
