BFS Atomic Counter Task#342
Conversation
Signed-off-by: Ethan Mahintorabi <ethanmoon@google.com>
|
Some initial feedback (more to come):
|
|
The most critical issue here is in-degree count/decrement mismatch (topological ordering violation) computeInDegrees() deduplicates per unique successor vertex: But enqueueAdjacentVertices() decrements per edge (deduped via processed_edges_): When multiple edges connect the same two vertices (e.g., different timing arc sets with different when conditions), the in-degree is incremented once but decremented multiple times. Concrete example:
This violates topological ordering. C's delay calculation uses stale/uninitialized data from B's contribution. Fix: either remove the counted_successors dedup in computeInDegrees (count per edge), or dedup per target vertex in enqueueAdjacentVertices. --One possible fix would be---- |
|
One comment regarding the change- this change bypasses the incremental delay tolerance. Old behavior New behavior Two calls to enqueueAdjacentVertices(to_vertex) happen:
If slews didn't change, call 1 is skipped. But call 2 runs anyway, decrements all successors' in-degrees, and dispatches any that become ready. The processed_edges_ set is empty for those edges (since call 1 was skipped), so they all get processed. This cascades - every successor is visited, and every successor unconditionally propagates to its successors, and so on through the entire downstream cone. But with new in-degree increment change, you probably need this. If a vertex C has 2 edges from A and B, and say A's slews changed but B's didn't:
So the unconditional propagation is necessary to make the in-degree mechanism work. But it means every vertex in the downstream cone is visited during incremental updates, even when slew changes die out early. For a small ECO on a large design, the old code might recompute tens If the parallel speedup outweighs the incremental efficiency loss, it's a valid tradeoff. Do you see this making a dent in runtime? |
Also note that the unconditional call was necessary because the in-degree mechanism requires every vertex to decrement its successors (otherwise successors are stuck). But it means the incremental optimization (stop propagating when slews don't change) is defeated. Now, processed_edges_ was added to prevent the two calls from double-decrementing the same edge. |
* fix power_json.tcl * get rid of the if/else statements throughout
Signed-off-by: James Cherry <cherry@parallaxsw.com>
|
Sorry @dsengupta0628 for pushing this without context. I pushed this after speaking with Tom and Matt in our weekly meeting. They asked me to push it here. I was messing around with Antigravity to see how hard it would be to make a task based iterator. This code is untested, and was pushed mainly just for reference on how a task based system could be introduced. It did manage to pass a lot of our test cases, but it's definitely not 100% correct. I think the main interesting thing I found from this process is that it's relatively easy to create a new BFSFwdIterator and replace it in the delay calculator. This is definitely junk code, and again was just pushed for reference. |
Hi Ethan. Yes Matt mentioned this was an idea we can explore-after I reviewed this. So I picked up your code and tried to clean it (resolved STA regressions with some fixes) up but was getting still more regression failures. So I am using your ideas and reimplementing Kahn’s BFS (that you implemented here actually) both in delay calc and arrival propagation. I have a working model now. But clearing out some more nitty gritty w.r.t incremental updates. Thanks |
|
I disagree that it needs specific regressions. EVERY existing regression already exercises the delay calculator BFS. It gets plenty of testing with existing regressions. The biggest problem it has is the assumption that you can use object id's as a index into a table (in_degrees_). This may be true for dbSta, but it is not true for OpenSTA. processed_edges_ is not necessary when the queuing is done correctly.. It also adds a mutex that will no doubt slow it down when multi-threaded. counted_successors is also not necessary. As is, it fails about 1/3 of the fast private regressions and 100% of the slow regressions. I spent a few hours working on it and got to to pass all of the non-incremental regressions to see if it had any performance advantage. It is uniformly slightly slower than the existing BFS on the slow regressions. |
No description provided.