Introduce --step-timeout option for kprove to ensure timely termination#4930
Introduce --step-timeout option for kprove to ensure timely termination#4930Stevengre wants to merge 6 commits into
--step-timeout option for kprove to ensure timely termination#4930Conversation
Implements #4924: factor kontrol's `--per-depth-timeout` mechanism out of tool-specific code and into a generic policy in `Prover.advance_proof`. `advance_proof` now runs each step under the prover's `step_timeout` (a per-step wall-clock budget). A step that exceeds its budget is interrupted, the prover is asked to `shrink_step` (do less work per step), and the step is retried (the timed-out node was never committed, so it reappears from `get_steps`). If the prover cannot shrink further, advancement stops. Provers without a `step_timeout` run synchronously as before, so `advance_proof` stays generic and gains no new parameter. - Prover: `step_timeout` attribute (whole seconds, None disables) plus no-op `shrink_step()` / `interrupt()` hooks - APRProver: `step_timeout` ctor arg (floored at 1s); `shrink_step` halves `execute_depth`; `interrupt` aborts the in-flight Kore request - KCFGExplore/CTermSymbolic/KoreClient/JsonRpc*/Transport: `interrupt()` that force-unblocks an in-flight single-socket request and reconnects - Unit tests: shrink-until-progress, stop-at-floor, fast-path, disabled
The step-timeout/shrink policy added to Prover.advance_proof was reachable only programmatically. Wire it through the pyk prove command: add a step_timeout field to ProveOptions, a --step-timeout argument (whole seconds, floored at 1 by APRProver), and thread the value from ProveRpc into APRProver. Omitting the flag leaves step_timeout=None, preserving the prior synchronous behavior.
fa4c4fa to
bf77114
Compare
|
This looks pretty good to me. I guess I'm not an expert on whether this is the best way to handle this issue, but one question I have: when we interrupt a step, it will disconnect from that instance of the haskell backend, but does that also stop the haskell backend from working on that step at the same time? Or does the haskell backend keep working on it, we are just now trying some smaller step size ourselves? |
The prior interrupt() shut down the client socket and reconnected. Empirically (legacy/haskell kore-rpc 0.1.145), that does NOT stop the backend: the request runs in a detached worker thread that the server only aborts on an explicit `cancel` JSON-RPC method, never on a dropped connection. So an interrupted step kept a core pinned at ~100% until it finished on its own, while we reconnected and retried a smaller step. Instead, inject a `cancel` request on the live connection. The server aborts the in-flight request, the awaiting thread receives a "Request cancelled" error, and the connection stays open and reusable (no reconnect). Measured: server CPU drops from ~100% to ~2% immediately after interrupt(). - Transport: replace no-arg interrupt() with send_interrupt(data) (raw out-of-band send on the live connection; default no-op; HTTP stays no-op, one conn per request) - SingleSocketTransport: sendall the cancel payload, no shutdown/reconnect - JsonRpcClient: build the `cancel` request and hand it to the transport
Asserts that KoreClient.interrupt() aborts an in-flight (non-terminating) execute promptly with a "Request cancelled" error rather than letting it run to completion, and that the connection survives the cancel and serves a subsequent request. Covers the interrupt mechanism advance_proof's step-timeout policy depends on.
| timed = self.step_timeout is not None | ||
| executor = ThreadPoolExecutor(max_workers=1) if timed else None | ||
| try: | ||
| while True: | ||
| steps = list(proof.get_steps()) | ||
| _LOGGER.info(f'Found {len(steps)} next steps for proof: {proof.id}') | ||
| if len(steps) == 0: | ||
| break | ||
| shrank_step = False | ||
| for step in steps: | ||
| if fail_fast and proof.failed: | ||
| _LOGGER.warning(f'Terminating proof early because fail_fast is set: {proof.id}') | ||
| proof.failure_info = self.failure_info(proof) | ||
| return | ||
| if max_iterations is not None and max_iterations <= iterations: | ||
| return | ||
| if timed: | ||
| assert executor is not None | ||
| budget = self.step_timeout | ||
| assert budget is not None | ||
| future = executor.submit(self.step_proof, step) | ||
| try: | ||
| results = future.result(timeout=budget) | ||
| except FuturesTimeoutError: | ||
| # The step exhausted its budget: interrupt it, ask the prover to do less | ||
| # work per step, and re-fetch steps so the same node is retried smaller. | ||
| self.interrupt() | ||
| wait([future]) | ||
| if not self.shrink_step(): | ||
| _LOGGER.warning( | ||
| f'Proof {proof.id}: step exhausted {budget}s budget and cannot be ' | ||
| f'shrunk further; stopping.' | ||
| ) | ||
| return | ||
| _LOGGER.warning( | ||
| f'Proof {proof.id}: step exhausted {budget}s budget; shrinking and retrying.' | ||
| ) | ||
| shrank_step = True | ||
| break | ||
| else: | ||
| results = self.step_proof(step) | ||
| iterations += 1 | ||
| for result in results: | ||
| proof.commit(result) | ||
| if iterations % maintenance_rate == 0: | ||
| proof.write_proof_data() | ||
| callback(proof) | ||
| if shrank_step: | ||
| continue | ||
| finally: | ||
| if executor is not None: | ||
| executor.shutdown(wait=False) |
There was a problem hiding this comment.
Claude generated review with my guidance, I think this should be refactorde a bit:
● The timed path here is worth tidying. The try/finally only exists to shutdown the executor, but the substance is the inline if timed: block, which gives the loop body two parallel execution paths (timed vs. else: results = self.step_proof(step)) plus the shrank_step / break / continue dance threaded through the same loop. I'd extract the timed execution into one private helper so the loop body reads uniformly:
results = self._execute_step(step, executor) # None => timed out + shrank, re-fetch and retry
if results is None:
shrank_step = True
breakwith _execute_step owning the submit / result(timeout) / interrupt + shrink_step logic. One caveat: the sentinel has to distinguish "shrank, retry" (today a break + continue) from "couldn't shrink, stop" (today a return), so use two distinct signals (e.g. a small StepTimedOut/StepExhausted exception pair, or a tagged result) rather than a bare None.
One thing to keep as-is, though: the timeout/shrink/retry policy belongs in the base advance_proof, not in step_proof. It's generic and driven entirely through the shrink_step/interrupt hooks — ImpliesProver overrides none of them and gets correct behavior (timeout → stop) for free. Folding it into step_proof would force every override to re-implement it, and would violate step_proof's documented "must not modify self" contract (the shrink path mutates self.execute_depth), which is what makes it safe to run on the worker thread in the first place.
advance_proofnow runs each step under the prover'sstep_timeout(a per-step wall-clock budget). A step that exceeds its budget is interrupted, the prover is asked toshrink_step(do less work per step), and the step is retried (the timed-out node was never committed, so it reappears fromget_steps). If the prover cannot shrink further, advancement stops. Provers without astep_timeoutrun synchronously as before, soadvance_proofstays generic and gains no new parameter.Close #4924.
PoC PR: runtimeverification/kontrol#1141