Skip to content

Introduce --step-timeout option for kprove to ensure timely termination#4930

Open
Stevengre wants to merge 6 commits into
developfrom
feat/per-depth-timeout-advance-proof
Open

Introduce --step-timeout option for kprove to ensure timely termination#4930
Stevengre wants to merge 6 commits into
developfrom
feat/per-depth-timeout-advance-proof

Conversation

@Stevengre

@Stevengre Stevengre commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

advance_proof now runs each step under the prover's step_timeout (a per-step wall-clock budget). A step that exceeds its budget is interrupted, the prover is asked to shrink_step (do less work per step), and the step is retried (the timed-out node was never committed, so it reappears from get_steps). If the prover cannot shrink further, advancement stops. Provers without a step_timeout run synchronously as before, so advance_proof stays generic and gains no new parameter.

Close #4924.
PoC PR: runtimeverification/kontrol#1141

@Stevengre Stevengre requested review from ehildenb and tothtamas28 June 8, 2026 11:44
@Stevengre Stevengre self-assigned this Jun 8, 2026
@deosa-arch deosa-arch changed the base branch from master to develop June 8, 2026 11:45
Stevengre added 2 commits June 8, 2026 19:48
Implements #4924: factor kontrol's
`--per-depth-timeout` mechanism out of tool-specific code and into a
generic policy in `Prover.advance_proof`.

`advance_proof` now runs each step under the prover's `step_timeout`
(a per-step wall-clock budget). A step that exceeds its budget is
interrupted, the prover is asked to `shrink_step` (do less work per
step), and the step is retried (the timed-out node was never committed,
so it reappears from `get_steps`). If the prover cannot shrink further,
advancement stops. Provers without a `step_timeout` run synchronously as
before, so `advance_proof` stays generic and gains no new parameter.

- Prover: `step_timeout` attribute (whole seconds, None disables) plus
  no-op `shrink_step()` / `interrupt()` hooks
- APRProver: `step_timeout` ctor arg (floored at 1s); `shrink_step`
  halves `execute_depth`; `interrupt` aborts the in-flight Kore request
- KCFGExplore/CTermSymbolic/KoreClient/JsonRpc*/Transport: `interrupt()`
  that force-unblocks an in-flight single-socket request and reconnects
- Unit tests: shrink-until-progress, stop-at-floor, fast-path, disabled
The step-timeout/shrink policy added to Prover.advance_proof was reachable
only programmatically. Wire it through the pyk prove command: add a
step_timeout field to ProveOptions, a --step-timeout argument (whole seconds,
floored at 1 by APRProver), and thread the value from ProveRpc into APRProver.
Omitting the flag leaves step_timeout=None, preserving the prior synchronous
behavior.
@Stevengre Stevengre force-pushed the feat/per-depth-timeout-advance-proof branch from fa4c4fa to bf77114 Compare June 8, 2026 11:50
@ehildenb

ehildenb commented Jun 8, 2026

Copy link
Copy Markdown
Member

This looks pretty good to me. I guess I'm not an expert on whether this is the best way to handle this issue, but one question I have: when we interrupt a step, it will disconnect from that instance of the haskell backend, but does that also stop the haskell backend from working on that step at the same time? Or does the haskell backend keep working on it, we are just now trying some smaller step size ourselves?

Stevengre added 3 commits June 9, 2026 15:41
The prior interrupt() shut down the client socket and reconnected. Empirically
(legacy/haskell kore-rpc 0.1.145), that does NOT stop the backend: the request
runs in a detached worker thread that the server only aborts on an explicit
`cancel` JSON-RPC method, never on a dropped connection. So an interrupted step
kept a core pinned at ~100% until it finished on its own, while we reconnected
and retried a smaller step.

Instead, inject a `cancel` request on the live connection. The server aborts the
in-flight request, the awaiting thread receives a "Request cancelled" error, and
the connection stays open and reusable (no reconnect). Measured: server CPU drops
from ~100% to ~2% immediately after interrupt().

- Transport: replace no-arg interrupt() with send_interrupt(data) (raw out-of-band
  send on the live connection; default no-op; HTTP stays no-op, one conn per request)
- SingleSocketTransport: sendall the cancel payload, no shutdown/reconnect
- JsonRpcClient: build the `cancel` request and hand it to the transport
Asserts that KoreClient.interrupt() aborts an in-flight (non-terminating) execute
promptly with a "Request cancelled" error rather than letting it run to completion,
and that the connection survives the cancel and serves a subsequent request. Covers
the interrupt mechanism advance_proof's step-timeout policy depends on.
Comment thread pyk/src/tests/unit/test_advance_proof.py Outdated
@Stevengre Stevengre requested a review from ehildenb June 15, 2026 07:41
Comment on lines +556 to +607
timed = self.step_timeout is not None
executor = ThreadPoolExecutor(max_workers=1) if timed else None
try:
while True:
steps = list(proof.get_steps())
_LOGGER.info(f'Found {len(steps)} next steps for proof: {proof.id}')
if len(steps) == 0:
break
shrank_step = False
for step in steps:
if fail_fast and proof.failed:
_LOGGER.warning(f'Terminating proof early because fail_fast is set: {proof.id}')
proof.failure_info = self.failure_info(proof)
return
if max_iterations is not None and max_iterations <= iterations:
return
if timed:
assert executor is not None
budget = self.step_timeout
assert budget is not None
future = executor.submit(self.step_proof, step)
try:
results = future.result(timeout=budget)
except FuturesTimeoutError:
# The step exhausted its budget: interrupt it, ask the prover to do less
# work per step, and re-fetch steps so the same node is retried smaller.
self.interrupt()
wait([future])
if not self.shrink_step():
_LOGGER.warning(
f'Proof {proof.id}: step exhausted {budget}s budget and cannot be '
f'shrunk further; stopping.'
)
return
_LOGGER.warning(
f'Proof {proof.id}: step exhausted {budget}s budget; shrinking and retrying.'
)
shrank_step = True
break
else:
results = self.step_proof(step)
iterations += 1
for result in results:
proof.commit(result)
if iterations % maintenance_rate == 0:
proof.write_proof_data()
callback(proof)
if shrank_step:
continue
finally:
if executor is not None:
executor.shutdown(wait=False)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude generated review with my guidance, I think this should be refactorde a bit:

● The timed path here is worth tidying. The try/finally only exists to shutdown the executor, but the substance is the inline if timed: block, which gives the loop body two parallel execution paths (timed vs. else: results = self.step_proof(step)) plus the shrank_step / break / continue dance threaded through the same loop. I'd extract the timed execution into one private helper so the loop body reads uniformly:

results = self._execute_step(step, executor)   # None => timed out + shrank, re-fetch and retry
if results is None:
    shrank_step = True
    break

with _execute_step owning the submit / result(timeout) / interrupt + shrink_step logic. One caveat: the sentinel has to distinguish "shrank, retry" (today a break + continue) from "couldn't shrink, stop" (today a return), so use two distinct signals (e.g. a small StepTimedOut/StepExhausted exception pair, or a tagged result) rather than a bare None.

One thing to keep as-is, though: the timeout/shrink/retry policy belongs in the base advance_proof, not in step_proof. It's generic and driven entirely through the shrink_step/interrupt hooks — ImpliesProver overrides none of them and gets correct behavior (timeout → stop) for free. Folding it into step_proof would force every override to re-implement it, and would violate step_proof's documented "must not modify self" contract (the shrink path mutates self.execute_depth), which is what makes it safe to run on the worker thread in the first place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support progressive depth halving as a generic policy in Prover.advance_proof

2 participants