benchmark-rationalisation: tooling cleanup, kill discipline, CPU-time timeout, containers #9
Open
turingfan wants to merge 40 commits into
Delete scripts/benchmark_level4.py and scripts/analyze_level4.py. These had hardcoded paths, an EXCLUDED_GAMES list from ad-hoc instructions, an incompatible CSV schema (flat_*/hash_* columns), and a SIGKILL-on-timeout bug that discarded all solver output, recording nodes: 0. The redux and unwinnable experiment pairs are left intact pending orchestrator review — see commit message body for analysis. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bring csv_schema.md up to full contract status: add explicit column numbers, source fields, full outcome-vocabulary flow table (solver JSON string → CSV value → bench-hook bucket), semantic distinction between TIMEOUT / TERMINATED / KILLED, and a conformance-gaps section listing five discrepancies found by code inspection (SIGTERM target, UNKNOWN not counted by hook, smart-solvability vocabulary, time_us rounding, cache_capacity blank sentinel). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keep local Claude Code settings out of the repo (the file should stay deleted from tracking even though a local copy persists on disk), and ignore transient agent worktrees. Prevents the file re-entering the index as an unmerged conflict. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stage 1 step 7: START-HERE.md — single entry point answering "which script for which job" with copy-pasteable commands, a known-rough-edges warning box (worker-count safety, SIGTERM caveat, compare_benchmarks.py under review), and links to csv_schema.md and the rationalisation plan. Stage 1 step 8: script-inventory.md — table of every surviving benchmarking script with one-line job, inputs, outputs, and status (core / experiment / stage2-eval / dedup-candidate / analysis / out-of-scope). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Relocate docs/benchmarking/archive/ (9 superseded mac-dev-benchmark-enhancements and legacy comparison/implementation docs) to 01-Knowledge-Base/Archive/benchmarking-legacy-docs/ per the repo rule that completed/superseded docs live in the Knowledge-Base, not the solver repo. README now points at START-HERE and the active doc set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PICKUP records Stage 1 completion, kept-pair decisions, and carried-forward findings. Stage 2 plan breaks the work into a sequential kill-discipline/concurrency spine plus parallel docs/decision tasks, with acceptance criteria (clean bench_multiplicity.sh run), an agent-brief template, and open questions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…GNU parallel) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…analysis/*.R) The script was already marked DEPRECATED at the top (pointing to design.md §7) and design.md §10 listed it as deprecated. The R analysis layer (benchmark.R, summary.R, compare_labels.R) fully covers its useful function — baseline vs current comparison — with superior statistics (bootstrap CI, Wilcoxon test, HTML scatter plot). The only unique capability (RHT hardware normalization) is a legacy concept not used in the current CSV-based workflow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New scripts/bench_lib/process.py: run_with_deadline() spawns the solver in its own session (start_new_session=True) so the solver AND any /usr/bin/time wrapper share a process group. Solver --timeout is authoritative; Python deadline = 1.5x (D1, no floor/cap); on overrun SIGTERM the group -> 30s grace -> SIGKILL the group -> reap. Partial stdout is always captured; disposition distinguishes EXITED_OK/EXITED_ERR/KILLED_AFTER_SIGTERM/KILLED_HARD. 13 unit tests (fake child processes) pass.
run_benchmark.py: run_solver() delegates to the helper (fixes the orphaned-grandchild bug); RSS falls back to solver_resident_bytes then 0, never fabricated; solver "failed" maps to FAILED (no bare UNKNOWN for real outcomes).
csv_schema.md: reconciled with the above (FAILED value, 1.5x, RSS fallback, process-group kill); conformance gaps #1 and #3 marked resolved.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
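For reviewers, a minimal Python sketch of the kill sequence this commit describes (own session, SIGTERM to the group, grace period, SIGKILL to the group, partial stdout kept). It is not the code in scripts/bench_lib/process.py: the signature, the return tuple and the rule that maps outcomes to disposition names are assumptions made for illustration.

```python
import os
import signal
import subprocess
import sys


def _signal_group(pgid, signum):
    """Signal a whole process group, tolerating a group that is already gone."""
    try:
        os.killpg(pgid, signum)
    except ProcessLookupError:
        pass


def run_with_deadline(cmd, deadline_s, grace_s=30.0):
    """Run cmd in its own session and enforce a wall deadline on the group.

    Returns (disposition, returncode, stdout, stderr). The disposition mapping
    below is an assumption, not the helper's documented contract.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # child leads a new group, so pgid == proc.pid
    )
    disposition = None
    try:
        out, err = proc.communicate(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        # Overrun: ask the whole group to stop, then wait out the grace period.
        _signal_group(proc.pid, signal.SIGTERM)
        try:
            out, err = proc.communicate(timeout=grace_s)
            disposition = "KILLED_AFTER_SIGTERM"
        except subprocess.TimeoutExpired:
            # Still alive after the grace period: hard-kill the group and reap.
            _signal_group(proc.pid, signal.SIGKILL)
            out, err = proc.communicate()
            disposition = "KILLED_HARD"
    if disposition is None:
        disposition = "EXITED_OK" if proc.returncode == 0 else "EXITED_ERR"
    return disposition, proc.returncode, out, err


disposition, rc, out, _ = run_with_deadline(
    [sys.executable, "-c", "print('ok')"], deadline_s=10
)
print(disposition, rc, out)
```

Retrying communicate() after TimeoutExpired does not lose output already read, which is what lets the sketch keep partial stdout from a run that had to be signalled.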
bench_level5_unwinnable2.sh removed — one-off scratch variant (all
standard run_variant calls commented out, only solvitaire-hash-only
invocation kept); adds no reusable capability over the base script.
tuesday-night-redux{,2}.sh kept — the 2 variant has a genuinely unique
capability (runs all four solver variants with --label flags for
per-variant comparison) but has a $SOLVER env-override design quirk and
alina excluded; merging is not clean, so both are kept and each now
carries a one-line header explaining what distinguishes it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Inventory: add bench_lib/process.py (core lib), drop removed compare_benchmarks.py, fold redux2 into experiments with note, record unwinnable2 removal, drop the dedup-candidate section. PICKUP: Stage 2 progress, remaining tasks, the worktree-stale-base process note, and the redux2 $SOLVER flaw flagged for Ian. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l only) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… GNU parallel (T2/T3)
T2 - benchmark_orchestrator.py:run_chunk() now calls bench_lib.process.run_with_deadline with a hard ceiling (seeds x timeout_ms/1000 x 1.5 x 1.5, min 120s). On overrun the entire process group is SIGTERM'd then SIGKILL'd; the chunk is recorded as failed; the pool continues and prints a clear message.
T3 - multiprocessing.Pool replaced with GNU parallel (--jobs N --memfree 3G). Workers default to min(cpu_count//2, floor(total_RAM x 80% / 3 GB)); warn but do not refuse if --workers exceeds the safe ceiling. --full is now required to opt into GAME_CONFIGS_FULL (14-game matrix); default scope is GAME_CONFIGS_QUICK. --dry-run shows plan without needing binaries or parallel. bench_multiplicity.sh replaces xargs -P with parallel and gains the same memory-aware worker default.
New files:
- scripts/bench_lib/concurrency.py (compute_jobs() pure function + get_total_ram_bytes)
- scripts/bench_lib/test_concurrency.py (19 unit tests; all pass)
Updated files:
- scripts/benchmark_orchestrator.py
- scripts/experiments/bench_multiplicity.sh
- docs/benchmarking/active/START-HERE.md
- docs/benchmarking/active/script-inventory.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
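The two formulas in this commit can be sketched as pure functions. The function names, the reading of "3 GB" as 3 GiB and the floor of one worker are assumptions made for illustration; they are not taken from scripts/bench_lib/concurrency.py.

```python
import math

GIB = 1024 ** 3


def chunk_ceiling_s(seeds, timeout_ms, min_s=120.0):
    """Hard ceiling for one chunk: seeds x timeout x 1.5 x 1.5, at least min_s."""
    return max(min_s, seeds * (timeout_ms / 1000.0) * 1.5 * 1.5)


def default_workers(cpu_count, total_ram_bytes,
                    per_worker_bytes=3 * GIB, ram_fraction=0.8):
    """min(cpu_count // 2, floor(total_RAM x 80% / 3 GB)).

    The result is never below 1 (an assumption, so a small host still runs).
    """
    by_cpu = cpu_count // 2
    by_ram = math.floor(total_ram_bytes * ram_fraction / per_worker_bytes)
    return max(1, min(by_cpu, by_ram))


# A 10-seed chunk at 3000 ms is far below the floor, so the floor applies.
print(chunk_ceiling_s(10, 3000))      # 120.0
# 16 cores and 64 GiB: the CPU term (8) binds before the RAM term (17).
print(default_workers(16, 64 * GIB))  # 8
```

The sketch makes it easy to see which term binds on a given host: the RAM term only takes over once memory per core falls below about 1.9 GiB.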
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…idable kills Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ential deps Add docs/benchmarking/active/remote-runs.md (build + run on remote Linux, worker safety, clean-timeout notes, bench archiving). setup_remote.sh now installs GNU parallel and build-essential (apt) / parallel (brew) — they were missing and are required since Stage 2. START-HERE links to the remote guide. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Diagnoses why wall-based timeouts are load-dependent; proposes CPU-time budget (CLOCK_PROCESS_CPUTIME_ID) + solver wall safety-cap, optional --max-states for reproducible cutoffs. Decision pending (regenerating oracles is the main cost). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…terminated not killed)
Root-caused the remote 'many KILLED' runs (small --timeout under parallel load):
- Solver only handled SIGINT, so the wrapper's SIGTERM killed it with no output -> KILLED. Now solver.cpp registers SIGTERM with the same handler (graceful: sets interrupt flag -> DFS returns TERMINATED -> JSON flushed). SIGINT (^C) behaviour unchanged. main.cpp emits solution_type "terminated" for TERMINATED.
- Wrapper grace was 1.5x timeout, too tight vs fixed per-run overhead (dominated by default-cache mmap, ~0 CPU). Now total_wait = timeout + max(0.5*timeout, min_grace_s=10s); a 1000ms timeout gets an 11s window. SIGTERM stays the signal, 30s grace, then SIGKILL.
- run_benchmark.py maps "terminated" -> TERMINATED; csv_schema.md updated.
Verified: unit_tests pass; regression_level1 (x4 variants) pass; 33 bench_lib tests pass; direct SIGTERM -> exit 0 + "terminated" JSON with stats; phase-C smoke at --timeout 1000 -> 14 SOLVED / 6 TIMEOUT / 0 KILLED.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
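The arithmetic behind the new wrapper window, as a small sketch. The function names are illustrative, not the wrapper's; the formulas are the ones quoted in the commit message.

```python
def old_total_wait_s(timeout_s):
    """Previous rule: the wrapper waited 1.5x the solver timeout."""
    return 1.5 * timeout_s


def total_wait_s(timeout_s, min_grace_s=10.0):
    """New rule: timeout + max(0.5 * timeout, min_grace_s)."""
    return timeout_s + max(0.5 * timeout_s, min_grace_s)


for t in (0.3, 1.0, 20.0, 300.0):
    print(t, old_total_wait_s(t), total_wait_s(t))
```

The two rules agree from a 20 s timeout upward (0.5 x 20 s equals the 10 s floor). Below that, the fixed floor absorbs the per-run startup overhead that made the old 1.5x window too tight.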
…build quirk Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No logic bug found in build.sh (it always builds the 'solvitaire' target and incremental rebuilds work; verified). The real failure mode is a silently STALE binary when incremental detection is fooled (e.g. sources copied between trees, which is how the orchestrator integrates sub-agent work). Two safeguards:
- --clean: wipe cmake-build-<cfg>/ for a guaranteed-fresh build.
- end-of-build summary lists each binary with size + mtime, so a stale/older binary is visible at a glance (portable GNU/BSD stat).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sub-agents/runs left compiled .pyc tracked (incl. a stale one from the removed compare_benchmarks.py). Untrack all and ignore Python bytecode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When a run ends non-clean, run_solver now emits a [kill-diag] line to stderr identifying the cause: exit signal (-9 SIGKILL/OOM, -6 SIGABRT/bad_alloc, -15 unhandled SIGTERM), whether the wrapper or something external killed it, wall time, stdout bytes, and the solver's stderr tail. Script-only change — no rebuild needed; usable immediately on the remote box to diagnose the KILLED runs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…run ratio, RSS Earlier version mislabelled KILLED_HARD+SIGTERM as 'stale binary' — wrong (the solver's SIGTERM handler works; that combo is a swap-delayed shutdown). Now keyed on disposition first, and reports the wall/timeout overrun ratio + peak RSS, which are the real tells of memory-pressure/swap (solver running far past its deadline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…xternal kills /usr/bin/time/bash surface a signalled child as 128+signum, not a negative rc, so 137 (SIGKILL) was printed as a bare 'exit code 137'. Now decode it, and compare wall to the wrapper's own deadline: a SIGKILL BEFORE that deadline is external (OOM killer or 'parallel --memfree' kill+requeue), not the wrapper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
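A sketch of the decoding described here, assuming a helper that accepts either convention: negative return codes as reported by Python's subprocess, or 128+signum as surfaced by a shell or /usr/bin/time. The function names and return shapes are hypothetical, not the ones in run_benchmark.py.

```python
import signal


def decode_exit(rc):
    """Return (signum or None, description) for a raw return code."""
    if rc < 0:
        signum = -rc          # subprocess convention: -N means killed by signal N
    elif rc > 128:
        signum = rc - 128     # shell convention: 128 + N
    else:
        return None, f"exit code {rc}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a known signal number on this platform: report the raw code.
        return None, f"exit code {rc}"
    return signum, f"killed by {name}"


def classify_sigkill(wall_s, wrapper_deadline_s):
    """A SIGKILL that lands before the wrapper's own deadline was not sent by it."""
    return "external" if wall_s < wrapper_deadline_s else "wrapper"


print(decode_exit(137))
print(classify_sigkill(wall_s=4.2, wrapper_deadline_s=11.0))
```

The 128+signum reading is a heuristic, since a program may also exit with status 137 on its own. That is why the wall-versus-deadline comparison is useful as a second signal.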
…EMFREE --memfree does NOT just pause — it KILLS+requeues the youngest job when free mem drops below 50% of the size, producing exit-137-mid-run + reruns (a prime suspect for the remote KILLs). Default it OFF; bound concurrency via an accurate --jobs instead. Set BENCH_MEMFREE=3G to A/B test the old behaviour. Fixes the misleading 'just pauses' comment + banner. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lvers
The solver runs in its own session (start_new_session, for timeout killpg), so an external ^C never reached it; when parallel/python died the solver orphaned and persisted. Fixes:
- run_with_deadline: a BaseException handler SIGKILLs the child's process group if the wait is interrupted, so an interrupted run never leaves a solver behind.
- run_benchmark.py: SIGTERM is turned into KeyboardInterrupt so a parent's kill triggers the same teardown as ^C.
- bench_multiplicity.sh: trap INT/TERM to signal the whole process group (stops parallel launching more, propagates to workers).
Verified: SIGINT to a running phase-C batch (6 live solver procs) leaves 0 survivors; bench_lib unit tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
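The two Python-side fixes can be illustrated with a short sketch. The helper names are hypothetical, and the real run_with_deadline also handles deadlines and output capture; this only shows the teardown path.

```python
import os
import signal
import subprocess
import sys


def _raise_keyboard_interrupt(signum, frame):
    # Treat a parent's SIGTERM like ^C so the same cleanup path runs.
    raise KeyboardInterrupt


def install_sigterm_as_interrupt():
    """Route SIGTERM to KeyboardInterrupt (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, _raise_keyboard_interrupt)


def wait_or_teardown(proc, timeout_s=None):
    """Wait for a child started with start_new_session=True.

    If the wait is interrupted by anything (KeyboardInterrupt, SystemExit,
    TimeoutExpired, ...), SIGKILL the child's process group, reap it, and
    re-raise, so no solver outlives the wrapper.
    """
    try:
        return proc.wait(timeout=timeout_s)
    except BaseException:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # group already gone
        proc.wait()
        raise


child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
try:
    wait_or_teardown(child, timeout_s=0.5)
except subprocess.TimeoutExpired:
    print("child torn down, returncode", child.returncode)
```

Catching BaseException rather than Exception is the point of the sketch: KeyboardInterrupt and SystemExit do not derive from Exception, so a narrower handler would skip the teardown on exactly the interruptions this commit targets.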
Confirmed root cause: cgroup memory.max on the user slice (64 GiB), not host RAM. 32 multiplicity workers (~6.4GB each = (capacity/2)*128B mmap) exceeded it -> cgroup OOM-killed solvitaire; 16 rode the edge. The old worker math assumed a flat 3GB/worker against host RAM, so it allowed far too many.
concurrency.py:
- effective_memory_limit(): reads the binding limit = min(host RAM, cgroup memory.max), walking the cgroup v2 tree (finds user.slice limits) + v1 fallback.
- per_worker_bytes(cache_type, capacity): exact flat-family reservation (capacity/2 * cluster_bytes: flat 64, hash-only 16, predecessor/multiplicity 128) + overhead; LRU measured ~320 B/entry (heap-grown, workload-bounded).
- planning_worker_bytes(): a mixed run is sized by the flat-family reservation (always-resident, the OOM culprit), not LRU's worst case.
- compute_jobs(requested, limit, worker_bytes): clamp to memory-safe max + warn.
bench_multiplicity.sh + benchmark_orchestrator.py: use the above (cache types from the selected phases/solvers); print the detected limit + per-worker + max_safe; and DROP parallel --memfree (it killed+requeued jobs) now that --jobs is accurate.
On a 64 GiB slice, phase C now auto-sizes to ~7 multiplicity workers and clamps larger --workers with an OOM warning.
29 bench_lib tests pass; dry-run + a real phase-C run verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
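A sketch of the sizing logic under stated assumptions: the cluster sizes are the ones quoted in the commit, but the function signatures, the cgroup v2 walk and the omission of the per-worker overhead term are simplifications, not the code in concurrency.py. With no overhead modelled, the sketch allows 10 workers on a 64 GiB slice, so the ~7 reported above evidently includes margin that is not reproduced here.

```python
from pathlib import Path

GIB = 1024 ** 3
# Bytes per cluster for the flat cache family, as quoted in the commit message.
CLUSTER_BYTES = {"flat": 64, "hash-only": 16, "predecessor": 128, "multiplicity": 128}


def flat_family_bytes(cache_type, capacity, overhead_bytes=0):
    """Reservation for one worker: capacity/2 clusters plus optional overhead."""
    return (capacity // 2) * CLUSTER_BYTES[cache_type] + overhead_bytes


def cgroup_v2_limit(proc_cgroup="/proc/self/cgroup", root="/sys/fs/cgroup"):
    """Smallest memory.max between this process's cgroup and the root, or None."""
    try:
        lines = Path(proc_cgroup).read_text().splitlines()
    except OSError:
        return None
    rel = next((ln.split("::", 1)[1] for ln in lines if ln.startswith("0::")), None)
    if rel is None:
        return None  # no cgroup v2 entry; a v1 fallback is out of scope here
    root_path = Path(root)
    node = root_path / rel.lstrip("/")
    limits = []
    while True:
        try:
            text = (node / "memory.max").read_text().strip()
            if text != "max":  # "max" means unlimited at this level
                limits.append(int(text))
        except (OSError, ValueError):
            pass  # no readable limit at this level
        if node == root_path:
            break
        node = node.parent
    return min(limits) if limits else None


def effective_memory_limit(host_ram_bytes, cgroup_limit_bytes):
    """Binding limit = min(host RAM, cgroup limit) when a cgroup limit exists."""
    if cgroup_limit_bytes is None:
        return host_ram_bytes
    return min(host_ram_bytes, cgroup_limit_bytes)


def compute_jobs(requested, limit_bytes, worker_bytes):
    """Clamp requested workers to the memory-safe maximum; report if clamped."""
    max_safe = max(1, limit_bytes // worker_bytes)
    return min(requested, max_safe), requested > max_safe


per_worker = flat_family_bytes("multiplicity", 100_000_000)  # 6.4 GB, as in the commit
limit = effective_memory_limit(128 * GIB, 64 * GIB)
print(per_worker, compute_jobs(32, limit, per_worker))
```

Walking every ancestor matters because the binding limit in this incident sat on user.slice, several levels above the cgroup the solver actually ran in.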
solver_node carried boost::optional<lru_cache::item_list::iterator> (16 B) on every DFS frontier frame, but it is assigned only on the LRU path (Policy::computes_hash == false; used for set_non_live live-bit tracking). Move it into an empty-base cache_state_holder so flat-family nodes omit it entirely: solver_node shrinks 48 -> 32 B for flat/multiplicity, unchanged at 48 B for LRU. Guarded the one unconditional cache_state read in revert_to_last_node_with_children with if constexpr. Correctness-neutral (verified): release + debug unit_tests, regression_level1 (default/flat/hash-only/lru), trace_identity_flat/lru, trace_until_timeout, and SearchTraceAgreement all pass. Measured ~3-4% peak-RSS reduction on deep free-cell searches — modest, because per-frame memory is dominated by the separately heap-allocated child_moves vector (see KI-26), not the node struct. Docs: adds KI-26 (non-cache frontier memory = child_moves x runaway depth, deferred algorithmic work) and KI-27 (trace_regression_level1/2 fail against a stale reference binary, pre-existing and unrelated to this change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make the search budget load-invariant. --timeout is now measured in CPU time
(user+system, CLOCK_PROCESS_CPUTIME_ID) instead of wall clock, so a run gets the
same compute regardless of scheduler contention — essential for fair benchmark
comparison under parallel load. Two companions:
* --wall-cap-mult (default 10): a wall safety-cap (wall_cap_mult x timeout) so
the solver always self-terminates even if badly descheduled.
* --max-states (default 0=off): a deterministic, machine/load-independent cutoff
for bit-for-bit reproducible comparisons.
Budget checks are batched every 4096 nodes so the (syscall) CPU-clock read does not
slow the DFS hot loop.
Wrapper/orchestrator: run_benchmark.py's deadline + overrun diagnostics and the
orchestrator chunk-ceiling now key off the solver's WALL ceiling (wall_cap_mult x
timeout), not the CPU budget — otherwise a legitimately-descheduled run is killed.
csv_schema.md: timeout_ms is CPU ms; time_us remains wall.
Oracles: L2/L3 regenerated under the new semantics (all-definitive, exactly
reproducible; verified green with --enforce-node-counts, default+flat+hash-only+lru).
L1 left unchanged (timeout-insensitive). L4/L5 NOT regenerated — their outcome-only
comparison is untrustworthy with the unsound `both` streamliner (see KI-29).
Decisions (with Ian): CPU=user+sys; add --max-states; K=10; switch now + regenerate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
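A Python sketch of the budget scheme described in this commit: a CPU-time limit, a wall safety-cap at wall_cap_mult x timeout, an optional state cutoff, and clock reads batched every 4096 nodes. The solver itself is C++; the class and method names here are hypothetical, and checking the state cutoff on every node is an assumption. Python's time.process_time() reports user+system CPU time for the process, which stands in for CLOCK_PROCESS_CPUTIME_ID.

```python
import time

CHECK_INTERVAL = 4096  # nodes between clock reads, as in the commit message


class Budget:
    """Search budget: CPU-time limit, wall safety-cap, optional state cutoff."""

    def __init__(self, cpu_timeout_s, wall_cap_mult=10, max_states=0):
        self.cpu_deadline = time.process_time() + cpu_timeout_s
        self.wall_deadline = time.monotonic() + wall_cap_mult * cpu_timeout_s
        self.max_states = max_states  # 0 means off
        self.states = 0

    def exhausted(self):
        """Call once per visited node; returns a reason string or None."""
        self.states += 1
        # The state cutoff is a cheap integer compare, so it runs every node
        # (assumption) and stops at exactly the same node on any machine.
        if self.max_states and self.states >= self.max_states:
            return "max-states"
        # Clock reads cost a syscall, so only do them every CHECK_INTERVAL nodes.
        if self.states % CHECK_INTERVAL:
            return None
        if time.process_time() >= self.cpu_deadline:
            return "cpu-timeout"
        if time.monotonic() >= self.wall_deadline:
            return "wall-cap"
        return None


budget = Budget(cpu_timeout_s=1000.0, max_states=100)
reason = None
while reason is None:
    reason = budget.exhausted()
print(reason, budget.states)
```

Because the clocks are only sampled on multiples of 4096 nodes, a CPU-timed run can overshoot its budget by up to one batch, whereas the --max-states style cutoff stops at an exact node count.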
…igation
- KI-27: trace_regression_level1/2 red against a stale reference binary (commit not in HEAD history); pre-existing, unrelated to recent changes; needs re-baseline triage.
- KI-28: benchmark worker sizing cannot bound a single runaway worker (cache-based budget vs the unbounded child_moves frontier under a 64 GiB cgroup); proposes per-worker systemd-run MemoryMax.
- KI-29: L4/L5 outcome-only regression is untrustworthy with the unsound `both` streamliner (outcome can vary by search ordering); options to fix.
- PICKUP: record the OOM root-cause characterisation, cache_state fix, CPU-time timeout, oracle regen status, and the new open decisions.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…linux pending) trace_regression_level1/2 were red against a stale reference (commit no longer in branch history after PR squash-merges). Triage confirmed the divergence is path-only — correct outcomes, and the current binary's node counts match the validated level-1 oracle; the reference simply predated the KI-23 / Stage-6 multiplicity suit-symmetry changes. So this re-baselines the reference rather than chasing a non-bug. Repoints the CMake TRACE_REF_BIN default (mac + linux) to dated names built from this binary. The mac reference (solvitaire-trace-reference-mac-arm64-20260603-7eb5883) is built and verified: trace_regression_level1 (36s) and level2 (184s) both pass. NOTE: existing cmake-build-trace dirs cache the old TRACE_REF_BIN — clear with `cmake -DTRACE_REF_BIN=… cmake-build-trace` or delete the build dir. Reference binaries live in (untracked) 05-Executables/reference and are distributed out-of-band. The Linux ARM64 reference (same dated name) still needs a container rebuild (`scripts/container-build.sh --extract-trace-binary`); see 05-Executables/reference/README.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…host verification Triage confirmed legitimate path-only drift, not a regression. Reference re-baselined (36162ea); trace_regression_level1/2 pass on mac. Linux arm64 binary built via container, verification on a Linux host still pending. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Dockerfile) Apptainer can't consume the Dockerfile directly, so this mirrors it for hosts that have apptainer (HPC / shared boxes) rather than container/docker/podman. Builds the release + trace configs; includes build/run/extract commands in the header. container-build.sh does not drive apptainer — this is a manual companion. Build: apptainer build --fakeroot solvitaire.sif solvitaire.def Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…chmark runs) bench_multiplicity.sh requires GNU parallel as its worker engine, but neither the Dockerfile nor solvitaire.def installed it. Add it to both so the benchmark scripts can run inside the container (apptainer/docker), writing results to a bound host dir via --outdir. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
On many-core machines the container builds were doubly serial: `cmake --build` had no
--parallel (source files compiled one at a time) and `-flto` (plain) ran LTRANS serially
("using serial compilation of N LTRANS jobs"). Add `--parallel` to the cmake --build
invocations in Dockerfile + solvitaire.def, and switch CMAKE_CXX_FLAGS_RELEASE to
-flto=auto so link-time optimisation uses all cores.
Build-only change: behaviour, node counts/oracles and traces are unaffected (verified a
release build + klondike seed 1 = 158295 states, unchanged). Caveat: parallel compile +
parallel LTO spike build-time RAM; cap with `--parallel N` on small nodes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documents Option C — building solvitaire.def into a SIF and running bench_multiplicity.sh in-container with --outdir to a host-bound dir, plus the apptainer/SLURM specifics: cgroup-aware sizing detects the allocation limit, container-build.sh doesn't drive apptainer, the phase-D LRU per-worker memory caveat (KI-28), --fakeroot, and not running ctest in the read-only SIF. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ations black-hole is inherently suit-symmetric, so the right streamliner is suit-symmetry (not auto-foundations). Updated all 4 phase definitions (A/B/C/D) and the header comment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ld.sh
container-build.sh was OCI-only (build -t / run -m). Add:
- Runtime detection for apptainer/singularity (RUNTIME_KIND oci|apptainer); override via CONTAINER_RUNTIME. Apptainer builds the .sif from solvitaire.def and runs the existing flags via `apptainer exec`. OCI path preserved unchanged.
- --editable: build the image ONCE (reuse if present) and bind the live host repo at /workspace, building binaries into the host-bound cmake-build-* dirs. Script edits need no rebuild; C++ edits only an incremental compile. For dev/benchmarking, not production.
- Refactored run/build into build_image() + run_in() helpers branching on runtime.
- Fixed the stale Linux trace reference path to the KI-27 re-baseline name (…-20260603-7eb5883).
APPTAINER_BUILD_OPTS (default --fakeroot) for unprivileged builds.
remote-runs.md updated (the script now drives apptainer; --editable documented; caveat that editable shares host cmake-build-* dirs, so don't mix cross-arch builds in one checkout).
NOTE: syntax/--help verified on macOS; the apptainer and OCI-editable run paths are NOT exercised here (they trigger real builds); verify on the Linux/apptainer box.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… benchmark-rationalisation
All conflicts resolved in favour of this branch:
- CMakeLists.txt: newer trace ref (-20260603-7eb5883) + -flto=auto + CPU-time comment
- container-build.sh: full Apptainer/editable rewrite supersedes dev's 2-line ref update
- level2/3 oracles (main + hash_only): CPU-time regen already incorporates KI-23 (binary had inherent_suit_symmetry); our versions are a superset
# Conflicts:
# CMakeLists.txt
# scripts/container-build.sh
# tests/oracles/level2.json
# tests/oracles/level2_hash_only.json
# tests/oracles/level3.json
# tests/oracles/level3_hash_only.json
Summary
Rationalises and robustifies the benchmark tooling in the solver repo. Two stages of work across ~40 commits.
Stage 1: Cleanup
- Removed scripts (`benchmark_level4.py`, `analyze_level4.py`, deprecated `compare_benchmarks.py`)
- `csv_schema.md` contract (outcome vocabulary, `solution_type` values)
- `START-HERE.md` + `script-inventory.md` guides; superseded docs archived to KB
- `.gitignore`: `.claude/settings.local.json` + `__pycache__`
Stage 2: Kill discipline, safety, CPU-time timeout
- Kill discipline (`scripts/bench_lib/process.py`): solver and wrapper share a POSIX process group; on overrun SIGTERM -> grace -> SIGKILL; always captures partial JSON output. Fixes lost-work-on-kill.
- `--timeout` is now CPU time (CLOCK_PROCESS_CPUTIME_ID, user+sys) + 10x wall safety-cap (`--wall-cap-mult`) + deterministic `--max-states` cutoff. Load-invariant; L2/L3 oracles regenerated under CPU semantics (`--enforce-node-counts`).
- Solver handles SIGTERM gracefully and emits `solution_type: "terminated"` before exit.
- … `--memfree` safety net. Replaces `cpu_count()` fork-bomb default.
- `cache_state` dead-field drop: removed unused LRU iterator from flat/multiplicity DFS frames (solver_node 48 -> 32 B); ~3-4% RSS reduction on deep searches.
- `solvitaire.def` + apptainer path in `container-build.sh` + `--editable` mode (bind live host repo, no rebuild on edits). GNU parallel added to both images; builds parallelised (`--parallel`, `-flto=auto`).
- `[kill-diag]` logging with signal, RSS, overrun ratio on every non-clean exit.
- Trace reference re-baselined to `20260603-7eb5883`; `trace_regression_level1/2` pass. Linux verification pending.
- Remote-runs guide (`docs/benchmarking/active/remote-runs.md`): copy-pasteable build + benchmark workflow for a remote Linux box, including Apptainer Option C.
Acceptance (T11):
- `bench_multiplicity.sh --phase D --seeds 1-5 --games klondike` at both 3s and 300ms timeouts: zero KILLED/TERMINATED/FAILED; timeouts return clean TIMEOUT with full partial stats.
Open items (deferred, not blocking)
- KI-28: per-worker `systemd-run MemoryMax` so a single runaway dies cleanly (deferred)
- KI-29: unsound `both` streamliner (methodology decision pending; L4/L5 oracles deliberately not regenerated)
Test plan
- `bench_multiplicity.sh` clean run, zero avoidable kills
🤖 Generated with Claude Code