benchmark-rationalisation: tooling cleanup, kill discipline, CPU-time timeout, containers by turingfan · Pull Request #9 · turingfan/ReSolvitaire

turingfan · 2026-06-12T16:11:30Z

Summary

Rationalises and robustifies the benchmark tooling in the solver repo. Two stages of work across ~40 commits.

Stage 1 — Cleanup

Removed one-off scripts (benchmark_level4.py, analyze_level4.py, deprecated compare_benchmarks.py)
Authoritative csv_schema.md contract (outcome vocabulary, solution_type values)
New START-HERE.md + script-inventory.md guides; superseded docs archived to KB
.gitignore: .claude/settings.local.json + __pycache__

Stage 2 — Kill discipline, safety, CPU-time timeout

Shared process-group kill discipline (scripts/bench_lib/process.py): solver and wrapper share a POSIX process group; on overrun SIGTERM→grace→SIGKILL; always captures partial JSON output. Fixes lost-work-on-kill.
--timeout is now CPU time (CLOCK_PROCESS_CPUTIME_ID, user+sys) + 10x wall safety-cap (--wall-cap-mult) + deterministic --max-states cutoff. Load-invariant; L2/L3 oracles regenerated under CPU semantics (--enforce-node-counts).
Solver handles SIGTERM gracefully: registers SIGTERM handler, flushes JSON with solution_type:"terminated" before exit.
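Bounded memory-aware concurrency via GNU parallel: initially a ~3 GB/worker budget with a floor(RAM*0.80/3GB) cap and a --memfree safety net; later commits replace this with cgroup-aware, per-cache-type sizing and drop --memfree (it killed and requeued jobs). Replaces cpu_count() fork-bomb default.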
Per-chunk orchestrator timeout with process-group kill so wedged chunks cannot stall a run.
cache_state dead-field drop: removed unused LRU iterator from flat/multiplicity DFS frames (solver_node 48->32 B); ~3-4% RSS reduction on deep searches.
Apptainer/Singularity support: solvitaire.def + apptainer path in container-build.sh + --editable mode (bind live host repo, no rebuild on edits). GNU parallel added to both images; builds parallelised (--parallel, -flto=auto).
Kill diagnostics: [kill-diag] logging with signal, RSS, overrun ratio on every non-clean exit.
KI-27 resolved (mac): trace reference re-baselined to 20260603-7eb5883; trace_regression_level1/2 pass. Linux verification pending.
Remote-runs guide (docs/benchmarking/active/remote-runs.md): copy-pasteable build + benchmark workflow for a remote Linux box, including Apptainer Option C.

Acceptance (T11): bench_multiplicity.sh --phase D --seeds 1-5 --games klondike at both 3s and 300ms timeouts — zero KILLED/TERMINATED/FAILED; timeouts return clean TIMEOUT with full partial stats.

Open items (deferred, not blocking)

KI-26: non-cache DFS frontier dominates RAM on hard instances (algorithmic, out of scope)
KI-28: per-worker systemd-run MemoryMax so a single runaway dies cleanly (deferred)
KI-29: L4/L5 outcome-only regression untrustworthy with the unsound `both` streamliner (methodology decision pending; L4/L5 oracles deliberately not regenerated)
Linux trace-ref verification still pending on a Linux host

Test plan

Gate 1 (release): unit_tests + regression_level1 x4 — all pass
Gate 2 (trace): unit_tests + trace_identity + trace_until_timeout + trace_mult_vs_flat x7 + trace_regression_level1 (150/150) + trace_regression_level2 (160/160) — all pass
Gate 3 (debug): unit_tests — pass
T11 acceptance: bench_multiplicity.sh clean run, zero avoidable kills

🤖 Generated with Claude Code

Delete scripts/benchmark_level4.py and scripts/analyze_level4.py. These had hardcoded paths, an EXCLUDED_GAMES list from ad-hoc instructions, an incompatible CSV schema (flat_*/hash_* columns), and a SIGKILL-on-timeout bug that discarded all solver output, recording nodes: 0. The redux and unwinnable experiment pairs are left intact pending orchestrator review — see commit message body for analysis. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Bring csv_schema.md up to full contract status: add explicit column numbers, source fields, full outcome-vocabulary flow table (solver JSON string → CSV value → bench-hook bucket), semantic distinction between TIMEOUT / TERMINATED / KILLED, and a conformance-gaps section listing five discrepancies found by code inspection (SIGTERM target, UNKNOWN not counted by hook, smart-solvability vocabulary, time_us rounding, cache_capacity blank sentinel). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Keep local Claude Code settings out of the repo (the file should stay deleted from tracking even though a local copy persists on disk), and ignore transient agent worktrees. Prevents the file re-entering the index as an unmerged conflict. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Stage 1 step 7: START-HERE.md — single entry point answering "which script for which job" with copy-pasteable commands, a known-rough-edges warning box (worker-count safety, SIGTERM caveat, compare_benchmarks.py under review), and links to csv_schema.md and the rationalisation plan. Stage 1 step 8: script-inventory.md — table of every surviving benchmarking script with one-line job, inputs, outputs, and status (core / experiment / stage2-eval / dedup-candidate / analysis / out-of-scope). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Relocate docs/benchmarking/archive/ (9 superseded mac-dev-benchmark-enhancements and legacy comparison/implementation docs) to 01-Knowledge-Base/Archive/benchmarking-legacy-docs/ per the repo rule that completed/superseded docs live in the Knowledge-Base, not the solver repo. README now points at START-HERE and the active doc set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PICKUP records Stage 1 completion, kept-pair decisions, and carried-forward findings. Stage 2 plan breaks the work into a sequential kill-discipline/concurrency spine plus parallel docs/decision tasks, with acceptance criteria (clean bench_multiplicity.sh run), an agent-brief template, and open questions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…analysis/*.R) The script was already marked DEPRECATED at the top (pointing to design.md §7) and design.md §10 listed it as deprecated. The R analysis layer (benchmark.R, summary.R, compare_labels.R) fully covers its useful function — baseline vs current comparison — with superior statistics (bootstrap CI, Wilcoxon test, HTML scatter plot). The only unique capability (RHT hardware normalization) is a legacy concept not used in the current CSV-based workflow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

New scripts/bench_lib/process.py: run_with_deadline() spawns the solver in its own session (start_new_session=True) so the solver AND any /usr/bin/time wrapper share a process group. Solver --timeout is authoritative; Python deadline = 1.5x (D1, no floor/cap); on overrun SIGTERM the group -> 30s grace -> SIGKILL the group -> reap. Partial stdout is always captured; disposition distinguishes EXITED_OK/EXITED_ERR/KILLED_AFTER_SIGTERM/KILLED_HARD. 13 unit tests (fake child processes) pass.

run_benchmark.py: run_solver() delegates to the helper (fixes the orphaned-grandchild bug); RSS falls back to solver_resident_bytes then 0, never fabricated; solver "failed" maps to FAILED (no bare UNKNOWN for real outcomes).

csv_schema.md: reconciled with the above (FAILED value, 1.5x, RSS fallback, process-group kill); conformance gaps #1 and #3 marked resolved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
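The overrun sequence described above (SIGTERM the group, wait out the grace period, SIGKILL the group, reap, keep partial stdout) can be sketched in Python roughly as follows. This is a minimal illustration, not the contents of scripts/bench_lib/process.py: the name `run_with_deadline_sketch`, its parameters, and the way outcomes are mapped onto the four disposition names are all assumptions.

```python
import os
import signal
import subprocess


def _signal_group(pgid, sig):
    """Signal a whole process group, tolerating a group that is already gone."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass


def run_with_deadline_sketch(cmd, deadline_s, grace_s=30.0):
    """Run cmd in its own session and enforce a wall deadline on its group.

    Returns (disposition, returncode, stdout_bytes). Hypothetical helper:
    the disposition mapping below is an assumption, not the repo's logic.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        start_new_session=True,  # child leads a new session and process group
    )
    disposition = None
    try:
        out, _ = proc.communicate(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        # Overrun: ask the whole group to stop, then wait out the grace period.
        _signal_group(proc.pid, signal.SIGTERM)
        try:
            # Retrying communicate() after a timeout does not lose output.
            out, _ = proc.communicate(timeout=grace_s)
            disposition = "KILLED_AFTER_SIGTERM"
        except subprocess.TimeoutExpired:
            # Still alive after the grace period: hard-kill the group and reap.
            _signal_group(proc.pid, signal.SIGKILL)
            out, _ = proc.communicate()
            disposition = "KILLED_HARD"
    if disposition is None:
        disposition = "EXITED_OK" if proc.returncode == 0 else "EXITED_ERR"
    return disposition, proc.returncode, out
```

Because the child is the leader of its own session, its PID doubles as the process-group ID, so one `os.killpg` call reaches the solver and any wrapper process started alongside it.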

bench_level5_unwinnable2.sh removed — one-off scratch variant (all standard run_variant calls commented out, only solvitaire-hash-only invocation kept); adds no reusable capability over the base script. tuesday-night-redux{,2}.sh kept — the 2 variant has a genuinely unique capability (runs all four solver variants with --label flags for per-variant comparison) but has a $SOLVER env-override design quirk and alina excluded; merging is not clean, so both are kept and each now carries a one-line header explaining what distinguishes it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Inventory: add bench_lib/process.py (core lib), drop removed compare_benchmarks.py, fold redux2 into experiments with note, record unwinnable2 removal, drop the dedup-candidate section. PICKUP: Stage 2 progress, remaining tasks, the worktree-stale-base process note, and the redux2 $SOLVER flaw flagged for Ian. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… GNU parallel (T2/T3)

T2 - benchmark_orchestrator.py:run_chunk() now calls bench_lib.process.run_with_deadline with a hard ceiling (seeds x timeout_ms/1000 x 1.5 x 1.5, min 120s). On overrun the entire process group is SIGTERM'd then SIGKILL'd; the chunk is recorded as failed; the pool continues and prints a clear message.

T3 - multiprocessing.Pool replaced with GNU parallel (--jobs N --memfree 3G). Workers default to min(cpu_count//2, floor(total_RAM x 80% / 3 GB)); warn but do not refuse if --workers exceeds the safe ceiling.

--full is now required to opt into GAME_CONFIGS_FULL (14-game matrix); default scope is GAME_CONFIGS_QUICK. --dry-run shows plan without needing binaries or parallel.

bench_multiplicity.sh replaces xargs -P with parallel and gains the same memory-aware worker default.

New files:
  scripts/bench_lib/concurrency.py (compute_jobs() pure function + get_total_ram_bytes)
  scripts/bench_lib/test_concurrency.py (19 unit tests; all pass)

Updated files:
  scripts/benchmark_orchestrator.py
  scripts/experiments/bench_multiplicity.sh
  docs/benchmarking/active/START-HERE.md
  docs/benchmarking/active/script-inventory.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
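The two formulas in this commit (the chunk ceiling and the default worker count) are easy to misread in prose, so here is a small worked sketch. The function names are hypothetical, "3 GB" is taken as 3 GiB, and the floor of one worker is an added assumption; none of this is the code in scripts/bench_lib/concurrency.py.

```python
import math

GIB = 1024 ** 3


def chunk_ceiling_s(seeds, timeout_ms, min_s=120.0):
    """Hard wall ceiling for one chunk: seeds x timeout x 1.5 x 1.5, at least min_s."""
    return max(min_s, seeds * timeout_ms / 1000 * 1.5 * 1.5)


def default_workers_sketch(total_ram_bytes, cpu_count,
                           per_worker_bytes=3 * GIB, ram_fraction=0.80):
    """min(cpu_count // 2, floor(RAM x 80% / per-worker budget)), never below 1."""
    by_cpu = max(1, cpu_count // 2)
    by_ram = max(1, math.floor(total_ram_bytes * ram_fraction / per_worker_bytes))
    return min(by_cpu, by_ram)


# A 5-seed chunk at 3000 ms computes to 33.75 s, so the 120 s minimum applies.
print(chunk_ceiling_s(5, 3000))                 # 120.0
# A 100-seed chunk at 3000 ms: 100 x 3 x 2.25 = 675 s.
print(chunk_ceiling_s(100, 3000))               # 675.0
# 32 cores with 16 GiB: RAM allows floor(12.8 / 3) = 4, cores allow 16.
print(default_workers_sketch(16 * GIB, 32))     # 4
```

On small-memory hosts the RAM term binds long before the core count does, which is the point of replacing the cpu_count() default.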

…ential deps Add docs/benchmarking/active/remote-runs.md (build + run on remote Linux, worker safety, clean-timeout notes, bench archiving). setup_remote.sh now installs GNU parallel and build-essential (apt) / parallel (brew) — they were missing and are required since Stage 2. START-HERE links to the remote guide. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Diagnoses why wall-based timeouts are load-dependent; proposes CPU-time budget (CLOCK_PROCESS_CPUTIME_ID) + solver wall safety-cap, optional --max-states for reproducible cutoffs. Decision pending (regenerating oracles is the main cost). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…terminated not killed)

Root-caused the remote 'many KILLED' runs (small --timeout under parallel load):

- Solver only handled SIGINT, so the wrapper's SIGTERM killed it with no output -> KILLED. Now solver.cpp registers SIGTERM with the same handler (graceful: sets interrupt flag -> DFS returns TERMINATED -> JSON flushed). SIGINT (^C) behaviour unchanged. main.cpp emits solution_type "terminated" for TERMINATED.
- Wrapper grace was 1.5x timeout, too tight vs fixed per-run overhead (dominated by default-cache mmap, ~0 CPU). Now total_wait = timeout + max(0.5*timeout, min_grace_s=10s); a 1000ms timeout gets an 11s window. SIGTERM stays the signal, 30s grace, then SIGKILL.
- run_benchmark.py maps "terminated" -> TERMINATED; csv_schema.md updated.

Verified: unit_tests pass; regression_level1 (x4 variants) pass; 33 bench_lib tests pass; direct SIGTERM -> exit 0 + "terminated" JSON with stats; phase-C smoke at --timeout 1000 -> 14 SOLVED / 6 TIMEOUT / 0 KILLED.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
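To make the new wait window concrete, the arithmetic can be written out as below. The function names are hypothetical; only the formula and the 10 s minimum grace come from the commit message.

```python
def total_wait_s(timeout_s, min_grace_s=10.0):
    """Wrapper wait before SIGTERM: the timeout plus max(half of it, min_grace_s)."""
    return timeout_s + max(0.5 * timeout_s, min_grace_s)


def old_wait_s(timeout_s):
    """Previous rule for comparison: a flat 1.5x the timeout."""
    return 1.5 * timeout_s


# A 1000 ms timeout used to get 1.5 s in total and now gets 11 s.
print(old_wait_s(1.0), total_wait_s(1.0))        # 1.5 11.0
# For long timeouts the proportional term dominates and both rules agree.
print(old_wait_s(300.0), total_wait_s(300.0))    # 450.0 450.0
```

The fixed floor matters only for short timeouts, which is where the fixed per-run overhead was large relative to the budget.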

No logic bug found in build.sh (it always builds the 'solvitaire' target and incremental rebuilds work; verified). The real failure mode is a silently STALE binary when incremental detection is fooled (e.g. sources copied between trees, which is how the orchestrator integrates sub-agent work). Two safeguards:

- --clean: wipe cmake-build-<cfg>/ for a guaranteed-fresh build.
- end-of-build summary lists each binary with size + mtime, so a stale/older binary is visible at a glance (portable GNU/BSD stat).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Sub-agents/runs left compiled .pyc tracked (incl. a stale one from the removed compare_benchmarks.py). Untrack all and ignore Python bytecode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

When a run ends non-clean, run_solver now emits a [kill-diag] line to stderr identifying the cause: exit signal (-9 SIGKILL/OOM, -6 SIGABRT/bad_alloc, -15 unhandled SIGTERM), whether the wrapper or something external killed it, wall time, stdout bytes, and the solver's stderr tail. Script-only change — no rebuild needed; usable immediately on the remote box to diagnose the KILLED runs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…run ratio, RSS Earlier version mislabelled KILLED_HARD+SIGTERM as 'stale binary' — wrong (the solver's SIGTERM handler works; that combo is a swap-delayed shutdown). Now keyed on disposition first, and reports the wall/timeout overrun ratio + peak RSS, which are the real tells of memory-pressure/swap (solver running far past its deadline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…xternal kills /usr/bin/time and bash surface a signalled child as 128+signum, not a negative rc, so 137 (SIGKILL) was printed as a bare 'exit code 137'. Now decode it, and compare wall to the wrapper's own deadline: a SIGKILL BEFORE that deadline is external (OOM killer or 'parallel --memfree' kill+requeue), not the wrapper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
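A minimal sketch of the decoding described above, assuming hypothetical helper names (`decode_signal`, `kill_origin`) rather than the actual run_benchmark.py code. One caveat is that a plain exit status above 128 is indistinguishable from the shell convention, so the result is a best guess.

```python
import signal


def decode_signal(rc):
    """Return the signal name implied by a return code, or None."""
    if rc < 0:
        signum = -rc          # subprocess convention: killed by signal N -> -N
    elif rc > 128:
        signum = rc - 128     # shell and /usr/bin/time convention: 128 + N
    else:
        return None
    try:
        return signal.Signals(signum).name
    except ValueError:
        return None           # not a known signal number on this platform


def kill_origin(rc, wall_s, wrapper_deadline_s):
    """Guess who sent a SIGKILL: 'external', 'wrapper', or None if no SIGKILL."""
    if decode_signal(rc) != "SIGKILL":
        return None
    # The wrapper only escalates after its own deadline, so an earlier
    # SIGKILL must have come from somewhere else (for example the OOM killer).
    return "external" if wall_s < wrapper_deadline_s else "wrapper"


print(decode_signal(137))              # SIGKILL
print(kill_origin(137, 4.0, 11.0))     # external
```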

…EMFREE --memfree does NOT just pause: it KILLS+requeues the youngest job when free memory drops below 50% of the --memfree size, producing exit-137-mid-run + reruns (a prime suspect for the remote KILLs). Default it OFF; bound concurrency via an accurate --jobs instead. Set BENCH_MEMFREE=3G to A/B test the old behaviour. Fixes the misleading 'just pauses' comment + banner. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…lvers

The solver runs in its own session (start_new_session, for timeout killpg), so an external ^C never reached it: when parallel/python died the solver orphaned and persisted. Fixes:

- run_with_deadline: a BaseException handler SIGKILLs the child's process group if the wait is interrupted, so an interrupted run never leaves a solver behind.
- run_benchmark.py: SIGTERM is turned into KeyboardInterrupt so a parent's kill triggers the same teardown as ^C.
- bench_multiplicity.sh: trap INT/TERM to signal the whole process group (stops parallel launching more, propagates to workers).

Verified: SIGINT to a running phase-C batch (6 live solver procs) leaves 0 survivors; bench_lib unit tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
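The first two fixes can be illustrated with a short Python sketch. The context manager and handler names are hypothetical, and the real run_with_deadline may structure its teardown differently; the sketch only shows the pattern of killing the child's process group on any interruption and routing SIGTERM through the same path as ^C.

```python
import contextlib
import os
import signal
import subprocess


@contextlib.contextmanager
def kill_group_on_interrupt(proc):
    """SIGKILL the child's process group if the body is interrupted, then re-raise."""
    try:
        yield proc
    except BaseException:  # includes KeyboardInterrupt and SystemExit
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited
        proc.wait()  # reap so no zombie is left behind
        raise


def sigterm_as_keyboard_interrupt(signum, frame):
    """Signal handler: make a parent's SIGTERM follow the ^C teardown path."""
    raise KeyboardInterrupt


def install_sigterm_teardown():
    """Install the handler above; must be called from the main thread."""
    signal.signal(signal.SIGTERM, sigterm_as_keyboard_interrupt)
```

A caller would start the solver with `subprocess.Popen(cmd, start_new_session=True)` and perform the wait inside `with kill_group_on_interrupt(proc):`, so that an interrupt during the wait cannot leave the detached session running.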

Confirmed root cause: cgroup memory.max on the user slice (64 GiB), not host RAM. 32 multiplicity workers (~6.4GB each = (capacity/2)*128B mmap) exceeded it -> cgroup OOM-killed solvitaire; 16 rode the edge. The old worker math assumed a flat 3GB/worker against host RAM, so it allowed far too many.

concurrency.py:

- effective_memory_limit(): reads the binding limit = min(host RAM, cgroup memory.max), walking the cgroup v2 tree (finds user.slice limits) + v1 fallback.
- per_worker_bytes(cache_type, capacity): exact flat-family reservation (capacity/2 * cluster_bytes: flat 64, hash-only 16, predecessor/multiplicity 128) + overhead; LRU measured ~320 B/entry (heap-grown, workload-bounded).
- planning_worker_bytes(): a mixed run is sized by the flat-family reservation (always-resident, the OOM culprit), not LRU's worst case.
- compute_jobs(requested, limit, worker_bytes): clamp to memory-safe max + warn.

bench_multiplicity.sh + benchmark_orchestrator.py: use the above (cache types from the selected phases/solvers); print the detected limit + per-worker + max_safe; and DROP parallel --memfree (it killed+requeued jobs) now that --jobs is accurate.

On a 64 GiB slice, phase C now auto-sizes to ~7 multiplicity workers and clamps larger --workers with an OOM warning. 29 bench_lib tests pass; dry-run + a real phase-C run verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
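The sizing logic above can be sketched as pure functions. All names here are hypothetical stand-ins rather than the concurrency.py API, the per-worker overhead term is omitted because its value is not given, and the capacity of 100,000,000 entries is an assumption inferred from the ~6.4 GB figure.

```python
GIB = 1024 ** 3

# Bytes per cluster for each flat-family cache type, as listed in the commit.
CLUSTER_BYTES = {"flat": 64, "hash-only": 16, "predecessor": 128, "multiplicity": 128}


def flat_family_reservation(cache_type, capacity):
    """Always-resident mmap reservation: capacity/2 clusters of fixed size."""
    return (capacity // 2) * CLUSTER_BYTES[cache_type]


def parse_memory_max(text):
    """A cgroup v2 memory.max file holds either 'max' or a byte count."""
    value = text.strip()
    return None if value == "max" else int(value)


def effective_limit(host_ram_bytes, cgroup_limits):
    """Binding limit: the smallest of host RAM and every finite cgroup limit."""
    finite = [limit for limit in cgroup_limits if limit is not None]
    return min([host_ram_bytes] + finite)


def clamp_jobs(requested, limit_bytes, worker_bytes):
    """Return (jobs, clamped?) with jobs capped at the memory-safe maximum."""
    max_safe = max(1, limit_bytes // worker_bytes)
    return min(requested, max_safe), requested > max_safe


per_worker = flat_family_reservation("multiplicity", 100_000_000)
limit = effective_limit(256 * GIB, [None, parse_memory_max("68719476736\n")])
print(per_worker)                        # 6400000000
print(clamp_jobs(32, limit, per_worker))  # (10, True)
```

With no overhead or headroom the sketch allows 10 workers on a 64 GiB slice; the ~7 reported in the commit presumably reflects the overhead and safety margin that this illustration leaves out.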

solver_node carried boost::optional<lru_cache::item_list::iterator> (16 B) on every DFS frontier frame, but it is assigned only on the LRU path (Policy::computes_hash == false; used for set_non_live live-bit tracking). Move it into an empty-base cache_state_holder so flat-family nodes omit it entirely: solver_node shrinks 48 -> 32 B for flat/multiplicity, unchanged at 48 B for LRU. Guarded the one unconditional cache_state read in revert_to_last_node_with_children with if constexpr. Correctness-neutral (verified): release + debug unit_tests, regression_level1 (default/flat/hash-only/lru), trace_identity_flat/lru, trace_until_timeout, and SearchTraceAgreement all pass. Measured ~3-4% peak-RSS reduction on deep free-cell searches — modest, because per-frame memory is dominated by the separately heap-allocated child_moves vector (see KI-26), not the node struct. Docs: adds KI-26 (non-cache frontier memory = child_moves x runaway depth, deferred algorithmic work) and KI-27 (trace_regression_level1/2 fail against a stale reference binary, pre-existing and unrelated to this change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Make the search budget load-invariant. --timeout is now measured in CPU time (user+system, CLOCK_PROCESS_CPUTIME_ID) instead of wall clock, so a run gets the same compute regardless of scheduler contention, which is essential for fair benchmark comparison under parallel load. Two companions:

* --wall-cap-mult (default 10): a wall safety-cap (wall_cap_mult x timeout) so the solver always self-terminates even if badly descheduled.
* --max-states (default 0=off): a deterministic, machine/load-independent cutoff for bit-for-bit reproducible comparisons.

Budget checks are batched every 4096 nodes so the (syscall) CPU-clock read does not slow the DFS hot loop.

Wrapper/orchestrator: run_benchmark.py's deadline + overrun diagnostics and the orchestrator chunk-ceiling now key off the solver's WALL ceiling (wall_cap_mult x timeout), not the CPU budget; otherwise a legitimately-descheduled run is killed. csv_schema.md: timeout_ms is CPU ms; time_us remains wall.

Oracles: L2/L3 regenerated under the new semantics (all-definitive, exactly reproducible; verified green with --enforce-node-counts, default+flat+hash-only+lru). L1 left unchanged (timeout-insensitive). L4/L5 NOT regenerated: their outcome-only comparison is untrustworthy with the unsound `both` streamliner (see KI-29).

Decisions (with Ian): CPU=user+sys; add --max-states; K=10; switch now + regenerate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
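The budget rules above (CPU-time limit, wall safety-cap, optional state cutoff, clock reads batched every 4096 nodes) can be illustrated with a Python analogue. The solver itself is C++, so this is only a sketch: the class name, the reason strings, and the choice to check --max-states on every node rather than per batch are assumptions. `time.process_time()` returns user+system CPU time for the process, which stands in for CLOCK_PROCESS_CPUTIME_ID.

```python
import time


class SearchBudget:
    """Hypothetical budget checker mirroring the three cutoffs described above."""

    def __init__(self, cpu_timeout_s, wall_cap_mult=10, max_states=0,
                 check_every=4096, cpu_clock=time.process_time,
                 wall_clock=time.monotonic):
        self.cpu_timeout_s = cpu_timeout_s
        self.wall_cap_s = wall_cap_mult * cpu_timeout_s
        self.max_states = max_states          # 0 means the cutoff is off
        self.check_every = check_every
        self._cpu_clock = cpu_clock
        self._wall_clock = wall_clock
        self._cpu_start = cpu_clock()
        self._wall_start = wall_clock()

    def exceeded(self, states):
        """Return the reason the budget is spent, or None to keep searching."""
        if self.max_states and states >= self.max_states:
            return "max-states"               # deterministic, load-independent
        if states % self.check_every != 0:
            return None                       # skip clock reads between batches
        if self._cpu_clock() - self._cpu_start >= self.cpu_timeout_s:
            return "cpu-timeout"
        if self._wall_clock() - self._wall_start >= self.wall_cap_s:
            return "wall-cap"
        return None
```

A search loop would call `exceeded(states)` once per expanded node and stop on the first non-None result. Injecting the clocks as parameters keeps the check testable without real delays.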

…igation

- KI-27: trace_regression_level1/2 red against a stale reference binary (commit not in HEAD history); pre-existing, unrelated to recent changes; needs re-baseline triage.
- KI-28: benchmark worker sizing cannot bound a single runaway worker (cache-based budget vs the unbounded child_moves frontier under a 64 GiB cgroup); proposes per-worker systemd-run MemoryMax.
- KI-29: L4/L5 outcome-only regression is untrustworthy with the unsound `both` streamliner (outcome can vary by search ordering); options to fix.
- PICKUP: record the OOM root-cause characterisation, cache_state fix, CPU-time timeout, oracle regen status, and the new open decisions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…linux pending) trace_regression_level1/2 were red against a stale reference (commit no longer in branch history after PR squash-merges). Triage confirmed the divergence is path-only — correct outcomes, and the current binary's node counts match the validated level-1 oracle; the reference simply predated the KI-23 / Stage-6 multiplicity suit-symmetry changes. So this re-baselines the reference rather than chasing a non-bug. Repoints the CMake TRACE_REF_BIN default (mac + linux) to dated names built from this binary. The mac reference (solvitaire-trace-reference-mac-arm64-20260603-7eb5883) is built and verified: trace_regression_level1 (36s) and level2 (184s) both pass. NOTE: existing cmake-build-trace dirs cache the old TRACE_REF_BIN — clear with `cmake -DTRACE_REF_BIN=… cmake-build-trace` or delete the build dir. Reference binaries live in (untracked) 05-Executables/reference and are distributed out-of-band. The Linux ARM64 reference (same dated name) still needs a container rebuild (`scripts/container-build.sh --extract-trace-binary`); see 05-Executables/reference/README.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…host verification Triage confirmed legitimate path-only drift, not a regression. Reference re-baselined (36162ea); trace_regression_level1/2 pass on mac. Linux arm64 binary built via container, verification on a Linux host still pending. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…Dockerfile) Apptainer can't consume the Dockerfile directly, so this mirrors it for hosts that have apptainer (HPC / shared boxes) rather than container/docker/podman. Builds the release + trace configs; includes build/run/extract commands in the header. container-build.sh does not drive apptainer — this is a manual companion. Build: apptainer build --fakeroot solvitaire.sif solvitaire.def Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…chmark runs) bench_multiplicity.sh requires GNU parallel as its worker engine, but neither the Dockerfile nor solvitaire.def installed it. Add it to both so the benchmark scripts can run inside the container (apptainer/docker), writing results to a bound host dir via --outdir. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

On many-core machines the container builds were doubly serial: `cmake --build` had no --parallel (source files compiled one at a time) and `-flto` (plain) ran LTRANS serially ("using serial compilation of N LTRANS jobs"). Add `--parallel` to the cmake --build invocations in Dockerfile + solvitaire.def, and switch CMAKE_CXX_FLAGS_RELEASE to -flto=auto so link-time optimisation uses all cores. Build-only change: behaviour, node counts/oracles and traces are unaffected (verified a release build + klondike seed 1 = 158295 states, unchanged). Caveat: parallel compile + parallel LTO spike build-time RAM; cap with `--parallel N` on small nodes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Documents Option C — building solvitaire.def into a SIF and running bench_multiplicity.sh in-container with --outdir to a host-bound dir, plus the apptainer/SLURM specifics: cgroup-aware sizing detects the allocation limit, container-build.sh doesn't drive apptainer, the phase-D LRU per-worker memory caveat (KI-28), --fakeroot, and not running ctest in the read-only SIF. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ations black-hole is inherently suit-symmetric, so the right streamliner is suit-symmetry (not auto-foundations). Updated all 4 phase definitions (A/B/C/D) and the header comment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ld.sh

container-build.sh was OCI-only (build -t / run -m). Add:

- Runtime detection for apptainer/singularity (RUNTIME_KIND oci|apptainer); override via CONTAINER_RUNTIME. Apptainer builds the .sif from solvitaire.def and runs the existing flags via `apptainer exec`. OCI path preserved unchanged.
- --editable: build the image ONCE (reuse if present) and bind the live host repo at /workspace, building binaries into the host-bound cmake-build-* dirs. Script edits need no rebuild; C++ edits only an incremental compile. For dev/benchmarking, not production.
- Refactored run/build into build_image() + run_in() helpers branching on runtime.
- Fixed the stale Linux trace reference path to the KI-27 re-baseline name (…-20260603-7eb5883).

APPTAINER_BUILD_OPTS (default --fakeroot) for unprivileged builds.

remote-runs.md updated (the script now drives apptainer; --editable documented; caveat that editable shares host cmake-build-* dirs, so don't mix cross-arch builds in one checkout).

NOTE: syntax/--help verified on macOS; the apptainer and OCI-editable run paths are NOT exercised here (they trigger real builds), so verify on the Linux/apptainer box.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… benchmark-rationalisation

All conflicts resolved in favour of this branch:

- CMakeLists.txt: newer trace ref (-20260603-7eb5883) + -flto=auto + CPU-time comment
- container-build.sh: full Apptainer/editable rewrite supersedes dev's 2-line ref update
- level2/3 oracles (main + hash_only): CPU-time regen already incorporates KI-23 (binary had inherent_suit_symmetry); our versions are a superset

# Conflicts:
#   CMakeLists.txt
#   scripts/container-build.sh
#   tests/oracles/level2.json
#   tests/oracles/level2_hash_only.json
#   tests/oracles/level3.json
#   tests/oracles/level3_hash_only.json

turingfan and others added 30 commits May 29, 2026 13:07

docs(bench): resolve Stage 2 open questions (grace 1.5x, 3GB/worker, …

9ed4ddf

…GNU parallel) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs: fix CLAUDE.md streamliner token (smart -> smart-solvability)

7286ad6

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(bench): record decision to leave redux2 $SOLVER quirk (historica…

295687d

…l only) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(bench): PICKUP — mark T2/T3 done, update remaining Stage 2

dcc3c40

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(bench): T11 acceptance PASSED — bench_multiplicity clean, no avo…

cbbae9f

…idable kills Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(bench): PICKUP — record remote KILLED fix + CPU/wall proposal + …

98d426f

…build quirk Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chore: untrack __pycache__ bytecode and gitignore it

652490e

Sub-agents/runs left compiled .pyc tracked (incl. a stale one from the removed compare_benchmarks.py). Untrack all and ignore Python bytecode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

turingfan and others added 10 commits June 2, 2026 13:47

turingfan wants to merge 40 commits into dev from benchmark-rationalisation
