
benchmark-rationalisation: tooling cleanup, kill discipline, CPU-time timeout, containers #9

Open
turingfan wants to merge 40 commits into dev from benchmark-rationalisation

Conversation

@turingfan

Owner

Summary

Rationalises and robustifies the benchmark tooling in the solver repo. Two stages of work across ~40 commits.

Stage 1 — Cleanup

  • Removed one-off scripts (benchmark_level4.py, analyze_level4.py, deprecated compare_benchmarks.py)
  • Authoritative csv_schema.md contract (outcome vocabulary, solution_type values)
  • New START-HERE.md + script-inventory.md guides; superseded docs archived to KB
  • .gitignore: .claude/settings.local.json + __pycache__

Stage 2 — Kill discipline, safety, CPU-time timeout

  • Shared process-group kill discipline (scripts/bench_lib/process.py): solver and wrapper share a POSIX process group; on overrun SIGTERM→grace→SIGKILL; always captures partial JSON output. Fixes lost-work-on-kill.
  • --timeout is now CPU time (CLOCK_PROCESS_CPUTIME_ID, user+sys) + 10x wall safety-cap (--wall-cap-mult) + deterministic --max-states cutoff. Load-invariant; L2/L3 oracles regenerated under CPU semantics (--enforce-node-counts).
  • Solver handles SIGTERM gracefully: registers SIGTERM handler, flushes JSON with solution_type:"terminated" before exit.
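  • Bounded memory-aware concurrency via GNU parallel: --jobs is clamped to what fits in the binding memory limit (min of host RAM and cgroup memory.max), sized from per-cache-type worker reservations. This supersedes the earlier flat ~3 GB/worker, floor(RAM*0.80/3GB) cap, and parallel --memfree was dropped because it killed and requeued jobs. Replaces cpu_count() fork-bomb default.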
  • Per-chunk orchestrator timeout with process-group kill so wedged chunks cannot stall a run.
  • cache_state dead-field drop: removed unused LRU iterator from flat/multiplicity DFS frames (solver_node 48->32 B); ~3-4% RSS reduction on deep searches.
  • Apptainer/Singularity support: solvitaire.def + apptainer path in container-build.sh + --editable mode (bind live host repo, no rebuild on edits). GNU parallel added to both images; builds parallelised (--parallel, -flto=auto).
  • Kill diagnostics: [kill-diag] logging with signal, RSS, overrun ratio on every non-clean exit.
  • KI-27 resolved (mac): trace reference re-baselined to 20260603-7eb5883; trace_regression_level1/2 pass. Linux verification pending.
  • Remote-runs guide (docs/benchmarking/active/remote-runs.md): copy-pasteable build + benchmark workflow for a remote Linux box, including Apptainer Option C.

Acceptance (T11): bench_multiplicity.sh --phase D --seeds 1-5 --games klondike at both 3s and 300ms timeouts — zero KILLED/TERMINATED/FAILED; timeouts return clean TIMEOUT with full partial stats.

Open items (deferred, not blocking)

  • KI-26: non-cache DFS frontier dominates RAM on hard instances (algorithmic, out of scope)
  • KI-28: per-worker systemd-run MemoryMax so a single runaway dies cleanly (deferred)
  • KI-29: L4/L5 outcome-only regression untrustworthy with unsound both streamliner (methodology decision pending; L4/L5 oracles deliberately not regenerated)
  • Linux trace-ref verification still pending on a Linux host

Test plan

  • Gate 1 (release): unit_tests + regression_level1 x4 — all pass
  • Gate 2 (trace): unit_tests + trace_identity + trace_until_timeout + trace_mult_vs_flat x7 + trace_regression_level1 (150/150) + trace_regression_level2 (160/160) — all pass
  • Gate 3 (debug): unit_tests — pass
  • T11 acceptance: bench_multiplicity.sh clean run, zero avoidable kills

🤖 Generated with Claude Code

turingfan and others added 30 commits May 29, 2026 13:07
Delete scripts/benchmark_level4.py and scripts/analyze_level4.py.
These had hardcoded paths, an EXCLUDED_GAMES list from ad-hoc instructions,
an incompatible CSV schema (flat_*/hash_* columns), and a SIGKILL-on-timeout
bug that discarded all solver output, recording nodes: 0.

The redux and unwinnable experiment pairs are left intact pending orchestrator
review — see commit message body for analysis.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bring csv_schema.md up to full contract status: add explicit column
numbers, source fields, full outcome-vocabulary flow table (solver JSON
string → CSV value → bench-hook bucket), semantic distinction between
TIMEOUT / TERMINATED / KILLED, and a conformance-gaps section listing
five discrepancies found by code inspection (SIGTERM target, UNKNOWN
not counted by hook, smart-solvability vocabulary, time_us rounding,
cache_capacity blank sentinel).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keep local Claude Code settings out of the repo (the file should stay
deleted from tracking even though a local copy persists on disk), and
ignore transient agent worktrees. Prevents the file re-entering the
index as an unmerged conflict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stage 1 step 7: START-HERE.md — single entry point answering "which
script for which job" with copy-pasteable commands, a known-rough-edges
warning box (worker-count safety, SIGTERM caveat, compare_benchmarks.py
under review), and links to csv_schema.md and the rationalisation plan.

Stage 1 step 8: script-inventory.md — table of every surviving
benchmarking script with one-line job, inputs, outputs, and status
(core / experiment / stage2-eval / dedup-candidate / analysis /
out-of-scope).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Relocate docs/benchmarking/archive/ (9 superseded mac-dev-benchmark-enhancements
and legacy comparison/implementation docs) to
01-Knowledge-Base/Archive/benchmarking-legacy-docs/ per the repo rule that
completed/superseded docs live in the Knowledge-Base, not the solver repo.
README now points at START-HERE and the active doc set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PICKUP records Stage 1 completion, kept-pair decisions, and carried-forward
findings. Stage 2 plan breaks the work into a sequential kill-discipline/
concurrency spine plus parallel docs/decision tasks, with acceptance criteria
(clean bench_multiplicity.sh run), an agent-brief template, and open questions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…GNU parallel)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…analysis/*.R)

The script was already marked DEPRECATED at the top (pointing to design.md §7)
and design.md §10 listed it as deprecated. The R analysis layer (benchmark.R,
summary.R, compare_labels.R) fully covers its useful function — baseline vs
current comparison — with superior statistics (bootstrap CI, Wilcoxon test,
HTML scatter plot). The only unique capability (RHT hardware normalization) is
a legacy concept not used in the current CSV-based workflow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New scripts/bench_lib/process.py: run_with_deadline() spawns the solver in its
own session (start_new_session=True) so the solver AND any /usr/bin/time wrapper
share a process group. Solver --timeout is authoritative; Python deadline = 1.5x
(D1, no floor/cap); on overrun SIGTERM the group -> 30s grace -> SIGKILL the group
-> reap. Partial stdout is always captured; disposition distinguishes
EXITED_OK/EXITED_ERR/KILLED_AFTER_SIGTERM/KILLED_HARD. 13 unit tests (fake child
processes) pass.
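
The escalation described above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of scripts/bench_lib/process.py: the function name `run_with_deadline_sketch`, its parameters, and the exact mapping of outcomes onto the four disposition names are assumptions made for the example.

```python
import os
import signal
import subprocess
import sys


def _signal_group(pgid, sig):
    """Signal a whole process group, tolerating one that has already gone."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass


def run_with_deadline_sketch(cmd, deadline_s, grace_s):
    """Run cmd in its own session and escalate SIGTERM -> SIGKILL on overrun.

    Returns (disposition, stdout_bytes); stdout is kept even when killed.
    The disposition mapping here is a guess at the semantics, not the real one.
    """
    # start_new_session=True makes the child a session and process-group
    # leader, so its pid doubles as the group id for killpg().
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, start_new_session=True)
    try:
        out, _ = proc.communicate(timeout=deadline_s)
        return ("EXITED_OK" if proc.returncode == 0 else "EXITED_ERR"), out
    except subprocess.TimeoutExpired:
        pass

    # Overrun: ask the whole group to stop, then wait out the grace period.
    _signal_group(proc.pid, signal.SIGTERM)
    try:
        out, _ = proc.communicate(timeout=grace_s)
        return "KILLED_AFTER_SIGTERM", out
    except subprocess.TimeoutExpired:
        pass

    # Still alive after the grace period: hard-kill the group and reap it.
    _signal_group(proc.pid, signal.SIGKILL)
    out, _ = proc.communicate()
    return "KILLED_HARD", out


if __name__ == "__main__":
    child = [sys.executable, "-c",
             "print('partial', flush=True); import time; time.sleep(60)"]
    print(run_with_deadline_sketch(child, deadline_s=2.0, grace_s=5.0))
```

Calling `communicate()` again after a `TimeoutExpired` does not lose output already read, which is what lets the sketch return partial stdout on every path.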

run_benchmark.py: run_solver() delegates to the helper (fixes the orphaned-
grandchild bug); RSS falls back to solver_resident_bytes then 0, never fabricated;
solver "failed" maps to FAILED (no bare UNKNOWN for real outcomes).

csv_schema.md: reconciled with the above (FAILED value, 1.5x, RSS fallback,
process-group kill); conformance gaps #1 and #3 marked resolved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bench_level5_unwinnable2.sh removed — one-off scratch variant (all
standard run_variant calls commented out, only solvitaire-hash-only
invocation kept); adds no reusable capability over the base script.

tuesday-night-redux{,2}.sh kept: the redux2 variant has a genuinely unique
capability (runs all four solver variants with --label flags for
per-variant comparison) but has a $SOLVER env-override design quirk and
excludes alina; merging is not clean, so both are kept and each now
carries a one-line header explaining what distinguishes it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Inventory: add bench_lib/process.py (core lib), drop removed compare_benchmarks.py,
fold redux2 into experiments with note, record unwinnable2 removal, drop the
dedup-candidate section. PICKUP: Stage 2 progress, remaining tasks, the
worktree-stale-base process note, and the redux2 $SOLVER flaw flagged for Ian.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l only)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… GNU parallel (T2/T3)

T2 - benchmark_orchestrator.py:run_chunk() now calls bench_lib.process.run_with_deadline
with a hard ceiling (seeds x timeout_ms/1000 x 1.5 x 1.5, min 120s). On overrun the
entire process group is SIGTERM'd then SIGKILL'd; the chunk is recorded as failed; the
pool continues and prints a clear message.

T3 - multiprocessing.Pool replaced with GNU parallel (--jobs N --memfree 3G). Workers
default to min(cpu_count//2, floor(total_RAM x 80% / 3 GB)); warn but do not refuse if
--workers exceeds the safe ceiling. --full is now required to opt into GAME_CONFIGS_FULL
(14-game matrix); default scope is GAME_CONFIGS_QUICK. --dry-run shows plan without
needing binaries or parallel. bench_multiplicity.sh replaces xargs -P with parallel and
gains the same memory-aware worker default.
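
The two sizing rules in T2 and T3 are plain arithmetic, sketched below for reference. The function names are hypothetical, "3 GB" is read as 3 GiB, and the floor of one worker is an assumption; none of these are stated in the commit message.

```python
import math

GIB = 1024 ** 3


def chunk_ceiling_s(seeds, timeout_ms):
    """Hard ceiling for one chunk: seeds x timeout x 1.5 x 1.5, at least 120 s."""
    return max(120.0, seeds * (timeout_ms / 1000.0) * 1.5 * 1.5)


def default_workers(cpu_count, total_ram_bytes, per_worker_bytes=3 * GIB):
    """min(cpu_count // 2, floor(total RAM x 80% / per-worker budget))."""
    by_cpu = cpu_count // 2
    by_mem = math.floor(total_ram_bytes * 0.80 / per_worker_bytes)
    # Assumption: never return fewer than one worker.
    return max(1, min(by_cpu, by_mem))


if __name__ == "__main__":
    print(chunk_ceiling_s(seeds=100, timeout_ms=3000))
    print(default_workers(cpu_count=64, total_ram_bytes=32 * GIB))
```

With short timeouts the 120 s minimum dominates: five seeds at 3000 ms give 33.75 s by the formula, so the ceiling is 120 s.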

New files:
  scripts/bench_lib/concurrency.py (compute_jobs() pure function + get_total_ram_bytes)
  scripts/bench_lib/test_concurrency.py (19 unit tests; all pass)

Updated files:
  scripts/benchmark_orchestrator.py
  scripts/experiments/bench_multiplicity.sh
  docs/benchmarking/active/START-HERE.md
  docs/benchmarking/active/script-inventory.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…idable kills

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ential deps

Add docs/benchmarking/active/remote-runs.md (build + run on remote Linux, worker
safety, clean-timeout notes, bench archiving). setup_remote.sh now installs GNU
parallel and build-essential (apt) / parallel (brew) — they were missing and are
required since Stage 2. START-HERE links to the remote guide.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Diagnoses why wall-based timeouts are load-dependent; proposes CPU-time budget
(CLOCK_PROCESS_CPUTIME_ID) + solver wall safety-cap, optional --max-states for
reproducible cutoffs. Decision pending (regenerating oracles is the main cost).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…terminated not killed)

Root-caused the remote 'many KILLED' runs (small --timeout under parallel load):
- Solver only handled SIGINT, so the wrapper's SIGTERM killed it with no output
  -> KILLED. Now solver.cpp registers SIGTERM with the same handler (graceful:
  sets interrupt flag -> DFS returns TERMINATED -> JSON flushed). SIGINT (^C)
  behaviour unchanged. main.cpp emits solution_type "terminated" for TERMINATED.
- Wrapper grace was 1.5x timeout, too tight vs fixed per-run overhead (dominated
  by default-cache mmap, ~0 CPU). Now total_wait = timeout + max(0.5*timeout,
  min_grace_s=10s); a 1000ms timeout gets an 11s window. SIGTERM stays the signal,
  30s grace, then SIGKILL.
- run_benchmark.py maps "terminated" -> TERMINATED; csv_schema.md updated.
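
As a quick check of the new grace formula, here is a minimal sketch (the function name `total_wait_s` is made up for the example) that reproduces the 11 s window quoted above for a 1000 ms timeout.

```python
def total_wait_s(timeout_s, min_grace_s=10.0):
    """Wrapper wait before SIGTERM: timeout + max(0.5 x timeout, min grace)."""
    return timeout_s + max(0.5 * timeout_s, min_grace_s)


if __name__ == "__main__":
    for timeout_s in (0.3, 1.0, 3.0, 300.0):
        print(timeout_s, total_wait_s(timeout_s))
```

For timeouts under 20 s the fixed 10 s grace dominates; above that the window is 1.5x the timeout, as it was before.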

Verified: unit_tests pass; regression_level1 (x4 variants) pass; 33 bench_lib
tests pass; direct SIGTERM -> exit 0 + "terminated" JSON with stats; phase-C
smoke at --timeout 1000 -> 14 SOLVED / 6 TIMEOUT / 0 KILLED.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…build quirk

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No logic bug found in build.sh (it always builds the 'solvitaire' target and
incremental rebuilds work — verified). The real failure mode is a silently STALE
binary when incremental detection is fooled (e.g. sources copied between trees,
which is how the orchestrator integrates sub-agent work). Two safeguards:
- --clean: wipe cmake-build-<cfg>/ for a guaranteed-fresh build.
- end-of-build summary lists each binary with size + mtime, so a stale/older
  binary is visible at a glance (portable GNU/BSD stat).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sub-agents/runs left compiled .pyc tracked (incl. a stale one from the removed
compare_benchmarks.py). Untrack all and ignore Python bytecode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When a run ends non-clean, run_solver now emits a [kill-diag] line to stderr
identifying the cause: exit signal (-9 SIGKILL/OOM, -6 SIGABRT/bad_alloc,
-15 unhandled SIGTERM), whether the wrapper or something external killed it,
wall time, stdout bytes, and the solver's stderr tail. Script-only change — no
rebuild needed; usable immediately on the remote box to diagnose the KILLED runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…run ratio, RSS

Earlier version mislabelled KILLED_HARD+SIGTERM as 'stale binary' — wrong (the
solver's SIGTERM handler works; that combo is a swap-delayed shutdown). Now keyed
on disposition first, and reports the wall/timeout overrun ratio + peak RSS, which
are the real tells of memory-pressure/swap (solver running far past its deadline).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…xternal kills

/usr/bin/time and bash surface a signalled child as 128+signum, not a negative rc, so
137 (SIGKILL) was printed as a bare 'exit code 137'. Now decode it, and compare
wall time to the wrapper's own deadline: a SIGKILL BEFORE that deadline is external
(OOM killer or 'parallel --memfree' kill+requeue), not the wrapper.
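
A minimal sketch of the decoding rule follows. Both helper names are hypothetical, and treating every code above 128 as a signal is a heuristic, since a program can also exit with such a code deliberately.

```python
import signal


def decode_signal(rc):
    """Return the signal name implied by an exit status, or None.

    Python's subprocess reports a signalled child as -signum; a shell or
    /usr/bin/time wrapper reports it as 128 + signum.
    """
    if rc < 0:
        signum = -rc
    elif rc > 128:
        signum = rc - 128
    else:
        return None
    try:
        return signal.Signals(signum).name
    except ValueError:
        return None


def kill_origin(rc, wall_s, wrapper_deadline_s):
    """Classify a SIGKILL as external if it landed before the wrapper's deadline."""
    if decode_signal(rc) != "SIGKILL":
        return None
    return "external" if wall_s < wrapper_deadline_s else "wrapper"


if __name__ == "__main__":
    print(decode_signal(137), kill_origin(137, wall_s=2.0, wrapper_deadline_s=11.0))
```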

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…EMFREE

--memfree does NOT just pause: it KILLS and requeues the youngest job when free memory
drops below 50% of the given size, producing exit-137-mid-run + reruns (a prime suspect
for the remote KILLs). Default it OFF; bound concurrency via an accurate --jobs
instead. Set BENCH_MEMFREE=3G to A/B test the old behaviour. Fixes the misleading
'just pauses' comment + banner.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lvers

The solver runs in its own session (start_new_session, for timeout killpg), so an
external ^C never reached it — when parallel/python died the solver orphaned and
persisted. Fixes:
- run_with_deadline: a BaseException handler SIGKILLs the child's process group if
  the wait is interrupted, so an interrupted run never leaves a solver behind.
- run_benchmark.py: SIGTERM is turned into KeyboardInterrupt so a parent's kill
  triggers the same teardown as ^C.
- bench_multiplicity.sh: trap INT/TERM to signal the whole process group (stops
  parallel launching more, propagates to workers).
Verified: SIGINT to a running phase-C batch (6 live solver procs) leaves 0
survivors; bench_lib unit tests pass.
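
The first two fixes can be sketched together. This is an illustration under assumptions, not the code in run_with_deadline or run_benchmark.py: the function names and the optional `timeout` parameter (there only so the teardown path can be exercised without a real interrupt) are invented for the example.

```python
import os
import signal
import subprocess
import sys


def _raise_keyboard_interrupt(signum, frame):
    raise KeyboardInterrupt


def install_sigterm_as_interrupt():
    """Make a parent's SIGTERM take the same teardown path as ^C."""
    signal.signal(signal.SIGTERM, _raise_keyboard_interrupt)


def wait_or_kill_group(proc, timeout=None):
    """Wait for proc; if the wait is interrupted, kill its process group first.

    proc must have been started with start_new_session=True so that its pid
    is also its process-group id.
    """
    try:
        return proc.wait(timeout=timeout)
    except BaseException:
        # Covers KeyboardInterrupt and SystemExit as well as ordinary errors.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()  # reap, so no zombie is left behind
        raise


if __name__ == "__main__":
    install_sigterm_as_interrupt()
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(0.2)"],
        start_new_session=True,
    )
    print("child exit code:", wait_or_kill_group(child))
```

Catching BaseException instead of Exception is the point: KeyboardInterrupt does not derive from Exception, so a narrower handler would let ^C skip the teardown.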

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Confirmed root cause: cgroup memory.max on the user slice (64 GiB), not host RAM.
32 multiplicity workers (~6.4GB each = (capacity/2)*128B mmap) exceeded it -> cgroup
OOM-killed solvitaire; 16 rode the edge. The old worker math assumed a flat 3GB/
worker against host RAM, so it allowed far too many.

concurrency.py:
- effective_memory_limit(): reads the binding limit = min(host RAM, cgroup
  memory.max), walking the cgroup v2 tree (finds user.slice limits) + v1 fallback.
- per_worker_bytes(cache_type, capacity): exact flat-family reservation
  (capacity/2 * cluster_bytes: flat 64, hash-only 16, predecessor/multiplicity 128)
  + overhead; LRU measured ~320 B/entry (heap-grown, workload-bounded).
- planning_worker_bytes(): a mixed run is sized by the flat-family reservation
  (always-resident, the OOM culprit), not LRU's worst case.
- compute_jobs(requested, limit, worker_bytes): clamp to memory-safe max + warn.

bench_multiplicity.sh + benchmark_orchestrator.py: use the above (cache types from
the selected phases/solvers); print the detected limit + per-worker + max_safe; and
DROP parallel --memfree (it killed+requeued jobs) now that --jobs is accurate.

On a 64 GiB slice, phase C now auto-sizes to ~7 multiplicity workers and clamps
larger --workers with an OOM warning. 29 bench_lib tests pass; dry-run + a real
phase-C run verified.
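
A simplified sketch of the sizing logic follows. The function names carry a `_sketch` suffix because the bodies are guesses at the shape of concurrency.py, not copies of it. The cgroup tree walk is reduced to a list of already-read limits, and the overhead and headroom values, which the text does not give, are left as a parameter or omitted.

```python
import sys

GIB = 1024 ** 3

# Bytes per cache cluster, taken from the commit message above.
CLUSTER_BYTES = {"flat": 64, "hash-only": 16, "predecessor": 128, "multiplicity": 128}


def parse_memory_max(text):
    """Parse a cgroup v2 memory.max value: a byte count, or 'max' for no limit."""
    text = text.strip()
    return None if text == "max" else int(text)


def effective_limit_sketch(host_ram_bytes, cgroup_limits):
    """Binding limit = min(host RAM, every finite cgroup limit on the path)."""
    finite = [limit for limit in cgroup_limits if limit is not None]
    return min([host_ram_bytes] + finite)


def per_worker_bytes_sketch(cache_type, capacity, overhead_bytes=0):
    """Flat-family reservation: capacity/2 clusters of a fixed size, plus overhead."""
    return (capacity // 2) * CLUSTER_BYTES[cache_type] + overhead_bytes


def compute_jobs_sketch(requested, limit_bytes, worker_bytes):
    """Clamp the requested job count to what fits in memory, warning on a clamp."""
    max_safe = max(1, limit_bytes // worker_bytes)
    if requested > max_safe:
        print(f"warning: {requested} workers exceed memory-safe max {max_safe}",
              file=sys.stderr)
    return min(requested, max_safe)


if __name__ == "__main__":
    limit = effective_limit_sketch(256 * GIB, [parse_memory_max("max"), 64 * GIB])
    # capacity=100_000_000 is an illustrative value that yields 6.4 GB/worker.
    worker = per_worker_bytes_sketch("multiplicity", 100_000_000)
    print(limit, worker, compute_jobs_sketch(32, limit, worker))
```

With no overhead or headroom modelled, this arithmetic allows 10 workers on a 64 GiB limit, more than the ~7 reported above, so the real code evidently reserves additional margin.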

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
solver_node carried boost::optional<lru_cache::item_list::iterator> (16 B)
on every DFS frontier frame, but it is assigned only on the LRU path
(Policy::computes_hash == false; used for set_non_live live-bit tracking).
Move it into an empty-base cache_state_holder so flat-family nodes omit it
entirely: solver_node shrinks 48 -> 32 B for flat/multiplicity, unchanged
at 48 B for LRU. Guarded the one unconditional cache_state read in
revert_to_last_node_with_children with if constexpr.

Correctness-neutral (verified): release + debug unit_tests, regression_level1
(default/flat/hash-only/lru), trace_identity_flat/lru, trace_until_timeout,
and SearchTraceAgreement all pass. Measured ~3-4% peak-RSS reduction on deep
free-cell searches — modest, because per-frame memory is dominated by the
separately heap-allocated child_moves vector (see KI-26), not the node struct.

Docs: adds KI-26 (non-cache frontier memory = child_moves x runaway depth,
deferred algorithmic work) and KI-27 (trace_regression_level1/2 fail against a
stale reference binary, pre-existing and unrelated to this change).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make the search budget load-invariant. --timeout is now measured in CPU time
(user+system, CLOCK_PROCESS_CPUTIME_ID) instead of wall clock, so a run gets the
same compute regardless of scheduler contention — essential for fair benchmark
comparison under parallel load. Two companions:
  * --wall-cap-mult (default 10): a wall safety-cap (wall_cap_mult x timeout) so
    the solver always self-terminates even if badly descheduled.
  * --max-states (default 0=off): a deterministic, machine/load-independent cutoff
    for bit-for-bit reproducible comparisons.
Budget checks are batched every 4096 nodes so the (syscall) CPU-clock read does not
slow the DFS hot loop.
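
The three cutoffs and the batched clock read can be illustrated as below. The solver itself is C++, so this Python version is only a sketch of the idea: the `Budget` class, its stop-reason strings, and the choice to check the state cap on every state are assumptions. `time.process_time()` stands in for the CPU clock, since it reports user+system CPU time for the process.

```python
import time

CHECK_INTERVAL = 4096  # clocks are read once per this many states


class Budget:
    """CPU-time budget with a wall safety-cap and an optional state cap."""

    def __init__(self, cpu_timeout_s, wall_cap_mult=10, max_states=0):
        self.cpu_deadline = time.process_time() + cpu_timeout_s
        self.wall_deadline = time.monotonic() + wall_cap_mult * cpu_timeout_s
        self.max_states = max_states  # 0 means "off"
        self.states = 0

    def tick(self):
        """Count one state; return a stop reason, or None to keep searching."""
        self.states += 1
        # The state cap is a counter comparison, so it is cheap to check
        # every time and gives a cutoff independent of machine and load.
        if self.max_states and self.states >= self.max_states:
            return "max-states"
        # Skip the clock reads except on every CHECK_INTERVAL-th state.
        if self.states % CHECK_INTERVAL:
            return None
        if time.process_time() >= self.cpu_deadline:
            return "cpu-timeout"
        if time.monotonic() >= self.wall_deadline:
            return "wall-cap"
        return None


if __name__ == "__main__":
    budget = Budget(cpu_timeout_s=0.05)
    reason = None
    while reason is None:
        reason = budget.tick()
    print(reason, budget.states)
```

Batching trades precision for speed: a run can overshoot its CPU budget by up to one batch of states, which is negligible against millisecond-scale timeouts.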

Wrapper/orchestrator: run_benchmark.py's deadline + overrun diagnostics and the
orchestrator chunk-ceiling now key off the solver's WALL ceiling (wall_cap_mult x
timeout), not the CPU budget — otherwise a legitimately-descheduled run is killed.
csv_schema.md: timeout_ms is CPU ms; time_us remains wall.

Oracles: L2/L3 regenerated under the new semantics (all-definitive, exactly
reproducible; verified green with --enforce-node-counts, default+flat+hash-only+lru).
L1 left unchanged (timeout-insensitive). L4/L5 NOT regenerated — their outcome-only
comparison is untrustworthy with the unsound `both` streamliner (see KI-29).

Decisions (with Ian): CPU=user+sys; add --max-states; K=10; switch now + regenerate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
turingfan and others added 10 commits June 2, 2026 13:47
…igation

- KI-27: trace_regression_level1/2 red against a stale reference binary (commit not
  in HEAD history); pre-existing, unrelated to recent changes — needs re-baseline triage.
- KI-28: benchmark worker sizing cannot bound a single runaway worker (cache-based
  budget vs the unbounded child_moves frontier under a 64 GiB cgroup); proposes
  per-worker systemd-run MemoryMax.
- KI-29: L4/L5 outcome-only regression is untrustworthy with the unsound `both`
  streamliner (outcome can vary by search ordering); options to fix.
- PICKUP: record the OOM root-cause characterisation, cache_state fix, CPU-time
  timeout, oracle regen status, and the new open decisions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…linux pending)

trace_regression_level1/2 were red against a stale reference (commit no longer in branch
history after PR squash-merges). Triage confirmed the divergence is path-only — correct
outcomes, and the current binary's node counts match the validated level-1 oracle; the
reference simply predated the KI-23 / Stage-6 multiplicity suit-symmetry changes. So this
re-baselines the reference rather than chasing a non-bug.

Repoints the CMake TRACE_REF_BIN default (mac + linux) to dated names built from this
binary. The mac reference (solvitaire-trace-reference-mac-arm64-20260603-7eb5883) is built
and verified: trace_regression_level1 (36s) and level2 (184s) both pass. NOTE: existing
cmake-build-trace dirs cache the old TRACE_REF_BIN — clear with
`cmake -DTRACE_REF_BIN=… cmake-build-trace` or delete the build dir.

Reference binaries live in (untracked) 05-Executables/reference and are distributed
out-of-band. The Linux ARM64 reference (same dated name) still needs a container rebuild
(`scripts/container-build.sh --extract-trace-binary`); see 05-Executables/reference/README.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…host verification

Triage confirmed legitimate path-only drift, not a regression. Reference re-baselined
(36162ea); trace_regression_level1/2 pass on mac. Linux arm64 binary built via container,
verification on a Linux host still pending.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Dockerfile)

Apptainer can't consume the Dockerfile directly, so this mirrors it for hosts that
have apptainer (HPC / shared boxes) rather than container/docker/podman. Builds the
release + trace configs; includes build/run/extract commands in the header.
container-build.sh does not drive apptainer — this is a manual companion.

Build: apptainer build --fakeroot solvitaire.sif solvitaire.def

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…chmark runs)

bench_multiplicity.sh requires GNU parallel as its worker engine, but neither the
Dockerfile nor solvitaire.def installed it. Add it to both so the benchmark scripts
can run inside the container (apptainer/docker), writing results to a bound host dir
via --outdir.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
On many-core machines the container builds were doubly serial: `cmake --build` had no
--parallel (source files compiled one at a time) and `-flto` (plain) ran LTRANS serially
("using serial compilation of N LTRANS jobs"). Add `--parallel` to the cmake --build
invocations in Dockerfile + solvitaire.def, and switch CMAKE_CXX_FLAGS_RELEASE to
-flto=auto so link-time optimisation uses all cores.

Build-only change: behaviour, node counts/oracles and traces are unaffected (verified a
release build + klondike seed 1 = 158295 states, unchanged). Caveat: parallel compile +
parallel LTO spike build-time RAM; cap with `--parallel N` on small nodes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documents Option C — building solvitaire.def into a SIF and running
bench_multiplicity.sh in-container with --outdir to a host-bound dir, plus the
apptainer/SLURM specifics: cgroup-aware sizing detects the allocation limit,
container-build.sh doesn't drive apptainer, the phase-D LRU per-worker memory
caveat (KI-28), --fakeroot, and not running ctest in the read-only SIF.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ations

black-hole is inherently suit-symmetric, so the right streamliner is suit-symmetry
(not auto-foundations). Updated all 4 phase definitions (A/B/C/D) and the header comment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ld.sh

container-build.sh was OCI-only (build -t / run -m). Add:
- Runtime detection for apptainer/singularity (RUNTIME_KIND oci|apptainer); override via
  CONTAINER_RUNTIME. Apptainer builds the .sif from solvitaire.def and runs the existing
  flags via `apptainer exec`. OCI path preserved unchanged.
- --editable: build the image ONCE (reuse if present) and bind the live host repo at
  /workspace, building binaries into the host-bound cmake-build-* dirs. Script edits need
  no rebuild; C++ edits need only an incremental compile. For dev/benchmarking, not production.
- Refactored run/build into build_image() + run_in() helpers branching on runtime.
- Fixed the stale Linux trace reference path to the KI-27 re-baseline name
  (…-20260603-7eb5883). APPTAINER_BUILD_OPTS (default --fakeroot) for unprivileged builds.

remote-runs.md updated (the script now drives apptainer; --editable documented; caveat
that editable shares host cmake-build-* dirs — don't mix cross-arch builds in one checkout).

NOTE: syntax/--help verified on macOS; the apptainer and OCI-editable run paths are NOT
exercised here (they trigger real builds) — verify on the Linux/apptainer box.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… benchmark-rationalisation

All conflicts resolved in favour of this branch:
- CMakeLists.txt: newer trace ref (-20260603-7eb5883) + -flto=auto + CPU-time comment
- container-build.sh: full Apptainer/editable rewrite supersedes dev's 2-line ref update
- level2/3 oracles (main + hash_only): CPU-time regen already incorporates KI-23 (binary had inherent_suit_symmetry); our versions are a superset

# Conflicts:
#	CMakeLists.txt
#	scripts/container-build.sh
#	tests/oracles/level2.json
#	tests/oracles/level2_hash_only.json
#	tests/oracles/level3.json
#	tests/oracles/level3_hash_only.json