S5: sweep runner + overlapping-variance stats + committed 30-seed results (milestone-12a) by CooperBigFoot · Pull Request #17 · hydrosolutions/ctrl-freak

CooperBigFoot · 2026-06-26T13:03:57Z

s5-runner (milestone-12a) — sweep + overlapping-variance stats + committed results

This PR adds the reproducibility and statistics layer. benchmarks/run.py sweeps 12 problems × 3 libraries × 30 seeds, applies the metrics, aggregates mean ± std and success_rate, computes the overlapping-variance equivalence verdict, and writes a self-describing JSON artifact. benchmarks/stats.py is the pure-numpy equivalence test: ctrl-freak and a baseline library count as equivalent on a metric when |mean_cf − mean_lib| < max(std_cf, std_lib), and the margin is reported with the verdict. There is no t-test and no scipy dependency.
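As a rough illustration of the overlapping-variance rule, the sketch below applies the stated inequality to two per-seed samples. The function name, the `ddof` parameter, and the margin definition (max std minus the absolute mean gap) are assumptions made for this example; the PR text only states the inequality and that a margin is reported, so benchmarks/stats.py may differ.

```python
import numpy as np


def overlapping_variance(cf_values, lib_values, ddof=0):
    """Return (equivalent, margin) for two samples of one metric.

    Equivalent when |mean_cf - mean_lib| < max(std_cf, std_lib), strict <.
    The margin (max std minus the mean gap) is an assumed definition.
    """
    cf = np.asarray(cf_values, dtype=float)
    lib = np.asarray(lib_values, dtype=float)
    gap = abs(cf.mean() - lib.mean())
    spread = max(cf.std(ddof=ddof), lib.std(ddof=ddof))
    return bool(gap < spread), float(spread - gap)


# Overlapping samples: the mean gap is much smaller than the spread.
print(overlapping_variance([1.0, 2.0, 3.0], [1.1, 2.1, 3.1]))
# Degenerate case: identical constants give gap == spread == 0,
# so the strict inequality is not satisfied.
print(overlapping_variance([0.0, 0.0], [0.0, 0.0]))
```

A positive margin means the verdict is "equivalent" and a non-positive one means it is not, so under this assumed definition the margin says how far a cell sits from the decision boundary.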

Files

benchmarks/run.py (NEW): the sweep and JSON writer (not pytest-collected).
benchmarks/stats.py (NEW): overlapping-variance equivalence.
benchmarks/results/benchmark_results.json (NEW, committed): the 30-seed artifact (1.0 MiB), committed so that every reported number traces to a file. It embeds the 30 seeds, pinned versions (ctrl_freak 0.1.0, pymoo 0.6.1.6, deap 1.4.3, numpy 2.4.1), budgets, the statistic definition, raw per-seed values, aggregates, equivalence verdicts with margins, and compact convergence curves plus MO final fronts (for the s6 figures).
tests/benchmarks/test_stats.py (NEW): synthetic known-overlap and known-non-overlap cases.
tests/benchmarks/test_doctests.py (NEW): consolidated benchmark doctest gate (runtime glob of benchmarks/*.py plus importorskip; it will automatically cover s6's render/reproduce once they are added).
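The doctest gate idea (discover modules at run time, skip those whose dependencies are missing) can be sketched with the standard library alone. This is not the contents of tests/benchmarks/test_doctests.py: the function name and return shape are hypothetical, and a plain `except ImportError` stands in for `pytest.importorskip` so that the example runs without pytest.

```python
import doctest
import importlib.util
from pathlib import Path


def run_doctest_gate(package_dir):
    """Run doctests for every *.py file found under package_dir at call time.

    Returns {module_name: (failed, attempted)}. A module whose imports are
    missing maps to None (skipped), loosely mimicking pytest.importorskip.
    """
    results = {}
    for path in sorted(Path(package_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except ImportError:
            results[path.stem] = None  # optional dependency missing: skip
            continue
        outcome = doctest.testmod(module, verbose=False)
        results[path.stem] = (outcome.failed, outcome.attempted)
    return results
```

Because the file list is built when the gate runs, a module added to the directory later is picked up without editing the test file, which is the property the PR relies on for s6.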

Headline result (30 seeds, overlapping variance, ctrl-freak vs each baseline)

SO: 6/6 problems EQUIVALENT on both error metrics (|f−f*|, ‖x−x*‖) vs pymoo AND deap. success_rate is 0 for every problem and library at the committed strict ε and budget. This is reported faithfully: it reflects non-convergence under a strict threshold, not a failure, and SO parity is carried by the error metrics.
MO: 4/6 EQUIVALENT (ZDT1/ZDT2/ZDT3/DTLZ2 on IGD+/GD/HV vs both baselines). ZDT4 and ZDT6 are two documented exceptions, both in our favour: none of the three libraries converges at this budget, and ctrl-freak's IGD+/GD is lower (better) than both baselines. On ZDT4, HV is degenerate (0/0/0).
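To see how success_rate can be 0 while the error metrics remain comparable, here is a minimal aggregation sketch. The success definition (final error at or below ε), the function name, and the sample values are all assumptions for illustration; they are not taken from benchmarks/run.py or from the committed results.

```python
import numpy as np


def aggregate(errors, eps):
    """Summarise per-seed final errors |f - f*| for one problem/library cell."""
    errors = np.asarray(errors, dtype=float)
    return {
        "mean": float(errors.mean()),
        "std": float(errors.std()),
        # Fraction of seeds whose final error is within the threshold.
        "success_rate": float(np.mean(errors <= eps)),
    }


# Hypothetical per-seed errors: small, but above a strict threshold.
cell = aggregate([3e-4, 5e-4, 4e-4], eps=1e-8)
print(cell["success_rate"])  # 0.0
```

With a strict threshold every seed can miss ε even though mean and std are small and similar across libraries, which is why the equivalence verdict is computed on the error metrics rather than on success_rate.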

Under the overlapping-variance criterion, ctrl-freak is indistinguishable from pymoo and DEAP on 10/12 problems and at least as good on the 2 hardest. This is the citation shield.

Acceptance (verified)

pytest test_stats.py test_doctests.py --no-cov: 18 passed (the gate discovers all 9 benchmark modules). pytest --doctest-modules run.py stats.py --no-cov: 8 passed plus 3 +SKIP (the full-sweep doctests).
ruff check and ty check src/: clean. Full uv run pytest: 523 passed at 98.89% coverage.
30-seed sweep reproducible via uv run python benchmarks/run.py (~20 min; fixed seeds → deterministic numbers).
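The "fixed seeds give deterministic numbers" claim rests on each run owning a seeded random generator. The sketch below uses a toy stand-in for an optimiser run; `toy_run`, the sphere objective, and the seed list `range(30)` are assumptions for the example and do not reflect the actual seeds or problems in the sweep.

```python
import numpy as np


def toy_run(seed, n_evals=100):
    """Stand-in for one optimiser run: best sphere value over random samples."""
    rng = np.random.default_rng(seed)  # all randomness flows from the seed
    samples = rng.uniform(-5.0, 5.0, size=(n_evals, 2))
    return float(np.min(np.sum(samples**2, axis=1)))


seeds = list(range(30))
first = [toy_run(s) for s in seeds]
second = [toy_run(s) for s in seeds]
assert first == second  # same seeds, same numbers
```

Passing the seed into each run, rather than seeding a global generator once, keeps a single run reproducible regardless of how many other runs came before it.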

CooperBigFoot · 2026-06-26T13:11:08Z

Adversarial review — APPROVE ✅

Fresh reviewer verified by execution in the worktree:

stats.py (N3): |mean_cf−mean_lib| < max(std_cf,std_lib), strict-<, reports margin; pure numpy. test_stats covers overlap/non-overlap/boundary/degenerate.
JSON integrity: n_seeds=30, pinned versions and budgets correct; all 540 SO runs used 20,100 evals and all 540 MO runs used 25,100; 8 verdict cells recomputed through stats.py are internally consistent; success_rate=0 everywhere (faithful). ZDT4/ZDT6 are 'not equivalent' in ctrl-freak's favour (e.g. zdt4 IGD+: cf 8.73 < pymoo 15.89 / deap 14.84). Headline: SO 6/6 equivalent (both error metrics, both baselines); MO 4/6 (zdt1/2/3/dtlz2).
Reproducibility: build_results([0,1]) reproduces committed seed-0/1 raw bit-for-bit (0 mismatches).
Doctest gate: runtime rglob discovers all 9 benchmark modules + auto-covers s6's render/reproduce.
Acceptance: 18 tests + 8 (+3 skip) doctests, ruff, ty src/, full pytest 523 @ 98.89%; green even under the numpy 2.0 floor.
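The integrity checks above amount to exact comparisons, which can be sketched as follows. The key format and the values are hypothetical, and `count_mismatches` is an illustrative helper, not something from the repository; the JSON round trip stands in for reading the committed artifact.

```python
import json


def count_mismatches(committed, recomputed):
    """Count keys whose values differ exactly (no tolerance)."""
    keys = committed.keys() | recomputed.keys()
    return sum(committed.get(k) != recomputed.get(k) for k in keys)


# Hypothetical recomputed raw values, keyed by problem/library/seed.
recomputed = {
    "sphere/ctrl_freak/seed0": 0.1 + 0.2,
    "sphere/ctrl_freak/seed1": 1e-17,
}
# Round-trip through JSON text, as a committed artifact would be stored.
committed = json.loads(json.dumps(recomputed))
print(count_mismatches(committed, recomputed))  # 0

# Run-count arithmetic: 6 problems x 3 libraries x 30 seeds per track.
n_runs = 6 * 3 * 30
print(n_runs)  # 540
```

Python's json module writes finite floats using their shortest round-tripping representation, so an exact `!=` comparison is a fair bit-for-bit check for such values.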

Recommend merge.

CooperBigFoot added 2 commits June 26, 2026 14:38

S5: sweep runner + overlapping-variance stats + consolidated doctest gate (milestone-12a)

d6a7ca0

S5: commit 30-seed sweep results artifact (benchmark_results.json)

15de0e1

CooperBigFoot merged commit 4c80f19 into main Jun 26, 2026
4 checks passed

CooperBigFoot deleted the milestone-12a/s5-runner branch June 26, 2026 21:15
