S5: sweep runner + overlapping-variance stats + committed 30-seed results (milestone-12a) #17

Merged
CooperBigFoot merged 2 commits into main from milestone-12a/s5-runner
Jun 26, 2026

@CooperBigFoot

s5-runner (milestone-12a) — sweep + overlapping-variance stats + committed results

The reproducibility and statistics layer. benchmarks/run.py sweeps 12 problems × 3 libraries × 30 seeds, applies the metrics, aggregates mean ± std and success_rate, computes the overlapping-variance equivalence verdict, and writes a self-describing JSON artifact. benchmarks/stats.py is the pure-numpy equivalence test: ctrl-freak and a baseline are equivalent on a metric when |mean_cf − mean_lib| < max(std_cf, std_lib), and the margin is reported with the verdict (no t-test, no scipy).
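A minimal sketch of the overlapping-variance rule described above, for readers who want to check a verdict by hand. The function name `overlapping_variance`, the returned dict shape, the population standard deviation (`ddof=0`), and the definition of the margin as `max(std) − |Δmean|` are assumptions made for illustration; the PR text only fixes the strict inequality and the fact that a margin is reported.

```python
import numpy as np


def overlapping_variance(cf, lib, ddof=0):
    """Hypothetical helper: strict overlapping-variance equivalence check.

    cf, lib: per-seed metric values for ctrl-freak and one baseline.
    ddof=0 (population std) is an assumption, not taken from stats.py.
    """
    cf = np.asarray(cf, dtype=float)
    lib = np.asarray(lib, dtype=float)

    diff = abs(cf.mean() - lib.mean())
    tol = max(cf.std(ddof=ddof), lib.std(ddof=ddof))

    # Strict "<": two zero-variance samples are never equivalent,
    # even when their means coincide.
    return {"equivalent": bool(diff < tol), "margin": float(tol - diff)}


overlap = overlapping_variance([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])
far_apart = overlapping_variance([0.0, 0.1, 0.2], [10.0, 10.1, 10.2])
degenerate = overlapping_variance([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
```

Under these assumptions a positive margin means the means sit inside the larger spread, and a margin of exactly zero is reported as not equivalent because the comparison is strict.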

Files

  • benchmarks/run.py (NEW) — the sweep + JSON writer (not pytest-collected).
  • benchmarks/stats.py (NEW) — overlapping-variance equivalence.
  • benchmarks/results/benchmark_results.json (NEW, committed) — the 30-seed artifact (1.0 MiB) so every reported number traces to a file. Embeds the 30 seeds, pinned versions (ctrl_freak 0.1.0, pymoo 0.6.1.6, deap 1.4.3, numpy 2.4.1), budgets, the statistic definition, raw per-seed, aggregates, equivalence verdicts+margins, and compact convergence curves + MO final fronts (for s6 figures).
  • tests/benchmarks/test_stats.py (NEW) — synthetic known-overlap / known-non-overlap.
  • tests/benchmarks/test_doctests.py (NEW) — consolidated benchmark doctest gate (runtime glob of benchmarks/*.py + importorskip; auto-covers s6's render/reproduce later).
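The doctest gate idea (discover modules at runtime, skip what cannot be imported, fail on any doctest failure) can be sketched with the standard library alone. This is not the content of tests/benchmarks/test_doctests.py: the function name `run_doctest_gate` is hypothetical, and the `except ImportError` branch is only a stand-in for what the PR describes as importorskip.

```python
import doctest
import importlib.util
from pathlib import Path


def run_doctest_gate(directory):
    """Hypothetical sketch: run doctests for every *.py directly under directory.

    Returns {module_name: number_of_failed_examples}. Modules whose imports
    are unavailable are skipped rather than counted as failures.
    """
    results = {}
    for path in sorted(Path(directory).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except ImportError:
            # Stand-in for a skip when an optional dependency is missing.
            continue
        failed, _attempted = doctest.testmod(module, verbose=False)
        results[path.stem] = failed
    return results
```

Because discovery happens at run time, a module added to the directory later is picked up without editing the gate, which is the property the description relies on for the s6 modules.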

Headline result (30 seeds, overlapping variance, ctrl-freak vs each baseline)

  • SO: 6/6 problems EQUIVALENT on both error metrics (|f−f*|, ‖x−x*‖) vs pymoo AND deap. success_rate is 0 for every problem and library at the committed strict ε and budget. It is reported as-is: no library reaches the strict threshold within the budget, so this is a non-convergence property rather than a failure, and SO parity is carried by the error metrics.
  • MO: 4/6 EQUIVALENT (ZDT1/ZDT2/ZDT3/DTLZ2 on IGD+/GD/HV vs both baselines). ZDT4 and ZDT6 are two documented exceptions, both in ctrl-freak's favour: none of the three libraries converges at this budget, and ctrl-freak's IGD+/GD is lower (better) than both baselines. ZDT4 HV is degenerate (0/0/0).

ctrl-freak is statistically indistinguishable from pymoo and DEAP on 10/12 problems and at-least-as-good on the 2 hardest — the citation shield.

Acceptance (verified)

  • pytest test_stats.py test_doctests.py --no-cov: 18 passed (gate discovers all 9 benchmark modules). --doctest-modules run.py stats.py --no-cov: 8 passed + 3 +SKIP (full-sweep doctests).
  • ruff check, ty check src/: clean. full uv run pytest: 523 passed @ 98.89%.
  • 30-seed sweep reproducible via uv run python benchmarks/run.py (~20 min; fixed seeds → deterministic numbers).

@CooperBigFoot

Adversarial review — APPROVE ✅

Fresh reviewer verified by execution in the worktree:

  • stats.py (N3): |mean_cf−mean_lib| < max(std_cf,std_lib), strict-<, reports margin; pure numpy. test_stats covers overlap/non-overlap/boundary/degenerate.
  • JSON integrity: n_seeds=30, pinned versions, budgets correct; 540 SO runs all 20,100 evals + 540 MO all 25,100; recomputed 8 verdict cells through stats.py — internally consistent; success_rate=0 everywhere (faithful). ZDT4/ZDT6 'not equivalent' in ctrl-freak's favour (e.g. zdt4 IGD+ cf 8.73 < pymoo 15.89/deap 14.84). Headline: SO 6/6 equivalent (both error metrics, both baselines); MO 4/6 (zdt1/2/3/dtlz2).
  • Reproducibility: build_results([0,1]) reproduces committed seed-0/1 raw bit-for-bit (0 mismatches).
  • Doctest gate: runtime rglob discovers all 9 benchmark modules + auto-covers s6's render/reproduce.
  • Acceptance: 18 + 8(+3 skip) doctests, ruff, ty src/, full pytest 523 @ 98.89%; green even under the numpy 2.0 floor.
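For reference, a bit-for-bit comparison of two nested result structures could look like the sketch below. How the reviewer actually compared the build_results output with the committed JSON is not shown in this thread; `float_bits` and `count_mismatches` are hypothetical names, and the choice to compare floats by their IEEE-754 bytes (so NaN matches NaN, and 0.0 differs from -0.0) is an assumption about what "bit-for-bit" means here.

```python
import struct


def float_bits(x):
    """Return the 8 raw IEEE-754 bytes of a Python float."""
    return struct.pack("<d", x)


def count_mismatches(a, b):
    """Hypothetical helper: count leaf values that differ between two
    JSON-like structures (dicts, lists, numbers, strings)."""
    if isinstance(a, dict) and isinstance(b, dict):
        if a.keys() != b.keys():
            return 1
        return sum(count_mismatches(a[k], b[k]) for k in a)
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        if len(a) != len(b):
            return 1
        return sum(count_mismatches(x, y) for x, y in zip(a, b))
    if isinstance(a, float) and isinstance(b, float):
        # Byte comparison: stricter than ==, no tolerance applied.
        return int(float_bits(a) != float_bits(b))
    return int(a != b)


same = count_mismatches({"igd_plus": [0.1, 0.2]}, {"igd_plus": [0.1, 0.2]})
rounding = count_mismatches([0.1 + 0.2], [0.3])
```

A zero count from a check of this kind is a stronger statement than agreement within a tolerance, which is why it supports the fixed-seed determinism claim.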

Recommend merge.

@CooperBigFoot CooperBigFoot merged commit 4c80f19 into main Jun 26, 2026
4 checks passed
@CooperBigFoot CooperBigFoot deleted the milestone-12a/s5-runner branch June 26, 2026 21:15