S5: sweep runner + overlapping-variance stats + committed 30-seed results (milestone-12a) #17
Merged
Adversarial review: APPROVE ✅
A fresh reviewer verified the changes by execution in the worktree. Recommend merge.
s5-runner (milestone-12a) — sweep + overlapping-variance stats + committed results
The reproducibility + statistics layer.
benchmarks/run.py sweeps 12 problems × 3 libraries × 30 seeds, applies the metrics, aggregates mean ± std and success_rate, computes the overlapping-variance equivalence verdict, and writes a self-describing JSON artifact.
benchmarks/stats.py is the pure-numpy equivalence test: |mean_cf − mean_lib| < max(std_cf, std_lib), with the margin reported (no t-test, no scipy).
Files
- benchmarks/run.py (NEW): the sweep + JSON writer (not pytest-collected).
- benchmarks/stats.py (NEW): overlapping-variance equivalence.
- benchmarks/results/benchmark_results.json (NEW, committed): the 30-seed artifact (1.0 MiB), so every reported number traces to a file. It embeds the 30 seeds, pinned versions (ctrl_freak 0.1.0, pymoo 0.6.1.6, deap 1.4.3, numpy 2.4.1), budgets, the statistic definition, raw per-seed results, aggregates, equivalence verdicts and margins, and compact convergence curves plus MO final fronts (for s6 figures).
- tests/benchmarks/test_stats.py (NEW): synthetic known-overlap / known-non-overlap cases.
- tests/benchmarks/test_doctests.py (NEW): consolidated benchmark doctest gate (runtime glob of benchmarks/*.py + importorskip; auto-covers s6's render/reproduce later).
Headline result (30 seeds, overlapping variance, ctrl-freak vs each baseline)
ctrl-freak is statistically indistinguishable from pymoo and DEAP on 10/12 problems and at-least-as-good on the 2 hardest — the citation shield.
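For readers who want to see the statistic behind this verdict, here is a minimal sketch of the overlapping-variance check as stated in this PR: |mean_cf − mean_lib| < max(std_cf, std_lib). It is not the code in benchmarks/stats.py. The names `overlapping_variance` and `Equivalence` are hypothetical, the margin is assumed to be max(std) minus the absolute mean difference, and the population standard deviation (numpy's default, ddof=0) is also an assumption.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class Equivalence:
    """Result of the check (hypothetical container, not the PR's type)."""

    equivalent: bool
    margin: float  # assumed definition: max(std) - |mean difference|


def overlapping_variance(cf, lib) -> Equivalence:
    """Compare two per-seed samples using the rule quoted in the PR.

    The samples are judged equivalent when the absolute difference of their
    means is strictly smaller than the larger of their standard deviations.
    """
    cf = np.asarray(cf, dtype=float)
    lib = np.asarray(lib, dtype=float)
    gap = abs(cf.mean() - lib.mean())
    # Assumption: population std (ddof=0); the PR does not state which is used.
    band = max(cf.std(), lib.std())
    return Equivalence(equivalent=bool(gap < band), margin=float(band - gap))


if __name__ == "__main__":
    # Synthetic known-overlap and known-non-overlap samples.
    print(overlapping_variance([1.0, 2.0, 3.0], [1.5, 2.5, 3.5]))
    print(overlapping_variance([1.0, 2.0, 3.0], [11.0, 12.0, 13.0]))
```

Under this assumed margin definition, a positive margin means the verdict is "equivalent" and its size shows how far the means are from the threshold.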
Acceptance (verified)
- pytest test_stats.py test_doctests.py --no-cov: 18 passed (gate discovers all 9 benchmark modules).
- --doctest-modules run.py stats.py --no-cov: 8 passed + 3 +SKIP (full-sweep doctests).
- ruff check, ty check src/: clean.
- full uv run pytest: 523 passed @ 98.89%.
- uv run python benchmarks/run.py (~20 min; fixed seeds → deterministic numbers).
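The doctest gate mentioned above discovers benchmark modules at runtime, so new modules are covered without editing the test. The sketch below shows that idea using only the standard library. It is not the contents of tests/benchmarks/test_doctests.py: the helper names `discover_modules` and `run_doctests` are hypothetical, and returning `None` on an `ImportError` is a stand-in for the skip behavior that `pytest.importorskip` provides in the real gate.

```python
import doctest
import importlib.util
from pathlib import Path


def discover_modules(root):
    """Return every *.py file directly under root, sorted for stable order."""
    return sorted(p for p in Path(root).glob("*.py") if p.name != "__init__.py")


def run_doctests(path):
    """Import one module from its file path and run its doctests.

    Returns None when the module cannot be imported because a dependency is
    missing (stand-in for a pytest skip); otherwise returns the
    doctest.TestResults tuple (failed, attempted).
    """
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
    except ImportError:
        return None
    return doctest.testmod(module, optionflags=doctest.ELLIPSIS)


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for module_path in discover_modules("benchmarks"):
        print(module_path.name, run_doctests(module_path))
```

Because the module list comes from a glob rather than a hard-coded list, a file added to the directory later is picked up on the next run.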