ci: move mutants-cli off lean-mem to rust-cpu (#523) #526
Draft
avrabe wants to merge 1 commit into
`mutants-cli` (`Mutation Testing (rivet-cli)`) was running on every PR and
push pinned to the 4-runner `lean-mem` pool — the one runner class with no
spare capacity. A 14-day audit of the self-hosted fleet showed it as the
single largest consumer of that pool (488 instances), and as a direct
consequence Miri (~17 h median wait, 43% fail rate) and Verus (~18 h
median wait, 94% fail rate) were starving against `cancel-in-progress`
PR-push churn while `rust-cpu` sat 86% idle.
The fix is a one-line runner-pool change: `rivet-cli` is the small crate
running `--jobs 2` with `--timeout 30`; the `rust-cpu` class (16 G
`MemoryHigh`, 7 runners) handles it without contention. Per-PR mutation
coverage is preserved, no cadence change is needed, and `lean-mem` is
freed up for the genuinely RAM-bound gating jobs (Miri, Verus) plus the
nightly `mutants-core` fan-out.
Also extends the surrounding comment block to document why this pool
choice matters so future drift doesn't quietly re-pin to `lean-mem`.
The post-merge bullet of the issue's Acceptance ("lean-mem median job
wait drops back under a few minutes") can only be confirmed by operator
observation against the runner pool after this lands; the in-repo bullet
("mutants-cli no longer runs on lean-mem") is the diff itself.
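The post-merge observation could be scripted roughly as follows. This is a minimal sketch, not the operator's actual tooling: it assumes job records have already been fetched as dictionaries carrying `labels`, `created_at`, and `started_at` fields, and it treats queue wait as the gap between those two timestamps. The `sample_jobs` records are hypothetical, not fleet data.

```python
from datetime import datetime
from statistics import median
from typing import Optional


def parse_ts(ts: str) -> datetime:
    # Timestamps are assumed to look like "2026-06-10T10:00:00Z"; swap the
    # trailing Z for an explicit UTC offset so fromisoformat accepts it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def median_wait_minutes(jobs: list, label: str) -> Optional[float]:
    """Median minutes between creation and start for jobs carrying `label`."""
    waits = [
        (parse_ts(j["started_at"]) - parse_ts(j["created_at"])).total_seconds() / 60
        for j in jobs
        if label in j.get("labels", [])
        and j.get("started_at")
        and j.get("created_at")
    ]
    return median(waits) if waits else None


# Hypothetical records for illustration only.
sample_jobs = [
    {"labels": ["self-hosted", "linux", "x64", "lean-mem"],
     "created_at": "2026-06-10T10:00:00Z", "started_at": "2026-06-10T10:04:00Z"},
    {"labels": ["self-hosted", "linux", "x64", "lean-mem"],
     "created_at": "2026-06-10T10:00:00Z", "started_at": "2026-06-10T10:08:00Z"},
    {"labels": ["self-hosted", "linux", "x64", "rust-cpu"],
     "created_at": "2026-06-10T10:00:00Z", "started_at": "2026-06-10T10:01:00Z"},
]

print(median_wait_minutes(sample_jobs, "lean-mem"))
print(median_wait_minutes(sample_jobs, "rust-cpu"))
```

Grouping by label rather than by job name keeps the check focused on pool contention, which is what the acceptance bullet is about.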
Note: the pulseengine.eu/blog/ workflow guidance was HTTP 503 throughout
this triage run (same symptom carried across #420 / #516 / #522 / …), so
this PR ships as a draft for maintainer review against the authoritative
process posts once the blog is reachable.
Refs: #523, #509
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.
| Benchmark suite | Current: 4cc7823 | Previous: e60a3a9 | Ratio |
|---|---|---|---|
| traceability_matrix/1000 | 58318 ns/iter (± 645) | 43193 ns/iter (± 499) | 1.35 |
| query/10000 | 333406 ns/iter (± 1298) | 236806 ns/iter (± 4501) | 1.41 |
This comment was automatically generated by workflow using github-action-benchmark.
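The ratios in the alert can be reproduced by hand. The sketch below assumes the alert fires when current / previous exceeds the quoted threshold of 1.20; the numbers are copied from the benchmark comment above, and the function name is illustrative rather than part of any tool.

```python
THRESHOLD = 1.20  # alert threshold quoted in the benchmark comment

# (current ns/iter, previous ns/iter) as reported for 4cc7823 vs e60a3a9.
results = {
    "traceability_matrix/1000": (58318, 43193),
    "query/10000": (333406, 236806),
}


def regressions(results: dict, threshold: float = THRESHOLD) -> dict:
    """Return {name: ratio} for benchmarks whose slowdown ratio exceeds threshold."""
    flagged = {}
    for name, (current, previous) in results.items():
        ratio = current / previous
        if ratio > threshold:
            flagged[name] = round(ratio, 2)
    return flagged


print(regressions(results))
```

Both benchmarks come out above 1.20, matching the Ratio column. Since this PR only changes a runner pool, a difference in runner hardware between the two commits is one plausible explanation, though the comment itself does not say so.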
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Closes #523 — implements the "Recommended fix (cheapest first)" option 1 from the issue body (author's stated "strongly preferred" choice).
Why
`mutants-cli` was running per-PR and per-push, pinned to the 4-runner `lean-mem` pool. The 14-day audit in #523 measured it as the single largest consumer of that class (488 instances over 14 days against 4 runners), and the operator-visible consequence was that Miri (~17 h median wait, 43% fail rate) and Verus (~18 h median wait, 94% fail rate) were starving, queueing for a runner against the per-PR mutation churn while `rust-cpu` sat 86% idle.
This PR makes the one-line runner-pool change the issue body recommends and expands the surrounding comment block so the rationale survives drift.
Acceptance criteria (from #523)
- `mutants-cli` no longer runs on `lean-mem`. `runs-on:` on `.github/workflows/ci.yml:597` now resolves to `[self-hosted, linux, x64, rust-cpu]`. Grep confirms `mutants-cli` is the only `lean-mem` user removed; Miri (line 416), nightly `mutants-core` (line 496), and the other comment-only references to `lean-mem` are untouched.
- `gh api ... jobs?status=completed`, filter by `runner_labels`) should show the `lean-mem` median fall back from ~64 min to single-digit minutes within a day or two.
Why this is a draft
Same reason as #525 (and as carried across the recent triage threads): the hard triage rule requires consulting https://pulseengine.eu/blog/ before opening a PR, and the blog has been HTTP 503 throughout this run. The fix itself matches the author's "strongly preferred" option in the issue body verbatim, so the draft state is purely about clearing the workflow-guidance hard rule once the blog is reachable.
The change is small, comment-heavy, and reversible:

     mutants-cli:
       name: Mutation Testing (rivet-cli)
       needs: [test]
    -  runs-on: [self-hosted, linux, x64, lean-mem]
    +  runs-on: [self-hosted, linux, x64, rust-cpu]

(plus a 6-line comment block above documenting why this pool, so drift doesn't quietly re-pin.)
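Beyond the comment block, drift could also be caught mechanically. The following is a hypothetical guard, not something this PR adds: a line-based scan (deliberately not a full YAML parse) that assumes the job key sits on its own line and `runs-on:` is a single-line value indented deeper than the job key. The function name and the sample text are illustrative.

```python
from typing import Optional


def runs_on_for_job(workflow_text: str, job_name: str) -> Optional[str]:
    """Return the raw `runs-on:` value for `job_name`, or None if not found."""
    job_indent = None
    for line in workflow_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        indent = len(line) - len(line.lstrip())
        if job_indent is None:
            if stripped == f"{job_name}:":
                job_indent = indent  # entered the job block
            continue
        if indent <= job_indent:
            return None  # left the job block without seeing runs-on
        if stripped.startswith("runs-on:"):
            return stripped[len("runs-on:"):].strip()
    return None


# Illustrative excerpt mirroring the diff above.
sample = """\
jobs:
  mutants-cli:
    name: Mutation Testing (rivet-cli)
    needs: [test]
    runs-on: [self-hosted, linux, x64, rust-cpu]
"""

value = runs_on_for_job(sample, "mutants-cli")
if value is None or "lean-mem" in value:
    raise SystemExit("mutants-cli is missing or pinned to lean-mem")
print(value)
```

A check like this would fail loudly if the job were re-pinned, whereas a comment relies on the next editor reading it.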
Generated by Claude Code — issue-triage agent run 2026-06-10.