ci: move mutants-cli off lean-mem to rust-cpu (#523) #526

Draft
avrabe wants to merge 1 commit into main from ci/issue-523-mutants-cli-runner-pool
@avrabe avrabe commented Jun 10, 2026

Closes #523. This implements option 1 under "Recommended fix (cheapest first)" in the issue body, which the author marks as "strongly preferred".

Why

mutants-cli ran on every PR and every push, pinned to the 4-runner lean-mem pool. The 14-day audit in #523 measured it as the single largest consumer of that class (488 instances over 14 days against 4 runners). The operator-visible consequence was that Miri (~17 h median wait, 43% fail rate) and Verus (~18 h median wait, 94% fail rate) were starving: they queued for a runner behind the per-PR mutation churn while rust-cpu sat 86% idle.

This PR makes the one-line runner-pool change the issue body recommends and expands the surrounding comment block so the rationale survives drift.

Acceptance criteria (from #523)

  • mutants-cli no longer runs on lean-mem. The runs-on: key at .github/workflows/ci.yml:597 now resolves to [self-hosted, linux, x64, rust-cpu]. Grep confirms mutants-cli is the only lean-mem user removed; Miri (line 416), nightly mutants-core (line 496), and the other comment-only references to lean-mem are untouched.
  • lean-mem median job wait drops back under a few minutes; Miri/Verus stop hitting multi-hour queues. This is environmental and can only be verified by operator observation of the runner pool after this lands. Once merged, the same query that produced the audit table in #523 (gh api ... jobs?status=completed, filter by runner_labels) should show the lean-mem median fall from ~64 min to single-digit minutes within a day or two.
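For the post-merge check, the per-pool median wait can be recomputed from job records along these lines. This is a minimal sketch, not the script that produced the audit in #523: it assumes each job record is a plain dict with `labels`, `created_at`, and `started_at` fields (the key names are assumptions and may need adjusting to whatever the `gh api` query returns), and the sample records are made up for illustration.

```python
from datetime import datetime
from statistics import median


def parse_ts(value):
    # Accept ISO 8601 timestamps with a trailing "Z", e.g. "2026-06-01T10:00:00Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def median_wait_minutes(jobs, pool_label, labels_key="labels"):
    """Median minutes between created_at and started_at for jobs on pool_label.

    Returns None when no job on that pool has both timestamps.
    """
    waits = []
    for job in jobs:
        if pool_label not in (job.get(labels_key) or []):
            continue
        created, started = job.get("created_at"), job.get("started_at")
        if not created or not started:
            continue  # job never picked up by a runner; skip it
        delta = parse_ts(started) - parse_ts(created)
        waits.append(delta.total_seconds() / 60)
    return median(waits) if waits else None


# Hypothetical sample records, not data from the audit.
LEAN = ["self-hosted", "linux", "x64", "lean-mem"]
CPU = ["self-hosted", "linux", "x64", "rust-cpu"]
SAMPLE_JOBS = [
    {"labels": LEAN, "created_at": "2026-06-01T10:00:00Z", "started_at": "2026-06-01T10:02:00Z"},
    {"labels": LEAN, "created_at": "2026-06-01T10:00:00Z", "started_at": "2026-06-01T10:04:00Z"},
    {"labels": LEAN, "created_at": "2026-06-01T10:00:00Z", "started_at": "2026-06-01T11:00:00Z"},
    {"labels": CPU, "created_at": "2026-06-01T10:00:00Z", "started_at": "2026-06-01T10:01:00Z"},
    {"labels": LEAN, "created_at": "2026-06-01T10:00:00Z", "started_at": None},
]

if __name__ == "__main__":
    print("lean-mem median wait (min):", median_wait_minutes(SAMPLE_JOBS, "lean-mem"))
    print("rust-cpu median wait (min):", median_wait_minutes(SAMPLE_JOBS, "rust-cpu"))
```

Using the median rather than the mean keeps a single multi-hour outlier from dominating the figure, which matches how the acceptance criterion is phrased.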

Why this is a draft

Same reason as #525 (and as carried across the recent triage threads): the hard triage rule requires consulting https://pulseengine.eu/blog/ before opening a PR, and the blog has returned HTTP 503 throughout this run. The fix itself matches the author's "strongly preferred" option in the issue body verbatim, so the draft state exists only to satisfy the workflow-guidance hard rule once the blog is reachable.

The change is small, comment-heavy, and reversible:

   mutants-cli:
     name: Mutation Testing (rivet-cli)
     needs: [test]
-    runs-on: [self-hosted, linux, x64, lean-mem]
+    runs-on: [self-hosted, linux, x64, rust-cpu]

(plus a 6-line comment block above documenting why this pool, so drift doesn't quietly re-pin.)
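Beyond the comment block, drift could also be caught mechanically. The sketch below lists which jobs in a workflow file mention a given pool label on their `runs-on:` line. It is a deliberately naive line scanner, not a YAML parser: it assumes job keys sit at two-space indentation under `jobs:`, that `runs-on:` sits at four spaces on a single line, and that `#` only ever starts a comment. The embedded workflow text is an abridged, hypothetical stand-in for `.github/workflows/ci.yml`.

```python
import re


def jobs_on_pool(workflow_text, pool_label):
    """Return the names of jobs whose runs-on line mentions pool_label."""
    in_jobs = False
    current_job = None
    hits = []
    for raw in workflow_text.splitlines():
        # Drop comments so comment-only mentions of a pool are ignored.
        line = raw.split("#", 1)[0].rstrip()
        if not line.strip():
            continue
        if re.match(r"^jobs:\s*$", line):
            in_jobs = True
            continue
        if in_jobs and re.match(r"^\S", line):
            in_jobs = False  # a new top-level key ends the jobs mapping
        if not in_jobs:
            continue
        job = re.match(r"^ {2}([A-Za-z0-9_-]+):\s*$", line)
        if job:
            current_job = job.group(1)
            continue
        runs_on = re.match(r"^ {4}runs-on:\s*(.+)$", line)
        if runs_on and current_job:
            labels = re.findall(r"[A-Za-z0-9_.-]+", runs_on.group(1))
            if pool_label in labels:
                hits.append(current_job)
    return hits


# Abridged, hypothetical workflow used only to exercise the scanner.
SAMPLE_WORKFLOW = """\
name: CI
jobs:
  miri:
    runs-on: [self-hosted, linux, x64, lean-mem]
  mutants-cli:
    # Keep this job off lean-mem; see #523.
    name: Mutation Testing (rivet-cli)
    needs: [test]
    runs-on: [self-hosted, linux, x64, rust-cpu]
"""

if __name__ == "__main__":
    print("lean-mem jobs:", jobs_on_pool(SAMPLE_WORKFLOW, "lean-mem"))
    print("rust-cpu jobs:", jobs_on_pool(SAMPLE_WORKFLOW, "rust-cpu"))
```

A check like this could fail CI when `mutants-cli` reappears in the lean-mem list, though a real YAML parser would be more robust than line matching.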

Related


Generated by Claude Code — issue-triage agent run 2026-06-10.



`mutants-cli` (`Mutation Testing (rivet-cli)`) was running on every PR and
push pinned to the 4-runner `lean-mem` pool — the one runner class with no
spare capacity. A 14-day audit of the self-hosted fleet showed it as the
single largest consumer of that pool (488 instances), and as a direct
consequence Miri (~17 h median wait, 43% fail rate) and Verus (~18 h
median wait, 94% fail rate) were starving against `cancel-in-progress`
PR-push churn while `rust-cpu` sat 86% idle.

The fix is a one-line runner-pool change: `rivet-cli` is the small crate
running `--jobs 2` with `--timeout 30`; the `rust-cpu` class (16 G
`MemoryHigh`, 7 runners) handles it without contention. Per-PR mutation
coverage is preserved, no cadence change is needed, and `lean-mem` is
freed up for the genuinely RAM-bound gating jobs (Miri, Verus) plus the
nightly `mutants-core` fan-out.

Also extends the surrounding comment block to document why this pool
choice matters so future drift doesn't quietly re-pin to `lean-mem`.

The post-merge bullet of the issue's Acceptance ("lean-mem median job
wait drops back under a few minutes") can only be confirmed by operator
observation against the runner pool after this lands; the in-repo bullet
("mutants-cli no longer runs on lean-mem") is the diff itself.

Note: the pulseengine.eu/blog/ workflow guidance was HTTP 503 throughout
this triage run (same symptom carried across #420 / #516 / #522 / …), so
this PR ships as a draft for maintainer review against the authoritative
process posts once the blog is reachable.

Refs: #523, #509

@github-actions github-actions Bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: 4cc7823 | Previous: e60a3a9 | Ratio |
|---|---|---|---|
| traceability_matrix/1000 | 58318 ns/iter (± 645) | 43193 ns/iter (± 499) | 1.35 |
| query/10000 | 333406 ns/iter (± 1298) | 236806 ns/iter (± 4501) | 1.41 |

This comment was automatically generated by workflow using github-action-benchmark.
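The ratios in the alert can be reproduced from the reported timings. The sketch below assumes the ratio is current time divided by previous time and that a benchmark is flagged when the ratio is strictly greater than the threshold; the exact comparison github-action-benchmark applies is not shown in this thread, so treat both as assumptions.

```python
def regressions(results, threshold=1.20):
    """Return (name, ratio) for benchmarks whose current/previous ratio exceeds threshold."""
    flagged = []
    for name, current_ns, previous_ns in results:
        ratio = current_ns / previous_ns  # larger means slower than before
        if ratio > threshold:
            flagged.append((name, round(ratio, 2)))
    return flagged


# Timings in ns/iter, copied from the alert above.
RESULTS = [
    ("traceability_matrix/1000", 58318, 43193),
    ("query/10000", 333406, 236806),
]

if __name__ == "__main__":
    for name, ratio in regressions(RESULTS):
        print(f"{name}: {ratio}")
```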

codecov Bot commented Jun 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Development

Successfully merging this pull request may close these issues.

ci: mutants-cli runs per-PR on the scarce lean-mem pool, starving Miri/Verus (14-day audit)
