
ci: cap cargo build parallelism to fix cachekit-lean MSRV OOM #26

Merged
27Bslash6 merged 1 commit into main from ci/fix-lean-runner-oom on Jun 17, 2026

Conversation

27Bslash6 (Contributor) commented Jun 6, 2026

Summary

Fixes the 6+ day red 1.85 (MSRV) CI job on main. It is not a code or MSRV bug — the crate compiles and tests clean on rustc 1.85.1 locally, and the committed Cargo.lock builds on 1.85.

The runner is being OOM-killed. The cachekit-lean ARC pod has a hard 6Gi memory cgroup (lab ADR-0001) and no build cache, so the 1.85 job runs a fully cold cargo test build of the 246-crate async/TLS/crypto graph. cargo defaults -j to the visible core count (~24 via the pod CPU limit), and at that fan-out the cold build's peak RSS — dominated by full test-profile debuginfo at link time — exceeds 6Gi. The kernel OOM-kills the linker, the runner "loses communication with the server," and the job fails at the ~10-minute heartbeat reckoning with no logs uploaded (the BlobNotFound symptom).

stable/beta pass because their preceding clippy step pre-builds the dependency graph, lowering the test step's peak. The 1.85 job skips clippy, concentrating the whole cold build into one step.
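
For orientation, the two limits named above would typically appear on the runner container roughly as follows. This is a hypothetical sketch of a Kubernetes `resources` block: the real ARC configuration lives outside this repo and is not part of this PR, so only the two values (about 24 CPUs, 6Gi memory) come from the description.

```yaml
# Hypothetical runner container resources; layout is an assumption,
# only the two values are taken from the PR description.
resources:
  limits:
    cpu: "24"      # cargo's default -j follows the CPU count visible under this limit
    memory: 6Gi    # hard cgroup ceiling; exceeding it triggers the kernel OOM killer
```

With a generous CPU limit and a tight memory limit, the default parallelism scales with the first number while the budget is set by the second, which is the mismatch described above.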

Evidence

  • Job annotation (survives the lost log blob): "The self-hosted runner lost communication with the server… terminates the runner process, starves it for CPU/Memory, or blocks its network access."
  • Step timeline: Run tests frozen in_progress, no Complete job, job failed at a clean ~10-min mark → killed mid-step, not a test failure.
  • Empirical repro: a cold cargo test --no-run -j24 inside a 6Gi no-swap cgroup is OOM-killed during the cachekit-rs test-binary link —
    kernel: Memory cgroup out of memory: Killed process … (ld) / scope: Failed with result 'oom-kill'.
  • Non-deterministic across re-runs (passed once, failed twice) — the signature of a resource-edge, not a deterministic code bug.

Changes

  • ci.yml: CARGO_BUILD_JOBS=4 caps concurrent compile/link jobs so peak RSS stays well under 6Gi on the cache-less lean pod (the ~6× reduction in fan-out, from ~24 to 4, is the fix).
  • ci.yml: timeout-minutes on test (20) and wasm (15) so a wedged runner fails fast instead of hanging to the heartbeat timeout.
  • release.yml: same CARGO_BUILD_JOBS cap on the publish job (also runs on cachekit-lean), plus a 1.85 MSRV compile check before publish so a broken floor can't ship silently. Release previously built stable only, which is how 0.3.0 published while the MSRV job was red.
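
A minimal sketch of what the ci.yml side of these changes might look like. It is not the actual diff: the matrix shape, the runner label on each job, and the step commands are assumptions for illustration. Only `CARGO_BUILD_JOBS: "4"` and the two `timeout-minutes` values come from this PR.

```yaml
# .github/workflows/ci.yml (illustrative excerpt, not the real diff)
env:
  # Cargo reads CARGO_BUILD_JOBS as the default for -j (build.jobs).
  CARGO_BUILD_JOBS: "4"

jobs:
  test:
    runs-on: cachekit-lean              # assumed runner label
    timeout-minutes: 20                 # fail fast instead of waiting out the heartbeat
    strategy:
      matrix:
        toolchain: [stable, beta, "1.85"]   # assumed matrix shape
    steps:
      - uses: actions/checkout@v4
      - run: rustup toolchain install ${{ matrix.toolchain }} --profile minimal
      - run: cargo +${{ matrix.toolchain }} test

  wasm:
    runs-on: cachekit-lean              # assumed
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - run: rustup target add wasm32-unknown-unknown          # assumed target
      - run: cargo build --target wasm32-unknown-unknown       # assumed command
```

Setting the variable once at workflow level means every cargo invocation in every job inherits the cap, so no individual step needs an explicit `-j4`.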

MSRV floor stays at 1.85 (deliberate, edition2024; not bumped). Lowering the -j value, not raising the toolchain floor, is the fix.

Validation

The 1.85 job on this PR runs at -j4 on cachekit-lean — green here proves the fix in the real constrained environment. If it proves marginal at the link spike, the CARGO_BUILD_JOBS value will be lowered.

Note: the broken ARC log persistence (BlobNotFound) is a separate runner-side observability defect tracked outside this repo; preventing the OOM is what makes the job pass.

Closes #25

Summary by CodeRabbit

Release Notes

  • Chores
    • Improved build stability and release pipeline reliability through enhanced resource management and validation checks.

coderabbitai Bot commented Jun 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ebf845e-51d9-41c4-b441-0b893e72ae77

📥 Commits

Reviewing files that changed from the base of the PR and between f381e41 and 71edf8f.

📒 Files selected for processing (2)
  • .github/workflows/ci.yml
  • .github/workflows/release.yml

Walkthrough

Adds CARGO_BUILD_JOBS: "4" and timeout-minutes guards to the test, wasm, and publish jobs across ci.yml and release.yml to cap Rust build parallelism and prevent OOM/hang scenarios. The release workflow gains an explicit MSRV gate that installs Rust 1.85 and runs cargo check before publishing.

Changes

CI and Release Hardening

| Layer / File(s) | Summary |
| --- | --- |
| Build parallelism caps and job timeouts (.github/workflows/ci.yml, .github/workflows/release.yml) | Adds CARGO_BUILD_JOBS: "4" to the global env in ci.yml and to the publish job env in release.yml. Sets timeout-minutes: 20 on the test job, timeout-minutes: 15 on the wasm job, and timeout-minutes: 30 on the publish job. |
| MSRV verification gate in release workflow (.github/workflows/release.yml) | Adds a pre-publish step that installs Rust 1.85 via rustup and runs cargo +1.85 check --all-targets with the full feature set before any crate is published. |
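
As a rough illustration of the release-side changes summarized above, the publish job might look something like the excerpt below. This is a sketch under assumptions, not the merged workflow: the step names, the use of `--all-features` for "the full feature set", the publish command, and the secret name are all hypothetical.

```yaml
# .github/workflows/release.yml (illustrative excerpt, not the real diff)
jobs:
  publish:
    runs-on: cachekit-lean
    timeout-minutes: 30
    env:
      CARGO_BUILD_JOBS: "4"             # same cap as ci.yml
    steps:
      - uses: actions/checkout@v4
      - name: Verify MSRV (1.85) before publishing   # hypothetical step name
        run: |
          rustup toolchain install 1.85 --profile minimal
          # --all-features is an assumption for "the full feature set"
          cargo +1.85 check --all-targets --all-features
      - name: Publish                                 # hypothetical step
        run: cargo publish
        env:
          CARGO_REGISTRY_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}   # assumed secret name
```

Because the check runs as an earlier step in the same job, a failure on 1.85 stops the job before the publish step is reached.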

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: capping cargo build parallelism to fix MSRV OOM issues on the cachekit-lean runner. |
| Linked Issues check | ✅ Passed | The PR successfully addresses the core objectives from issue #25: capping build parallelism fixes the OOM, adding timeouts prevents hangs, and the MSRV gate prevents silent regressions. |
| Out of Scope Changes check | ✅ Passed | All changes directly target the documented problem: parallelism caps, job timeouts, and MSRV verification are all within scope of fixing the red CI issue. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


The 1.85 (MSRV) matrix job has been red on main for 6+ days. This is not
a code or MSRV regression: the crate compiles and tests clean on rustc
1.85.1 locally, and the committed Cargo.lock builds on 1.85.

Root cause is an out-of-memory kill on the self-hosted runner. The
cachekit-lean ARC pod has a hard 6Gi memory cgroup (lab ADR-0001) and no
build cache, so the 1.85 job runs a fully cold `cargo test` build of the
246-crate async/TLS/crypto dependency graph. cargo defaults -j to the
visible core count (~24 via the pod CPU limit), and at that fan-out the
cold build's peak RSS — dominated by full test-profile debuginfo at link
time — exceeds 6Gi. The kernel OOM-kills the linker, the runner loses
communication with the server, and the job fails at the ~10-minute
heartbeat reckoning with no logs uploaded (BlobNotFound). stable/beta
pass because the preceding clippy step pre-builds the dependency graph,
lowering the test step's peak; the 1.85 job skips clippy and concentrates
the whole cold build into one step.

Confirmed empirically: a cold `cargo test --no-run -j24` inside a 6Gi
no-swap cgroup is OOM-killed during the cachekit-rs test-binary link
(kernel: "Memory cgroup out of memory: Killed process ... (ld)").

Changes:
- ci.yml: CARGO_BUILD_JOBS=4 caps concurrent compile/link jobs so peak
  RSS stays well under 6Gi on the cache-less lean pod.
- ci.yml: timeout-minutes on the test (20) and wasm (15) jobs so a
  wedged runner fails fast instead of hanging to the heartbeat timeout.
- release.yml: same CARGO_BUILD_JOBS cap on the publish job (also runs on
  cachekit-lean), plus a 1.85 MSRV compile check before publish so a
  broken floor cannot ship silently — release previously built stable
  only, which is how 0.3.0 published while the MSRV job was red.

Refs #25

27Bslash6 force-pushed the ci/fix-lean-runner-oom branch from 2be9535 to 71edf8f on June 17, 2026 10:09
27Bslash6 merged commit e0da302 into main on Jun 17, 2026
5 checks passed
27Bslash6 deleted the ci/fix-lean-runner-oom branch on June 17, 2026 10:30
Development

Successfully merging this pull request may close these issues.

CI red on main: 1.85 (MSRV) job fails reproducibly — code passes locally, ARC runner logs not persisted
