ci: cap cargo build parallelism to fix cachekit-lean MSRV OOM #26
Merged
The 1.85 (MSRV) matrix job has been red on main for 6+ days. This is not a code or MSRV regression: the crate compiles and tests clean on rustc 1.85.1 locally, and the committed Cargo.lock builds on 1.85.

Root cause is an out-of-memory kill on the self-hosted runner. The cachekit-lean ARC pod has a hard 6Gi memory cgroup (lab ADR-0001) and no build cache, so the 1.85 job runs a fully cold `cargo test` build of the 246-crate async/TLS/crypto dependency graph. cargo defaults -j to the visible core count (~24 via the pod CPU limit), and at that fan-out the cold build's peak RSS, dominated by full test-profile debuginfo at link time, exceeds 6Gi. The kernel OOM-kills the linker, the runner loses communication with the server, and the job fails at the ~10-minute heartbeat reckoning with no logs uploaded (BlobNotFound).

stable/beta pass because the preceding clippy step pre-builds the dependency graph, lowering the test step's peak; the 1.85 job skips clippy and concentrates the whole cold build into one step.

Confirmed empirically: a cold `cargo test --no-run -j24` inside a 6Gi no-swap cgroup is OOM-killed during the cachekit-rs test-binary link (kernel: "Memory cgroup out of memory: Killed process ... (ld)").

Changes:
- ci.yml: CARGO_BUILD_JOBS=4 caps concurrent compile/link jobs so peak RSS stays well under 6Gi on the cache-less lean pod.
- ci.yml: timeout-minutes on the test (20) and wasm (15) jobs so a wedged runner fails fast instead of hanging to the heartbeat timeout.
- release.yml: same CARGO_BUILD_JOBS cap on the publish job (also runs on cachekit-lean), plus a 1.85 MSRV compile check before publish so a broken floor cannot ship silently. release previously built stable only, which is how 0.3.0 published while the MSRV job was red.

Refs #25
2be9535 to
71edf8f
Compare
Summary

Fixes the 6+ day red `1.85` (MSRV) CI job on `main`. It is not a code or MSRV bug: the crate compiles and tests clean on rustc 1.85.1 locally, and the committed `Cargo.lock` builds on 1.85.

The runner is being OOM-killed. The `cachekit-lean` ARC pod has a hard 6Gi memory cgroup (lab ADR-0001) and no build cache, so the `1.85` job runs a fully cold `cargo test` build of the 246-crate async/TLS/crypto graph. `cargo` defaults `-j` to the visible core count (~24 via the pod CPU limit), and at that fan-out the cold build's peak RSS, dominated by full test-profile debuginfo at link time, exceeds 6Gi. The kernel OOM-kills the linker, the runner "loses communication with the server," and the job fails at the ~10-minute heartbeat timeout with no logs uploaded (the `BlobNotFound` symptom).

`stable`/`beta` pass because their preceding clippy step pre-builds the dependency graph, lowering the `test` step's peak. The `1.85` job skips clippy, concentrating the whole cold build into one step.

Evidence

- `Run tests` frozen `in_progress`, no `Complete job`, job failed at a clean ~10-min mark → killed mid-step, not a test failure.
- `cargo test --no-run -j24` inside a 6Gi no-swap cgroup is OOM-killed during the `cachekit-rs` test-binary link: `kernel: Memory cgroup out of memory: Killed process … (ld)` / `scope: Failed with result 'oom-kill'`.

Changes

- `ci.yml`: `CARGO_BUILD_JOBS=4` caps concurrent compile/link jobs so peak RSS stays well under 6Gi on the cache-less lean pod (the ~6× reduction in the fan-out is the fix).
- `ci.yml`: `timeout-minutes` on `test` (20) and `wasm` (15) so a wedged runner fails fast instead of hanging to the heartbeat timeout.
- `release.yml`: same `CARGO_BUILD_JOBS` cap on the `publish` job (also runs on `cachekit-lean`), plus a 1.85 MSRV compile check before publish so a broken floor can't ship silently. `release` previously built `stable` only, which is how 0.3.0 published while the MSRV job was red.

MSRV floor stays at 1.85 (deliberate, edition2024; not bumped: changing the `-j` value, not the toolchain, is the fix).

Validation

The `1.85` job on this PR runs at `-j4` on `cachekit-lean`; green here proves the fix in the real constrained environment. If it proves marginal at the link spike, the `CARGO_BUILD_JOBS` value will be lowered.

Closes #25
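
For readers without the diff open, the workflow changes listed under Changes might look roughly like the fragments below. This is a minimal sketch, not the PR's actual YAML: the placement of `env` at workflow level, the step layout, the checkout action version, the `rustup`/`cargo +1.85 check --locked` commands, and the use of `cachekit-lean` as the `runs-on` label are all assumptions made for illustration. Only `CARGO_BUILD_JOBS=4`, the two `timeout-minutes` values, the job names, and the existence of an MSRV compile check before publish come from the description above.

```yaml
# --- ci.yml (illustrative fragment; layout is assumed) ---
env:
  # Cargo reads CARGO_BUILD_JOBS as the default for -j (build.jobs).
  CARGO_BUILD_JOBS: "4"

jobs:
  test:
    runs-on: cachekit-lean          # assumed runner label
    timeout-minutes: 20             # fail fast instead of waiting on the heartbeat
    strategy:
      matrix:
        # Quoted so YAML does not parse the version as a float.
        toolchain: ["stable", "beta", "1.85"]
    steps:
      - uses: actions/checkout@v4   # assumed action version
      - run: rustup toolchain install ${{ matrix.toolchain }} --profile minimal
      - run: cargo +${{ matrix.toolchain }} test

  wasm:
    runs-on: cachekit-lean          # assumed runner label
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      # wasm build steps omitted; they are not described in the PR text

---
# --- release.yml (illustrative fragment; layout is assumed) ---
jobs:
  publish:
    runs-on: cachekit-lean          # assumed runner label
    env:
      CARGO_BUILD_JOBS: "4"
    steps:
      - uses: actions/checkout@v4
      # MSRV compile check: a broken 1.85 floor stops the release here.
      - run: rustup toolchain install 1.85 --profile minimal
      - run: cargo +1.85 check --locked
      # the existing publish step would follow
```

Setting the cap through the environment rather than passing `-j4` to each command means every `cargo` invocation in the job inherits it, so lowering the value later is a one-line change.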