fix(validate-go): bound govulncheck memory so it stops OOM-killing the runner#270

Merged
devantler merged 1 commit into main from claude/govulncheck-oom-resilience
Jun 1, 2026

Conversation

@devantler
Contributor

🤖 Generated by the Daily AI Assistant

Fixes #269

Problem

The 🛡️ Vulnerability Scan job (govulncheck ./..., added in #266) kills the runner mid-scan on large Go modules. govulncheck's default -scan symbol mode builds a whole-program call graph whose peak memory grows with the dependency graph. On a k8s-scale module (ksail imports Kubernetes/Flux/Talos/Omni clients) the scan exhausts the hosted runner's 16 GiB and the host terminates the runner: exit 143, "The runner has received a shutdown signal", "The operation was canceled", and no govulncheck output. The job is an effective gate (org code_quality ruleset on consumers), so it blocks the PR, and re-running just reproduces the failure.

Two consecutive ksail#4982 runs both died ~2 min into the scan with zero output — see #269 for the run-by-run evidence. This regression (mine, from #266 ~5h ago) blocks every large-repo Go PR, not just ksail.

Change

     permissions:
       contents: read
+    timeout-minutes: 15
+    env:
+      GOMEMLIMIT: 12GiB
  • GOMEMLIMIT=12GiB sets a soft memory limit for the Go runtime, so the GC reclaims more aggressively as usage approaches the limit, which helps the scan stay under the host ceiling instead of being OOM-killed. A standard Go remedy for CI OOMs. (ubuntu-latest = 16 GiB; 12 GiB leaves headroom for the OS, Go toolchain subprocesses and harden-runner.)
  • timeout-minutes: 15 bounds any remaining worst case to a fast, legible failure instead of a hung runner.

Reachability semantics are unchanged — still -scan symbol, still exit-3-on-reachable-vuln only.
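
A minimal sketch of where the two added keys sit in the job, assuming a conventional job layout. Only permissions, timeout-minutes, GOMEMLIMIT, the job's display name and the if: condition come from this PR; the job id, the step list, the action versions and the install command are illustrative assumptions, not the actual validate-go-project workflow.

```yaml
# Sketch only: job id, steps and action versions are assumptions.
jobs:
  vulnerability-scan:            # hypothetical job id
    name: 🛡️ Vulnerability Scan
    if: github.repository != 'devantler-tech/reusable-workflows'
    runs-on: ubuntu-latest
    permissions:
      contents: read
    timeout-minutes: 15          # fail fast and legibly instead of hanging
    env:
      GOMEMLIMIT: 12GiB          # soft limit, read by the Go runtime of every Go binary in the job
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
      - name: Install govulncheck   # assumed install method
        run: go install golang.org/x/vuln/cmd/govulncheck@latest
      - name: Run govulncheck
        run: govulncheck ./...
```

Because the variable is set at job level, it applies to every step, including the Go toolchain subprocesses that govulncheck spawns while loading packages.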

Validation

  • actionlint: no new findings on the diff (the only report is the pre-existing code-quality: write scope, unrelated to this change).
  • GOMEMLIMIT=12GiB is a valid value: the Go runtime accepts a byte count with an optional B, KiB, MiB, GiB or TiB suffix.
  • Self-verifying once promoted: this PR's own CI won't run the job (it's if: github.repository != 'devantler-tech/reusable-workflows'), so the real proof is re-running ksail#4982's Vulnerability Scan against this branch's reusable workflow once it ships. I'll verify the rollout after merge.

Trade-off / fallback (maintainer decision)

If a module's live call-graph genuinely exceeds the GOMEMLIMIT headroom, this converts the OOM into a clean timeout but the gate still can't pass. The robust fallback is switching the gate to govulncheck -scan module ./... (deterministic, low-memory; catches every known-vulnerable dependency version), at the cost of losing reachability filtering. That's a deliberate semantic change, so it's documented in #269 rather than applied here.
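
For reference, the fallback described above would amount to a one-flag change in the scan step. This is a sketch only, with a hypothetical step name; it is not applied in this PR. -scan module is an existing govulncheck scan level, but whether package patterns are accepted alongside it is an assumption to verify against the pinned govulncheck version.

```yaml
      # Hypothetical fallback step, not part of this change.
      - name: Run govulncheck (module-level fallback)
        # Assumption: module-level scanning reads go.mod in the working
        # directory, and recent govulncheck releases reject package patterns
        # in this mode, so ./... is omitted here. Check `govulncheck -h`.
        run: govulncheck -scan module
```

The module-level scan reports every known-vulnerable dependency version without building a call graph, which is why it trades reachability filtering for lower, more predictable memory use.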

…e runner

govulncheck's default symbol scan builds a whole-program call graph whose
peak memory grows with the module's dependency graph. On large modules
(e.g. ksail, which imports Kubernetes/Flux/Talos/Omni clients) the scan
exhausts the hosted runner's 16 GiB and the host kills the runner mid-scan
(exit 143 / "runner has received a shutdown signal") with no govulncheck
output — an opaque, retry-resistant gate failure that blocks every large-repo
Go PR. Re-running just reproduces it.

Cap the Go runtime heap with GOMEMLIMIT=12GiB so the GC reclaims aggressively
and stays under the host ceiling instead of OOM-killing, and add
timeout-minutes: 15 so any remaining worst case fails fast and legibly rather
than hanging. Symbol-scan reachability semantics are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 1, 2026 12:21
@github-project-automation github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 1, 2026
Contributor

Copilot AI left a comment

Pull request overview

Mitigates GitHub-hosted runner OOM terminations during the govulncheck vulnerability scan in the validate-go-project reusable workflow by bounding Go runtime heap usage and limiting maximum job runtime.

Changes:

  • Add a job-level GOMEMLIMIT to cap Go heap usage during govulncheck runs.
  • Add a timeout-minutes limit to ensure the scan fails fast and visibly instead of hanging or being OOM-killed.
  • Document the rationale inline in the workflow for future maintainers.

@devantler devantler marked this pull request as ready for review June 1, 2026 13:07
@devantler devantler merged commit f3d334f into main Jun 1, 2026
41 checks passed
@devantler devantler deleted the claude/govulncheck-oom-resilience branch June 1, 2026 13:14
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 1, 2026
@botantler
Contributor

botantler Bot commented Jun 1, 2026

🎉 This PR is included in version 5.3.2 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler botantler Bot added the released an issue that has been solved in a release label Jun 1, 2026
botantler Bot pushed a commit that referenced this pull request Jun 1, 2026
…#272)

* feat(validate-go): risk-acceptance allowlist for the govulncheck gate

The hard `govulncheck ./...` gate (introduced in #266, un-OOMed in #270/v5.3.2)
is unsatisfiable for large consumers: it fails on reachable advisories that
have no upstream fix (`Fixed in: N/A`), wedging every Go PR through no fault
of the PR. Scan in JSON mode and fail only on reachable findings whose ID is
not in an optional consumer-owned `.govulncheck-allow.txt`. With no allowlist
file the behaviour is unchanged (strict).

Fixes #271

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci: re-trigger CI (transient GitHub-API 500 in delete-workflow-runs dry-run test)

The `[Test] Delete Workflow Runs - All Workflows` job hit a GitHub-API HTTP 500
("other side closed") while paginating runs in dry-run mode; its sibling
Minimal/Specific variants passed. Unrelated to this diff (validate-go jobs skip
on this repo). Empty re-trigger commit (same tree) to re-run CI - Required Checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Labels

released an issue that has been solved in a release

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

fix(validate-go): govulncheck Vulnerability Scan OOM-kills the runner on large modules

2 participants