Skip to content

fix(ci): flag only event warnings still firing after the settle, not transients#1677

Closed
devantler wants to merge 1 commit into
mainfrom
fix/event-warnings-gate-transient
Closed

fix(ci): flag only event warnings still firing after the settle, not transients#1677
devantler wants to merge 1 commit into
mainfrom
fix/event-warnings-gate-transient

Conversation

@devantler
Copy link
Copy Markdown
Contributor

Root cause

The check-event-warnings deploy gate is what failed the recent #1659 reconcile:

New Warning events since 2026-06-01T21:09:00Z (4)
  [flux-system] Pod/helm-controller-…      Unhealthy: Readiness probe failed: …:9440/readyz: connection refused
  [flux-system] Pod/kustomize-controller-… Unhealthy: Readiness probe failed: …:9440/readyz: connection refused
  [flux-system] Kustomization/infrastructure-controllers  HealthCheckFailed: … context canceled
  [flux-system] Pod/notification-controller-…  Unhealthy: Readiness probe failed: …:9440/readyz: connection refused

The action's docstring says it ignores transient one-shot warnings. But the implementation records the marker before the settle and flags any Warning with lastTimestamp >= marker. So a one-shot warning that fires during the settle window and then stops is flagged anyway — it only actually ignores transients that fired before the deploy started.

That false-positives on any deploy that rolls a workload during the settle. Concretely: applying the Flux-controller topologySpreadConstraints (#1659) — or any controller config change — restarts the controllers. While a controller restarts, its /readyz returns connection refused for a few seconds (leader-election handoff), and kustomize-controller restarting cancels the in-flight infrastructure-controllers health check (context canceled). All four events are transient and self-heal within seconds — the controllers are 1/1 Running immediately after — yet the gate failed the deploy.

Fix

Record the marker after the settle, then observe for a short window (new observe-seconds input, default 60):

settle (90s, grace — rollouts converge)  →  marker  →  observe (60s)  →  sample

Only warnings whose most-recent occurrence is at/after the post-settle marker are flagged — i.e. still firing once the cluster has settled, which is the documented "steady state" intent. Persistent problems (crash loops, FailedAttachVolume, missing PriorityClass — all of which fire continuously) keep firing into the observe window and are still caught; one-shot rollout blips that drain during the settle are not.

This is a correctness fix, not a loosening: it makes the gate do what its docstring already claims.

Validation

  • action.yaml parses; embedded bash passes bash -n.
  • Fed the gate's jq filter the exact 21:09:17 readiness-blip event (before a post-settle marker) plus a synthetic BackOff ×42 crash-loop (after it): only the crash-loop is flagged. ✅
  • No caller changes needed — all three call sites (ci.yaml system-test + merge_group, cd.yaml) use defaults; observe-seconds defaults to 60.

Relationship to other PRs

This unblocks #1659 (Flux controller spread) and any future controller-config deploy. Note #1661 ("require workloads to spread across nodes") merged recently — worth confirming whether #1659 is still needed or is now redundant with that policy.

…transients

The check-event-warnings gate records its marker BEFORE the settle, then
flags any Warning whose lastTimestamp is at/after the marker. So a one-shot
warning that fires DURING the settle window and then stops is flagged —
contradicting the action's own docstring ("transient one-shot warnings ...
are ignored"). In practice it only ignores transients that fired before the
marker, i.e. before the deploy even started.

This false-positives on any deploy that rolls a workload during the settle.
Concretely: applying the Flux-controller topologySpread (#1659) restarts the
controllers; their /readyz returns "connection refused" for a few seconds
during the restart, and kustomize-controller restarting cancels the
infrastructure-controllers health check. All transient and self-healed — yet
the gate failed the deploy on those four events.

Fix: record the marker AFTER the settle, then observe for a short window
(new observe-seconds input, default 60s). Only warnings still firing
at/after the post-settle marker are flagged — the documented "still firing
at steady state" intent. Persistent problems (crash loops, FailedAttachVolume,
missing PriorityClass) keep firing into the observe window and are still
caught; one-shot rollout blips that drain during the settle are not.

Verified: the jq filter, fed the exact 21:09:17 readiness-blip event plus a
synthetic BackOff crash-loop, now flags only the crash-loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 1, 2026 21:29
@github-project-automation github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request fixes the check-event-warnings composite action so it only fails deployments when the cluster continues emitting Warning events after the settle period (steady state), rather than flagging transient warnings that occur during rollout convergence.

Changes:

  • Moves the warning “marker” timestamp to after the settle sleep, aligning behavior with the documented “steady state” intent.
  • Adds an observe-seconds input (default 60) to create a short post-settle observation window before sampling events.
  • Updates action documentation/messages to reflect the new settle → marker → observe → sample flow.

@devantler
Copy link
Copy Markdown
Contributor Author

Superseded by #1678: rather than fix the event-warnings gate's marker timing, we're removing the gate entirely and relying on Flux reconcile status (ksail workload reconcile already waits per-Kustomization; all Kustomizations are wait: true so Ready encodes Flux's health checks). Per maintainer preference for trusting Flux/ksail over bespoke CI heuristics.

@devantler devantler closed this Jun 1, 2026
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

2 participants